String Comparison in Python: Is Vs. ==

For all built-in Python objects (like
strings, lists, dicts, functions,
etc.), if x is y, then x==y is also
True.

Not always. NaN is a counterexample. But usually, identity (is) implies equality (==). The converse is not true: Two distinct objects can have the same value.
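The NaN counterexample can be seen directly: a float NaN is the very same object as itself, yet compares unequal to itself.

```python
x = float('nan')
print(x is x)   # True: same object
print(x == x)   # False: NaN compares unequal to everything, including itself
```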

Also, is it generally considered better to just use '==' by default, even
when comparing int or Boolean values?

You use == when comparing values and is when comparing identities.

When comparing ints (or immutable types in general), you pretty much always want the former. There's an optimization that allows small integers to be compared with is, but don't rely on it.
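The small-integer cache can be demonstrated without relying on it. Here the values are parsed at runtime (via int()) so the compiler cannot fold them into shared constants; the caching behaviour shown is a CPython implementation detail, not a language guarantee.

```python
# Parsed at runtime, so no compile-time constant sharing
a = int("1001")
b = int("1001")
print(a == b)  # True: equal values
print(a is b)  # False in CPython: two distinct objects

# CPython caches small integers (-5 through 256), an implementation detail
c = int("7")
d = int("7")
print(c is d)  # True in CPython, but don't rely on it
```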

For boolean values, you shouldn't be doing comparisons at all. Instead of:

if x == True:
    # do something

write:

if x:
    # do something

For comparing against None, is None is preferred over == None.
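A minimal illustration of why is None is the reliable test: == can be overridden by a class's __eq__, while is cannot. The Weird class below is a contrived example to make the point.

```python
class Weird:
    def __eq__(self, other):
        return True  # claims equality with everything


w = Weird()
print(w == None)  # True, because __eq__ lies
print(w is None)  # False: w is a real object, not None
```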

I've always liked to use 'is' because
I find it more aesthetically pleasing
and pythonic (which is how I fell into
this trap...), but I wonder if it's
intended to just be reserved for when
you care about finding two objects
with the same id.

Yes, that's exactly what it's for.

Why does comparing strings using either '==' or 'is' sometimes produce a different result?

is is identity testing; == is equality testing. What happens in your code would be emulated in the interpreter like this:

>>> a = 'pub'
>>> b = ''.join(['p', 'u', 'b'])
>>> a == b
True
>>> a is b
False

So, no wonder they're not the same, right?

In other words: a is b is the equivalent of id(a) == id(b)
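The equivalence is easy to check interactively:

```python
a = [1, 2, 3]
b = a          # a second name for the same object
c = [1, 2, 3]  # a distinct object with the same value

print(a is b, id(a) == id(b))  # True True
print(a is c, id(a) == id(c))  # False False
```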

Comparing strings using '==' and 'is'

There are two ways to check for equality in Python: == and is. == will check the value, while is will check the identity. In almost every case, if is is True, then == is also True (rare exceptions like NaN aside).

Sometimes, Python (specifically, CPython) will optimize values together so that they have the same identity. This is especially true for short strings. Python realizes that 'Hello' is the same as 'Hello' and since strings are immutable, they become the same through string interning / string pooling.
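Interning can be observed directly. This behaviour is a CPython implementation detail and may differ between versions; sys.intern makes the pooling explicit.

```python
import sys

a = 'hello'
b = 'hello'
print(a is b)  # True in CPython: identical identifier-like literals are interned

c = ''.join(['he', 'llo'])
print(a == c)  # True: same value
print(a is c)  # False: c was built at runtime and is not interned

d = sys.intern(c)
print(a is d)  # True: sys.intern returns the canonical interned copy
```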

See a related question: Python: Why does ("hello" is "hello") evaluate as True?

Is there a difference between == and is?

is will return True if two variables point to the same object (in memory), == if the objects referred to by the variables are equal.

>>> a = [1, 2, 3]
>>> b = a
>>> b is a
True
>>> b == a
True

# Make a new copy of list `a` via the slice operator,
# and assign it to variable `b`
>>> b = a[:]
>>> b is a
False
>>> b == a
True

In your case, the second test only works because Python caches small integer objects, which is an implementation detail. For larger integers, this does not work:

>>> 1000 is 10**3
False
>>> 1000 == 10**3
True

The same holds true for string literals:

>>> "a" is "a"
True
>>> "aa" is "a" * 2
True
>>> x = "a"
>>> "aa" is x * 2
False
>>> "aa" is intern(x*2)
True

Please see this question as well.

Why is it faster to compare strings that match than strings that do not?

Combining my comment and the comment by @khelwood:

TL;DR:

Analysing the bytecode for the two comparisons reveals that the two 'time' strings are loaded as the same object. An up-front identity check (at C level) is therefore the reason for the increased comparison speed.

The reason both literals refer to the same object is that, as an implementation detail, CPython interns strings which contain only 'name characters' (i.e. ASCII letters, digits, and underscores). This is what enables the identity check to succeed.


Bytecode:

import dis

In [24]: dis.dis("'time'=='time'")
1 0 LOAD_CONST 0 ('time') # <-- same object (0)
2 LOAD_CONST 0 ('time') # <-- same object (0)
4 COMPARE_OP 2 (==)
6 RETURN_VALUE

In [25]: dis.dis("'time'=='1234'")
1 0 LOAD_CONST 0 ('time') # <-- different object (0)
2 LOAD_CONST 1 ('1234') # <-- different object (1)
4 COMPARE_OP 2 (==)
6 RETURN_VALUE

Assignment Timing:

The 'speed-up' can also be seen by using assignment in the timing tests. The assignment (and comparison) of two variables to the same string is faster than the assignment (and comparison) of two variables to different strings, further supporting the hypothesis that the underlying logic performs an object comparison first. This is confirmed in the next section.

In [26]: timeit.timeit("x='time'; y='time'; x==y", number=1000000)
Out[26]: 0.0745926329982467

In [27]: timeit.timeit("x='time'; y='1234'; x==y", number=1000000)
Out[27]: 0.10328884399496019

Python source code:

As helpfully provided by @mkrieger1 and @Masklinn in their comments, the source code for unicodeobject.c performs a pointer comparison first and if True, returns immediately.

int
_PyUnicode_Equal(PyObject *str1, PyObject *str2)
{
    assert(PyUnicode_CheckExact(str1));
    assert(PyUnicode_CheckExact(str2));
    if (str1 == str2) { // <-- Here
        return 1;
    }
    if (PyUnicode_READY(str1) || PyUnicode_READY(str2)) {
        return -1;
    }
    return unicode_compare_eq(str1, str2);
}
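For illustration, that fast path can be sketched in Python. unicode_equal_sketch is a hypothetical helper mirroring the C structure, not an actual CPython function.

```python
def unicode_equal_sketch(a, b):
    # mirrors _PyUnicode_Equal's structure
    if a is b:               # pointer comparison: identical objects are equal
        return True
    if len(a) != len(b):     # different lengths can never be equal
        return False
    # the C code does a memcmp here; element-wise check for illustration
    return all(x == y for x, y in zip(a, b))


print(unicode_equal_sketch('time', 'time'))  # True
print(unicode_equal_sketch('time', '1234'))  # False
```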

Appendix:

  • Reference answer nicely illustrating how to read the disassembled bytecode output. Courtesy of @Delgan
  • Reference answer which nicely describes CPython's string interning. Courtesy of @ShadowRanger

Should I use == for string comparison?

TL;DR yes it's vulnerable! However, you should still use == for comparison because that's the best thing there is.


Whether or not the implementation of str.__eq__() is vulnerable to timing attacks is easy to verify. Let's define a handful of strings like so:

import random

# Lots of random characters from A to Z
s1 = ''.join(chr(random.randint(65, 90)) for _ in range(1000000))


s1c = s1 # This string is equal and at the same memory location
s2 = ''.join(c for c in s1) # This string is equal but not at the same memory loc
s3 = s1[:-1] + "?" # This is not equal because of a mismatch at the end
s4 = "?" + s1[1:] # This is not equal because of a mismatch at the start
s5 = s1[:-1000] # This is not equal because of mismatched lengths

To time the equality check, we can use the timeit module.

import timeit

t1_1c = timeit.timeit('s1 == s1c', 'from __main__ import s1, s1c', number=10000)
t1_2 = timeit.timeit('s1 == s2', 'from __main__ import s1, s2', number=10000)
t1_3 = timeit.timeit('s1 == s3', 'from __main__ import s1, s3', number=10000)
t1_4 = timeit.timeit('s1 == s4', 'from __main__ import s1, s4', number=10000)
t1_5 = timeit.timeit('s1 == s5', 'from __main__ import s1, s5', number=10000)

I get the following numbers:

Variable    Value
t1_1c       0.0003349999997226405
t1_2        0.7978945999993812
t1_3        0.7638719000005949
t1_4        0.0011733000001186156
t1_5        0.0003372000001036213

Only the comparisons that must scan almost the whole string are slow: the equal copy at a different memory location (t1_2) and the mismatch at the very end (t1_3). A shared identity (t1_1c), a mismatch at the start (t1_4), or a length mismatch (t1_5) returns almost immediately. That timing difference is the side channel.

Super fast way to compare if two strings are equal

Considering how str == str is implemented in Python, it first performs an identity check and a length check, then compares the contents element by element.
This is quite fast, and understandably so, since a lot of Python code relies on it.
In the average case there is no need for further optimization, as arbitrary strings will differ quite early.

However, there are two use cases where there is some room for optimization:

  • you have some partial information on how the two inputs are going to be different.
  • you perform multiple comparisons among a certain set of elements (see @wim comments).

An example of the first situation is: if you know that when two inputs of the same size are different, they are likely different for at least m contiguous elements, then performing a comparison every k elements with k < m will be a reasonable bet, e.g.:

def is_k_equal(a, b, k=4096):
    # compare every k-th element as a cheap pre-check
    if k in {0, 1}:
        return a == b
    else:
        return a[::k] == b[::k]


def is_equal_partial(a, b, partial_match=is_k_equal):
    return len(a) == len(b) and partial_match(a, b) and a == b

An example of the second situation is: if you want to know which p inputs out of q are pairwise equal, it may be beneficial to compute a hash (for example using hash(), but other options may be equally valid) of your inputs and only perform a full comparison when the hashes match.
It goes without saying that if your hash has a high collision rate, it may just introduce additional overhead (see Wikipedia for information on hashing).
The hashes of the input could be either manually managed, or you could guard your full comparison with a hash comparison in a is_equal() function, e.g.:

def is_equal_hashing(a, b, hashing=hash):
    return len(a) == len(b) and hashing(a) == hashing(b) and a == b

provided that your hashing function is memoized. For hash() you do not need to do anything else, as it is already memoized for these inputs.
If you were to use a fancier hash (e.g. crc32, md5, etc.), you may need to add memoization yourself, e.g. with @functools.lru_cache.
The condition for this use case to benefit from the approach (ignoring the time required to compare hashes, which is usually much faster than the other operations involved) is:

t_h * n_h < t_c * n_c

with t_h the initial hash computation time, n_h the number of unique hashes to compute, t_c the comparison time, and n_c the number of full comparisons which fail near the end of the inputs.
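As a concrete sketch of the memoized-hash approach, here is crc32 wrapped with functools.lru_cache. The names crc32_cached and is_equal_crc are illustrative, not from the answer above.

```python
import functools
import zlib


@functools.lru_cache(maxsize=None)
def crc32_cached(data):
    # the checksum is computed once per distinct input; later calls hit the cache
    return zlib.crc32(data)


def is_equal_crc(a, b):
    # cheap guards first; the full comparison runs only when the checksums agree
    return len(a) == len(b) and crc32_cached(a) == crc32_cached(b) and a == b


print(is_equal_crc(b'spam' * 1000, b'spam' * 1000))             # True
print(is_equal_crc(b'spam' * 1000, b'spam' * 999 + b'spa!'))    # False
```

Note that lru_cache requires hashable arguments, so this works for bytes and str but not for bytearray.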

When in doubt on how things will perform on your input, it is typically a good idea to measure / profile your code.

Care must be taken when timing memoized functions (like hash()), because, if you are interested in the performance of the non-memoized path, you cannot rely on timings of multiple repeated calls of the same input as it is typically done, for example with IPython's %timeit using default parameters. Instead, you may use %timeit -n1 -r1 for un-cached timings. The results would only be useful for order of magnitude estimates.

To give you some ideas on how fast the possible ingredients of your approach are, here are some micro-benchmarks:

import hashlib
import functools


def md5(data):
    return hashlib.md5(data).digest()


@functools.lru_cache(maxsize=16384)
def sha1(data):
    return hashlib.sha1(data).digest()


def sha256(data):
    return hashlib.sha256(data).digest()


def sha512(data):
    return hashlib.sha512(data).digest()
import numpy as np
import numba as nb


@nb.jit(fastmath=True)
def hash_sum_nb(data, init=0):
    # sum the input as 64-bit words, adding any trailing bytes individually
    dtype = np.uint64
    nbytes = 8
    n = len(data)
    offset = n % nbytes
    result = init
    if offset:
        body = np.frombuffer(data[:-offset], dtype=dtype)
        tail = np.frombuffer(data[-offset:], dtype=np.uint8)
        for x in tail:
            result += x
    else:
        body = np.frombuffer(data, dtype=dtype)
    for x in body:
        result += x
    return result + n
import zlib
import string
import random


n = 1_000_000
s = b''.join(string.printable[random.randrange(len(string.printable))].encode() for _ in range(n))
funcs = hash, hash, zlib.crc32, zlib.adler32, md5, sha1, sha1, sha256, sha512, hash_sum_nb
for func in funcs:
    result = %timeit -n1 -r1 -q -o func(s)
    print(f'{func.__name__:>12s} {result.best * 1e6:.3f} µs')
# hash 586.401 µs
# hash 0.853 µs
# crc32 976.128 µs
# adler32 468.452 µs
# md5 1790.659 µs
# sha1 1362.948 µs
# sha1 1.423 µs
# sha256 1347.432 µs
# sha512 1321.981 µs
# hash_sum_nb 64.768 µs


cases = {
    'worst case': (s[:-1] + b'x', s[:-1] + b'y'),
    'best case*': (s[:-1], s[:-2]),
    'best case': (b'x' + s[:-1], b'y' + s[:-1]),
}
for name, (a, b) in cases.items():
    result = %timeit -n1 -r1 -q -o a == b
    print(f'str == str ({name:10s}) {result.best * 1e6:.3f} µs')
# str == str (worst case) 142.466 µs
# str == str (best case*) 0.856 µs
# str == str (best case ) 1.012 µs


a, b = (s[:-1] + b'x', s[:-1] + b'y')
result = %timeit -n1 -r1 -q -o is_k_equal(a, b)
print(f'{is_k_equal.__name__:>12s} {result.best * 1e6:.3f} µs')
# is_k_equal 10.037 µs

Note that both hash() and sha1() are called twice on the same input to show the effects of memoization.

With this data (or similar numbers that you could produce on your input / system), it may be possible to craft a more performant string equality comparison.
Note that throughout the answer I used bytes instead of str.
The timings for str would generally be worse for most hashing functions, because of the additional overhead required to handle the encoding, with the notable exception of hash().


Differences among string comparisons in python, c++, c# and java

Since Java does not allow operator overloading, its designers had to resort to creating a method (equals) to compare objects for true equality, and leave the == operator to perform reference comparison. This choice cannot really be justified by any other reason, as it leads to illogical code, more typing in the common case (people usually compare for value equality, not reference equality), and a steeper learning curve.

C++, with a clear distinction between a pointer and an object, is not constrained by Java's limitations, and thus allows proper value semantics for classes and intuitive forms of comparison.

How are strings compared?

From the docs:

The comparison uses lexicographical
ordering: first the first two items
are compared, and if they differ this
determines the outcome of the
comparison; if they are equal, the
next two items are compared, and so
on, until either sequence is
exhausted.

Also:

Lexicographical ordering for strings uses the Unicode code point number to order individual characters.

or on Python 2:

Lexicographical ordering for strings uses the ASCII ordering for individual characters.

As an example:

>>> 'abc' > 'bac'
False
>>> ord('a'), ord('b')
(97, 98)

The result False is returned as soon as a is found to be less than b. The further items are not compared (as you can see for the second items: b > a is True).
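One more case worth knowing: when one string is a prefix of the other, the comparison is decided by length, and the shorter sequence is the smaller one.

```python
print('abc' < 'abcd')   # True: 'abc' is a prefix of 'abcd', shorter is smaller
print('abcd' < 'abd')   # True: decided at the third character, 'c' < 'd'
```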

Be aware of lower and upper case:

>>> abc = 'abcdefghijklmnopqrstuvwxyz'
>>> [(x, ord(x)) for x in abc]
[('a', 97), ('b', 98), ('c', 99), ('d', 100), ('e', 101), ('f', 102), ('g', 103), ('h', 104), ('i', 105), ('j', 106), ('k', 107), ('l', 108), ('m', 109), ('n', 110), ('o', 111), ('p', 112), ('q', 113), ('r', 114), ('s', 115), ('t', 116), ('u', 117), ('v', 118), ('w', 119), ('x', 120), ('y', 121), ('z', 122)]
>>> [(x, ord(x)) for x in abc.upper()]
[('A', 65), ('B', 66), ('C', 67), ('D', 68), ('E', 69), ('F', 70), ('G', 71), ('H', 72), ('I', 73), ('J', 74), ('K', 75), ('L', 76), ('M', 77), ('N', 78), ('O', 79), ('P', 80), ('Q', 81), ('R', 82), ('S', 83), ('T', 84), ('U', 85), ('V', 86), ('W', 87), ('X', 88), ('Y', 89), ('Z', 90)]

