Why Does Comparing Strings Using Either '==' or 'Is' Sometimes Produce a Different Result

Why does comparing strings using either '==' or 'is' sometimes produce a different result?

is is identity testing; == is equality testing. What happens in your code can be reproduced in the interpreter like this:

>>> a = 'pub'
>>> b = ''.join(['p', 'u', 'b'])
>>> a == b
True
>>> a is b
False

So, no wonder they're not the same, right?

In other words: a is b is the equivalent of id(a) == id(b).
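
A minimal sketch of that equivalence, reusing the names a and b from the session above. It assumes CPython, where building b with join at runtime yields a separate object; other implementations may behave differently.

```python
a = 'pub'
b = ''.join(['p', 'u', 'b'])   # built at runtime, so normally a new string object

same_value = (a == b)          # compares the contents
same_object = (a is b)         # compares the identity
same_id = (id(a) == id(b))     # what `is` amounts to while both objects are alive

print(same_value, same_object, same_id)
```

Whatever the interpreter decides about sharing, same_object and same_id always agree, while same_value depends only on the characters.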

Why is it faster to compare strings that match than strings that do not?

Combining my comment and the comment by @khelwood:

TL;DR:

Analysing the bytecode for the two comparisons reveals that the two 'time' strings are the same object. Therefore, an up-front identity check (at C-level) is the reason for the increased comparison speed.

The reason both literals end up as the same object is that, as an implementation detail, CPython interns strings which contain only 'name characters' (i.e. ASCII letters, digits, and underscores). This is what allows the identity check to succeed.
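
Interning can also be requested explicitly with sys.intern, which makes the effect easier to observe than relying on what the compiler happens to intern. The sketch below is illustrative only; the variable names are made up for this example.

```python
import sys

parts = ['ti', 'me']
built = ''.join(parts)          # equal to 'time', but created at runtime
interned = sys.intern(built)    # returns the canonical interned object for this value
literal = sys.intern('time')    # the same canonical object

print(built == literal)         # True: same characters
print(interned is literal)      # True: both names refer to the interned object
```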


Bytecode:

import dis

In [24]: dis.dis("'time'=='time'")
  1           0 LOAD_CONST               0 ('time')  # <-- same object (0)
              2 LOAD_CONST               0 ('time')  # <-- same object (0)
              4 COMPARE_OP               2 (==)
              6 RETURN_VALUE

In [25]: dis.dis("'time'=='1234'")
  1           0 LOAD_CONST               0 ('time')  # <-- different object (0)
              2 LOAD_CONST               1 ('1234')  # <-- different object (1)
              4 COMPARE_OP               2 (==)
              6 RETURN_VALUE
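
The same thing can be checked without reading disassembly by looking at the constants table of the compiled code. This is a sketch, not part of the original answer: it uses assignments rather than the bare comparison so that the result does not depend on whether a given CPython version folds constant expressions.

```python
# Compile two tiny snippets and inspect the string constants they carry.
same = compile("x = 'time'; y = 'time'", "<example>", "exec")
different = compile("x = 'time'; y = '1234'", "<example>", "exec")

same_strings = [c for c in same.co_consts if isinstance(c, str)]
different_strings = [c for c in different.co_consts if isinstance(c, str)]

print(same_strings)       # a single entry: both literals share one constant
print(different_strings)  # two entries: one constant per distinct literal
```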

Assignment Timing:

The 'speed-up' can also be seen when assignment is used in the timing tests. Assigning (and comparing) two variables bound to the same string is faster than assigning (and comparing) two variables bound to different strings. This further supports the hypothesis that the underlying logic performs an object identity comparison first, which is confirmed in the next section.

In [26]: timeit.timeit("x='time'; y='time'; x==y", number=1000000)
Out[26]: 0.0745926329982467

In [27]: timeit.timeit("x='time'; y='1234'; x==y", number=1000000)
Out[27]: 0.10328884399496019
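
With four-character strings the difference is small and noisy. A rough way to make it more visible, assuming CPython, is to compare a large string against itself and against an equal copy that lives in a separate object. The sizes and names below are arbitrary choices for illustration, and actual timings will vary by machine.

```python
import timeit

big = 'x' * 1_000_000
same_object = big                          # second name for the same object
equal_copy = ''.join(['x'] * 1_000_000)    # equal contents, separate object

# Identity shortcut applies: no characters need to be examined.
identity_time = timeit.timeit(lambda: big == same_object, number=1000)
# No shortcut: the contents have to be compared.
content_time = timeit.timeit(lambda: big == equal_copy, number=1000)

print(f"same object: {identity_time:.6f} s")
print(f"equal copy:  {content_time:.6f} s")
```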

Python source code:

As helpfully pointed out by @mkrieger1 and @Masklinn in their comments, the source code in unicodeobject.c performs a pointer comparison first and, if the pointers are equal, returns immediately.

int
_PyUnicode_Equal(PyObject *str1, PyObject *str2)
{
    assert(PyUnicode_CheckExact(str1));
    assert(PyUnicode_CheckExact(str2));
    if (str1 == str2) { // <-- Here
        return 1;
    }
    if (PyUnicode_READY(str1) || PyUnicode_READY(str2)) {
        return -1;
    }
    return unicode_compare_eq(str1, str2);
}

Appendix:

  • Reference answer nicely illustrating how to read the disassembled bytecode output. Courtesy of @Delgan
  • Reference answer which nicely describes CPython's string interning. Courtesy of @ShadowRanger

Struggling to understand a specific behaviour of the is-operator

In short:

== is for value equality and is is for reference equality (a is b is the same as id(a) == id(b)). CPython caches some small objects (small ints, some strs, etc.) to save space, a feature that has existed since Python 2.

My original detailed answer with examples:

Because they are exactly the same!

is will return True if two variables point to the same object; you can check the ids to confirm this!

Try this:

a = 'Test'
b = 'Test'
print(a is b)
print(id(a),id(b))

My output was:

True
140586094600464 140586094600464

So, to save space, Python points both names at the same location until a change is made.

Example:

a = 'Test'
b = 'Test'
print(a is b)
print(id(a),id(b))
a = 'Test'
b += 'Changed'
print(a is b)
print(id(a),id(b))

Output:

True
140586094600464 140586094600464
False
140586094600464 140585963428528

Once you make a change, the result gets a new location in memory, because strings are immutable!

With something mutable like a list, two objects get separate locations even if their contents are the same, so that each one can be changed independently!

#mutable
a = [1,2]
b = [1,2]
print(a is b)
print(id(a),id(b))
a[0] = -1
b[1] = -2
print(a is b)
print(id(a),id(b))

Output:

False
140586430241096 140585963716680
False
140586430241096 140585963716680

Integer example:

a = 100
b = 100
print(a is b)
print(id(a),id(b))

Output:

True
10917664 10917664

Why do two different equality tests between two strings of same length take different amounts of time

Python, like many languages, keeps a pool of string literals. In the second case both strings are, in fact, the same object, so they are compared by reference, not by their actual values.

Typical string comparison function in reference-based languages:

if (ref(a) == ref(b)) return true;
if (len(a) != len(b)) return false;
return compare_actual_data(a, b);
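
The pseudocode above could be written in Python roughly as follows. This is only a model of the three steps; the function name strings_equal is made up for this sketch, and real implementations do the final step in optimized native code rather than character by character.

```python
def strings_equal(a: str, b: str) -> bool:
    """Model of a typical string comparison with an identity shortcut."""
    if a is b:                 # same object: equal without looking at the data
        return True
    if len(a) != len(b):       # different lengths: cannot be equal
        return False
    # Only now compare the actual contents, one character at a time.
    return all(x == y for x, y in zip(a, b))


print(strings_equal('time', 'time'))    # True
print(strings_equal('time', '1234'))    # False
print(strings_equal('time', 'times'))   # False
```

The first two checks are cheap and independent of string length, which is why matching references and mismatched lengths are both resolved quickly.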

Python: Why operator is and == are sometimes interchangeable for strings?

Short strings are interned for efficiency, so they will refer to the same object, and therefore is will be true.

This is an implementation detail in CPython, and is absolutely not to be relied on.
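
In practice this means using == whenever the value matters and keeping is for singletons such as None. A small sketch, with a hypothetical is_admin helper invented for illustration:

```python
def is_admin(role):
    """Hypothetical helper: compare by value, test None by identity."""
    if role is None:           # identity is appropriate for the None singleton
        return False
    return role == 'admin'     # value comparison works however the string was built


runtime_role = ''.join(['ad', 'min'])   # equal to 'admin', built at runtime
print(is_admin(runtime_role))   # True
print(is_admin(None))           # False
```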

Unexpected result comparing strings with `==`

The problem you've encountered here is due to recycling (not the eco-friendly kind). When applying an operation to two vectors that requires them to be the same length, R often automatically recycles, or repeats, the shorter one until it is long enough to match the longer one. Your unexpected results are due to the fact that R recycles the vector c("p", "o") to length 4 (the length of the longer vector) and essentially converts it to c("p", "o", "p", "o"). If we compare c("p", "o", "p", "o") and c("p", "o", "l", "o"), we get the same unexpected results as above:

c("p", "o", "p", "o") == c("p", "o", "l", "o")
#> [1] TRUE TRUE FALSE TRUE

It's not exactly clear to me why you would expect the result to be TRUE TRUE FALSE FALSE, as it's somewhat of an ambiguous comparison to compare a length-2 vector to a length-4 vector, and recycling the length-2 vector (which is what R is doing) seems to be the most reasonable default aside from throwing an error.
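
For readers more familiar with Python, the recycling behaviour can be imitated with itertools. This is only an emulation written for illustration (Python lists do not recycle on their own), and the function name recycled_equal is made up here.

```python
from itertools import cycle, islice


def recycled_equal(left, right):
    """Element-wise ==, repeating the shorter sequence as R would."""
    if not left or not right:
        return []
    length = max(len(left), len(right))
    # Repeat each sequence until it reaches the longer length.
    left_cycled = islice(cycle(left), length)
    right_cycled = islice(cycle(right), length)
    return [a == b for a, b in zip(left_cycled, right_cycled)]


result = recycled_equal(["p", "o"], ["p", "o", "l", "o"])
print(result)   # [True, True, False, True]
```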

doesn't works when comparing argv[] strings

When you use ==, it compares the addresses, not the contents. You declared tempStr as a pointer to a string literal, and compared it to the same string literal. The compiler noticed that the literals are the same, so it used the same memory for both of them, and that made the addresses the same.

You can't count on this being true; it's a compiler optimization to combine these identical strings.

If you change the declaration to

char tempStr[] = "string";

you would not get the same result with ==. In this case, tempStr is a local array, while "string" is a static string literal, and they will be in different memory locations.


