'is' Operator Behaves Differently When Comparing Strings with Spaces

Warning: this answer is about the implementation details of a specific Python interpreter. Comparing strings with is is a bad idea.

Well, at least for CPython 3.4/2.7.3, the answer is "no, it is not the whitespace", or at least not only the whitespace:

  • Two string literals will share memory if they are either alphanumeric or reside in the same block (file, function, class or single interpreter command).

  • An expression that evaluates to a string will result in an object that is identical to the one created using a string literal, if and only if it is created using constants and binary/unary operators, and the resulting string is shorter than 21 characters.

  • Single characters are unique.

Examples

Alphanumeric string literals always share memory:

>>> x='Is' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with Spacesaaaaaaa'
>>> y='Is' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with Spacesaaaaaaa'
>>> x is y
True

Non-alphanumeric string literals share memory if and only if they share the enclosing syntactic block:

(interpreter)

>>> x='`!@#$%^&*() \][=-. >:"?<a'; y='`!@#$%^&*() \][=-. >:"?<a';
>>> z='`!@#$%^&*() \][=-. >:"?<a';
>>> x is y
True
>>> x is z
False

(file)

x='`!@#$%^&*() \][=-. >:"?<a';
y='`!@#$%^&*() \][=-. >:"?<a';
z=(lambda : '`!@#$%^&*() \][=-. >:"?<a')()
print(x is y)
print(x is z)

Output: True and False
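
The same-block rule can also be checked without the interactive prompt by compiling source text by hand. This is only a sketch, assuming CPython; the helper name run_snippet is made up for the illustration. Constants inside one compiled unit are stored once, so both names end up bound to the same object. What happens across two separate compilations depends on the interpreter version, so no result is claimed for that part.

```python
def run_snippet(source):
    """Compile and run `source` as one unit and return its namespace."""
    namespace = {}
    exec(compile(source, "<snippet>", "exec"), namespace)
    return namespace

# Both literals live in the same code object, so CPython keeps the
# constant only once and x and y are bound to that single object.
ns = run_snippet("x = 'abc abc!'\ny = 'abc abc!'\n")
same_unit = ns["x"] is ns["y"]
print(same_unit)

# Two separate compilations produce two code objects. Whether their
# equal constants are shared depends on the interpreter version.
first = run_snippet("x = 'abc abc!'\n")["x"]
second = run_snippet("x = 'abc abc!'\n")["x"]
print(first == second, first is second)
```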

For simple binary operations, the compiler does very simple constant folding (see peephole.c), but with strings it does so only if the resulting string is shorter than 21 characters. If this is the case, the rules mentioned earlier are in force:

>>> 'a'*10+'a'*10 is 'a'*20
True
>>> 'a'*21 is 'a'*21
False
>>> 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' is 'aaaaaaaaaa' + 'aaaaaaaaaaaaaaaaaaaa'
False
>>> t=2; 'a'*t is 'aa'
False
>>> 'a'.__add__('a') is 'aa'
False
>>> x='a' ; x+='a'; x is 'aa'
False
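
One way to see the folding itself, rather than its side effect on is, is to look at the constants stored in a compiled code object. This is a small sketch assuming CPython; the names passed to compile() are arbitrary labels, and the length limit for folded strings is not the same in every version.

```python
# Compile two tiny programs and inspect the constants the compiler kept.
folded = compile("x = 'a' * 10", "<folded>", "exec")
runtime = compile("t = 10\nx = 'a' * t", "<runtime>", "exec")

# With constant operands the product is computed at compile time and
# stored as a ready-made constant; with a variable operand it is not.
print("aaaaaaaaaa" in folded.co_consts)
print("aaaaaaaaaa" in runtime.co_consts)
```

In the second program the multiplication only happens when the code runs, which is why 'a'*t above produces a fresh object each time.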

Single characters always share memory, of course:

>>> chr(0x20) is ' '
True

Why does a space affect the identity comparison of equal strings?

The Python interpreter caches some strings based on certain criteria. The first string, abc, is cached and reused for both names, but the second is not. The same happens for small ints from -5 to 256.

Because the strings are interned/cached, assigning "abc" to both a and b makes them point to the same object in memory, so is, which checks whether two objects are actually the same object, returns True.

The second string, abc abc, is not cached, so the two names refer to entirely different objects in memory and our identity check using is returns False. This time a is not b.

In [43]: a = "abc" # python caches abc
In [44]: b = "abc" # it reuses the object when assigning to b
In [45]: id(a)
Out[45]: 139806825858808 # same id's, same object in memory
In [46]: id(b)
Out[46]: 139806825858808
In [47]: a = 'abc abc' # not cached
In [48]: id(a)
Out[48]: 139806688800984
In [49]: b = 'abc abc'
In [50]: id(b) # different id's different objects
Out[50]: 139806688801208

The criterion for caching a string is that it contains only letters, underscores and digits, so in your case the space does not meet the criterion.

In the interpreter there is one case where you can end up pointing to the same object even when the string does not meet the above criterion: multiple assignment on a single line.

In [51]: a,b  = 'abc abc','abc abc'

In [52]: id(a)
Out[52]: 139806688801768

In [53]: id(b)
Out[53]: 139806688801768

In [54]: a is b
Out[54]: True

Looking at the codeobject.c source, we see that NAME_CHARS decides what can be interned:

#define NAME_CHARS \
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

/* all_name_chars(s): true iff all chars in s are valid NAME_CHARS */

static int
all_name_chars(unsigned char *s)
{
    static char ok_name_char[256];
    static unsigned char *name_chars = (unsigned char *)NAME_CHARS;

    if (ok_name_char[*name_chars] == 0) {
        unsigned char *p;
        for (p = name_chars; *p; p++)
            ok_name_char[*p] = 1;
    }
    while (*s) {
        if (ok_name_char[*s++] == 0)
            return 0;
    }
    return 1;
}
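
For readers who prefer Python to C, the check above can be sketched roughly as follows. This is an illustrative port, not code from CPython; it only mirrors the logic of all_name_chars with the same alphabet.

```python
import string

# Same alphabet as NAME_CHARS in the C source above.
NAME_CHARS = frozenset(string.digits + string.ascii_letters + "_")

def all_name_chars(s):
    """Return True iff every character of s is a valid name character."""
    return all(ch in NAME_CHARS for ch in s)

print(all_name_chars("abc"))      # True
print(all_name_chars("abc abc"))  # False, the space is not in NAME_CHARS
```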

A string of length 0 or 1 will always be shared, as we can see in the PyString_FromStringAndSize function in the stringobject.c source:

    /* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1 && str != NULL) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}

Not directly related to the question, but for those interested, PyCode_New, also from the codeobject.c source, shows how more strings are interned when building a code object, provided the strings meet the criteria in all_name_chars:

PyCodeObject *
PyCode_New(int argcount, int nlocals, int stacksize, int flags,
           PyObject *code, PyObject *consts, PyObject *names,
           PyObject *varnames, PyObject *freevars, PyObject *cellvars,
           PyObject *filename, PyObject *name, int firstlineno,
           PyObject *lnotab)
{
    PyCodeObject *co;
    Py_ssize_t i;
    /* Check argument types */
    if (argcount < 0 || nlocals < 0 ||
        code == NULL ||
        consts == NULL || !PyTuple_Check(consts) ||
        names == NULL || !PyTuple_Check(names) ||
        varnames == NULL || !PyTuple_Check(varnames) ||
        freevars == NULL || !PyTuple_Check(freevars) ||
        cellvars == NULL || !PyTuple_Check(cellvars) ||
        name == NULL || !PyString_Check(name) ||
        filename == NULL || !PyString_Check(filename) ||
        lnotab == NULL || !PyString_Check(lnotab) ||
        !PyObject_CheckReadBuffer(code)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    intern_strings(names);
    intern_strings(varnames);
    intern_strings(freevars);
    intern_strings(cellvars);
    /* Intern selected string constants */
    for (i = PyTuple_Size(consts); --i >= 0; ) {
        PyObject *v = PyTuple_GetItem(consts, i);
        if (!PyString_Check(v))
            continue;
        if (!all_name_chars((unsigned char *)PyString_AS_STRING(v)))
            continue;
        PyString_InternInPlace(&PyTuple_GET_ITEM(consts, i));
    }

This answer is based on simple assignments using the CPython interpreter. Interning in relation to functions or any other functionality outside of simple assignments was not asked about and is not covered here.

If anyone with a greater understanding of C code has anything to add, feel free to edit.

There is a much more thorough explanation here of the whole string interning.

python is operator behaviour with string

One important thing about this behavior is that Python caches some, mostly short, strings (usually fewer than 20 characters, but not every combination of them) so that they can be accessed quickly. One important reason is that strings are used widely in Python's own source code, so caching certain kinds of strings is an internal optimization. Dictionaries are among the most commonly used data structures in Python's source code: they hold variables, attributes and namespaces in general, among other purposes, and they all use strings as the object names. That is to say, every time you access an object attribute or a global variable, a dictionary lookup typically happens internally.

Now, the reason you got such bizarre behavior is that Python (the CPython implementation) treats strings differently when it comes to interning. In CPython's source code there is an intern_string_constants function that decides which strings qualify for interning; you can check it for more details. Or see this comprehensive article: http://guilload.com/python-string-interning/.

It's also noteworthy that Python has an intern() function in the sys module that you can use to intern strings manually.

In [52]: b = sys.intern('a,,')

In [53]: c = sys.intern('a,,')

In [54]: b is c
Out[54]: True

You can use this function either when you want to speed up dictionary lookups or when you need to use a particular string object frequently in your code.
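
As a rough illustration of manual interning, the sketch below builds two equal strings at runtime and then interns them. The key names and the table dictionary are invented for the example; the remark about dictionary lookups describes CPython's behavior as I understand it.

```python
import sys

# Build two equal strings at runtime so the compiler cannot merge them.
key_a = "".join(["user", "-id"])
key_b = "".join(["user-", "id"])

interned_a = sys.intern(key_a)
interned_b = sys.intern(key_b)

print(key_a == key_b)            # True: same characters
print(interned_a is interned_b)  # True: interning returns one shared object

# With interned keys a dict lookup can succeed on the identity check
# alone, before any character-by-character comparison is needed.
table = {interned_a: 42}
print(table[interned_b])         # 42
```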

Another point that you should not confuse with string interning is that when you do a = b, you're creating two references to the same object, so it is obvious that both names have the same id.

Regarding punctuation, it seems that a string consisting of a single punctuation character gets interned, but if the length is more than one it won't get cached. As mentioned in the comments, one reason might be that keywords and dictionary keys are less likely to contain punctuation.

In [28]: a = ','

In [29]: ',' is a
Out[29]: True

In [30]: a = 'abc,'

In [31]: 'abc,' is a
Out[31]: False

In [34]: a = ',,'

In [35]: ',,' is a
Out[35]: False

# Or

In [36]: a = '^'

In [37]: '^' is a
Out[37]: True

In [38]: a = '^%'

In [39]: '^%' is a
Out[39]: False

But still, these are just some speculations that you cannot rely on in your code.

Python: what difference between 'is' and '=='?

is checks for identity. a is b is True iff a and b are the same object (they are both stored in the same memory address).

== checks for equality, which is usually defined by the magic method __eq__ - i.e., a == b is True if a.__eq__(b) is True.
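
To make the difference concrete, here is a minimal sketch with a made-up Point class (not from the question) that defines its own __eq__:

```python
class Point:
    """Hypothetical value class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

p = Point(1, 2)
q = Point(1, 2)   # equal value, separate object
r = p             # second name for the same object

print(p == q)  # True: __eq__ compares the coordinates
print(p is q)  # False: two distinct objects
print(p is r)  # True: both names refer to one object
```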

In your case specifically, Python optimizes the two hardcoded strings into the same object (since strings are immutable, there's no danger in that). Since input() will create a string at runtime, it can't do that optimization, so a new string object is created.

What determines which strings are interned and when?

String interning is implementation specific and shouldn't be relied upon. Use equality testing (==) if you want to check whether two strings have the same value.

Confused about `is` operator with strings

I believe it has to do with string interning. In essence, the idea is to store only a single copy of each distinct string, to increase performance on some operations.

Basically, the reason why a is b works is because (as you may have guessed) there is a single immutable string that is referenced by Python in both cases. When a string is large (and some other factors that I don't understand, most likely), this isn't done, which is why your second example returns False.

EDIT: And in fact, the odd behavior seems to be a side-effect of the interactive environment. If you take your same code and place it into a Python script, both a is b and ktr is ptr return True.

a="poi"
b="poi"
print a is b # Prints 'True'

ktr = "today is a fine day"
ptr = "today is a fine day"
print ktr is ptr # Prints 'True'

This makes sense, since it'd be easy for Python to parse a source file and look for duplicate string literals within it. If you create the strings dynamically, then it behaves differently even in a script.

a="p" + "oi"
b="po" + "i"
print a is b # Oddly enough, prints 'True'

ktr = "today is" + " a fine day"
ptr = "today is a f" + "ine day"
print ktr is ptr # Prints 'False'

As for why a is b still results in True, perhaps the allocated string is small enough to warrant a quick search through the interned collection, whereas the other one is not?

Special characters in string in Python

You misunderstood what the is operator tests. It tests whether two variables point to the same object, not whether two variables have the same value.

From the documentation for the is operator:

The operators is and is not test for object identity: x is y is true if and only if x and y are the same object.

I suggest you use the equality operator (==).

Here is how to use the equality operator for your scenario:

a2="+abc123"
b2="+abc123"

print(a2 == b2)

Output:

True

You can also use the id() function, which returns the "identity" of an object. This is an integer which is guaranteed to be unique and constant for the object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.

Test for a2's id

id(a2)

Output:

1600416721776

Test for b2's id

id(b2)

Output:

1600416732976

Using the id() function you can see that they do not point to the same object.
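
The relationship between id() and is can be sketched like this. The second string is deliberately built at runtime, which on CPython gives a separate object; the name alias is just for the illustration.

```python
a2 = "+abc123"
b2 = "".join(["+abc", "123"])  # built at runtime instead of written as a literal
alias = a2                     # second name for the same object

print(a2 == b2)                          # True: equal values
print(id(a2) == id(alias))               # True: same object, same id
print((id(a2) == id(b2)) == (a2 is b2))  # True: comparing ids agrees with `is`
```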

It is worth mentioning that it is a bad idea to use the is operator for strings, because whether two equal strings are the same object depends on implementation details: alphanumeric strings always share memory, while non-alphanumeric strings share memory if and only if they share the same block (function, object, line, file):

Alphanumeric strings:

a='abc'
b='abc'
print('a is b:', a is b)

Output:

a is b: True

Non-alphanumeric strings:

A='+abc123'; B='+abc123';
C='+abc123';
print('A is B:', A is B)
print('A is C:', A is C)

Output:

A is B: True 

A is C: False

Python 'is' vs JavaScript ===

Python Part

SO can you have two different instances of (say) a string of "Bob" and
have them not return true when compared using 'is'? Or is it infact
the same as ===?

a = "Bob"
b = "{}".format("Bob")
print a, b
print a is b, a == b

Output

Bob Bob
False True

Note: In most Python implementations, compile-time strings are interned.

Another example,

print 3 is 2+1
print 300 is 200+100

Output

True
False

This is because small ints (-5 to 256) are cached internally in CPython. Whenever they are used in a program, the cached integers are used, so is returns True for them. If we choose bigger numbers, as in the second example (300 is 200+100), the result is not True, because they are NOT cached.
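
A version of this experiment that does not depend on how the compiler treats literals is to create the integers at runtime. This is a sketch assuming CPython; the variable names are arbitrary.

```python
# Parsing from strings forces the objects to be created at runtime,
# so compile-time constant sharing cannot influence the result.
small_a = int("256")
small_b = int("256")
big_a = int("100000000000000000000")
big_b = int("100000000000000000000")

print(small_a is small_b)              # cached small int: one shared object
print(big_a == big_b, big_a is big_b)  # equal values, but separate objects
```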

Conclusion:

is will return True only when the objects being compared are the same object, which means they point to the same location in memory. (Whether objects are cached/interned depends solely on the Python implementation; when they are, is will return True.)

Rule of thumb:

NEVER use the is operator to check if two objects have the same value.


JavaScript Part

The other part of your question is about the === operator. Let's see how that operator works.

Quoting from the ECMAScript 5.1 specification, the Strict Equality Comparison Algorithm is defined like this:

  1. If Type(x) is different from Type(y), return false.
  2. If Type(x) is Undefined, return true.
  3. If Type(x) is Null, return true.
  4. If Type(x) is Number, then

    1. If x is NaN, return false.
    2. If y is NaN, return false.
    3. If x is the same Number value as y, return true.
    4. If x is +0 and y is −0, return true.
    5. If x is −0 and y is +0, return true.
    6. Return false.
  5. If Type(x) is String, then return true if x and y are exactly the
    same sequence of characters (same length and same characters in
    corresponding positions); otherwise, return false.
  6. If Type(x) is Boolean, return true if x and y are both true or both
    false; otherwise, return false.
  7. Return true if x and y refer to the same object. Otherwise, return
    false.
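
To compare the two operators side by side, the steps above can be loosely sketched in Python. This is an approximation for illustration only: js_type and strict_equals are invented names, the mapping from Python values to ECMAScript types is my own assumption, and Undefined has no Python counterpart, so that step is left out.

```python
def js_type(value):
    """Map a Python value to a rough ECMAScript type name (an assumption)."""
    if value is None:
        return "Null"
    if isinstance(value, bool):          # test bool first: bool subclasses int
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, str):
        return "String"
    return "Object"

def strict_equals(x, y):
    """Loose sketch of the Strict Equality Comparison steps."""
    kind = js_type(x)
    if kind != js_type(y):
        return False
    if kind == "Null":
        return True
    if kind == "Number":
        if x != x or y != y:             # NaN is the only value unequal to itself
            return False
        return x == y                    # +0 and -0 compare equal here too
    if kind in ("String", "Boolean"):
        return x == y                    # compared by value
    return x is y                        # objects: identity only

a = "Bob"
b = "{}".format("Bob")
print(strict_equals(a, b))                        # True
print(strict_equals(float("nan"), float("nan")))  # False
print(strict_equals([], []))                      # False
```

Only the final return statement corresponds to what Python's is does; every earlier branch compares by value.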

Final Conclusion

We can NOT treat Python's is operator and JavaScript's === operator as equivalent, because Python's is operator does only the last item in the Strict Equality Comparison Algorithm.

7. Return true if x and y refer to the same object. Otherwise, return false.

