'is' operator behaves differently when comparing strings with spaces
Warning: this answer is about the implementation details of a specific Python interpreter. Comparing strings with is is a bad idea.
Well, at least for CPython 3.4/2.7.3, the answer is "no, it is not the whitespace". Or rather, not only the whitespace:
- Two string literals will share memory if they are either alphanumeric or reside in the same block (file, function, class or single interpreter command).
- An expression that evaluates to a string will result in an object that is identical to the one created using a string literal if and only if it is created using constants and binary/unary operators, and the resulting string is shorter than 21 characters.
- Single characters are unique.
Examples
Alphanumeric string literals always share memory:
>>> x='Is' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with Spacesaaaaaaa'
>>> y='Is' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with SpacesIs' Operator Behaves Differently When Comparing Strings with Spacesaaaaaaa'
>>> x is y
True
Non-alphanumeric string literals share memory if and only if they share the enclosing syntactic block:
(interpreter)
>>> x='`!@#$%^&*() \][=-. >:"?<a'; y='`!@#$%^&*() \][=-. >:"?<a';
>>> z='`!@#$%^&*() \][=-. >:"?<a';
>>> x is y
True
>>> x is z
False
(file)
x='`!@#$%^&*() \][=-. >:"?<a';
y='`!@#$%^&*() \][=-. >:"?<a';
z=(lambda : '`!@#$%^&*() \][=-. >:"?<a')()
print(x is y)
print(x is z)
Output: True and False
For simple binary operations, the compiler does very simple constant folding (see peephole.c), but with strings it does so only if the resulting string is shorter than 21 characters. If this is the case, the rules mentioned earlier are in force:
>>> 'a'*10+'a'*10 is 'a'*20
True
>>> 'a'*21 is 'a'*21
False
>>> 'aaaaaaaaaaaaaaaaaaaaaaaaa' is 'aaaaaaaaaa' + 'aaaaaaaaaaaaaaa'
False
>>> t=2; 'a'*t is 'aa'
False
>>> 'a'.__add__('a') is 'aa'
False
>>> x='a' ; x+='a'; x is 'aa'
False
Single characters always share memory, of course:
>>> chr(0x20) is ' '
True
Why does a space affect the identity comparison of equal strings?
The Python interpreter caches some strings based on certain criteria. The first string, abc, is cached and used for both names, but the second is not. The same happens for small ints from -5 to 256.
Because the strings are interned/cached, assigning a and b to "abc" makes a and b point to the same object in memory, so using is, which checks whether two objects are actually the same object, returns True.
The second string, abc abc, is not cached, so the two are entirely different objects in memory and our identity check using is returns False. This time a is not b: they point to different objects in memory.
In [43]: a = "abc" # python caches abc
In [44]: b = "abc" # it reuses the object when assigning to b
In [45]: id(a)
Out[45]: 139806825858808 # same id's, same object in memory
In [46]: id(b)
Out[46]: 139806825858808
In [47]: a = 'abc abc' # not cached
In [48]: id(a)
Out[48]: 139806688800984
In [49]: b = 'abc abc'
In [50]: id(b) # different id's different objects
Out[50]: 139806688801208
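Comparing id() values like this is just another way of writing an is check. A minimal sketch of that equivalence follows; the helper name same_object is made up for illustration, and lists are used instead of strings because two list displays always produce separate objects, so the result does not depend on interning:

```python
def same_object(x, y):
    """Return True when x and y are the very same object.

    For two objects that are both alive, `x is y` and
    `id(x) == id(y)` always agree.
    """
    return id(x) == id(y)


first = ["abc abc"]
second = ["abc abc"]   # equal contents, but a separate list object
alias = first          # a second name for the same object

print(first == second, same_object(first, second))  # True False
print(first == alias, same_object(first, alias))    # True True
```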
The criterion for caching a string is that it contains only letters, underscores and digits, so in your case the space means the string does not qualify.
In the interactive interpreter there is one case where you can end up pointing to the same object even when the string does not meet the above criterion: multiple assignment in a single statement.
In [51]: a,b = 'abc abc','abc abc'
In [52]: id(a)
Out[52]: 139806688801768
In [53]: id(b)
Out[53]: 139806688801768
In [54]: a is b
Out[54]: True
Looking at the codeobject.c source for the criteria, we see that NAME_CHARS decides what can be interned:
#define NAME_CHARS \
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

/* all_name_chars(s): true iff all chars in s are valid NAME_CHARS */
static int
all_name_chars(unsigned char *s)
{
    static char ok_name_char[256];
    static unsigned char *name_chars = (unsigned char *)NAME_CHARS;

    if (ok_name_char[*name_chars] == 0) {
        unsigned char *p;
        for (p = name_chars; *p; p++)
            ok_name_char[*p] = 1;
    }
    while (*s) {
        if (ok_name_char[*s++] == 0)
            return 0;
    }
    return 1;
}
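For readers who do not want to read the C, the same check can be sketched in Python. This is only an illustrative port of the function quoted above, not something CPython exposes; the Python names simply mirror the C ones:

```python
import string

# Same character set as the NAME_CHARS macro: ASCII letters, digits, underscore.
NAME_CHARS = frozenset(string.ascii_letters + string.digits + "_")


def all_name_chars(s):
    """Return True iff every character of s is in NAME_CHARS."""
    return all(ch in NAME_CHARS for ch in s)


print(all_name_chars("abc"))      # True
print(all_name_chars("abc abc"))  # False: the space is not a name character
print(all_name_chars("+abc123"))  # False: neither is '+'
```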
A string of length 0 or 1 will always be shared, as we can see in the PyString_FromStringAndSize function in the stringobject.c source.
    /* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1 && str != NULL) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}
Not directly related to the question, but for those interested, PyCode_New (also from the codeobject.c source) shows how more strings are interned when building a code object, provided the strings meet the criteria in all_name_chars.
PyCodeObject *
PyCode_New(int argcount, int nlocals, int stacksize, int flags,
           PyObject *code, PyObject *consts, PyObject *names,
           PyObject *varnames, PyObject *freevars, PyObject *cellvars,
           PyObject *filename, PyObject *name, int firstlineno,
           PyObject *lnotab)
{
    PyCodeObject *co;
    Py_ssize_t i;
    /* Check argument types */
    if (argcount < 0 || nlocals < 0 ||
        code == NULL ||
        consts == NULL || !PyTuple_Check(consts) ||
        names == NULL || !PyTuple_Check(names) ||
        varnames == NULL || !PyTuple_Check(varnames) ||
        freevars == NULL || !PyTuple_Check(freevars) ||
        cellvars == NULL || !PyTuple_Check(cellvars) ||
        name == NULL || !PyString_Check(name) ||
        filename == NULL || !PyString_Check(filename) ||
        lnotab == NULL || !PyString_Check(lnotab) ||
        !PyObject_CheckReadBuffer(code)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    intern_strings(names);
    intern_strings(varnames);
    intern_strings(freevars);
    intern_strings(cellvars);
    /* Intern selected string constants */
    for (i = PyTuple_Size(consts); --i >= 0; ) {
        PyObject *v = PyTuple_GetItem(consts, i);
        if (!PyString_Check(v))
            continue;
        if (!all_name_chars((unsigned char *)PyString_AS_STRING(v)))
            continue;
        PyString_InternInPlace(&PyTuple_GET_ITEM(consts, i));
    }
This answer is based on simple assignments using the CPython interpreter. Interning in relation to functions, or any other functionality outside of simple assignments, was not asked about and is not covered here.
If anyone with a greater understanding of the C code has anything to add, feel free to edit.
There is a much more thorough explanation here of the whole string interning.
python is operator behaviour with string
One important thing about this behavior is that Python caches some strings, mostly short ones (usually less than 20 characters, but not every combination of them), so that they can be accessed quickly. One important reason is that strings are used heavily inside Python itself, and caching certain kinds of strings is an internal optimization. Dictionaries are one of the most widely used data structures in Python's source code: they hold variables, attributes and namespaces in general, among other purposes, and they all use strings as the object names. In other words, every time you access an object attribute or a variable (local or global), a dictionary lookup happens internally.
Now, the reason you got such bizarre behavior is that Python (the CPython implementation) does not treat all strings the same way when it comes to interning. In Python's source code there is an intern_string_constants function that decides which string constants get interned, which you can check for more details. Or check this comprehensive article: http://guilload.com/python-string-interning/.
It's also noteworthy that Python has an intern() function in the sys module that you can use to intern strings manually.
In [52]: b = sys.intern('a,,')
In [53]: c = sys.intern('a,,')
In [54]: b is c
Out[54]: True
You can use this function either when you want to speed up dictionary lookups or when you need to use a particular string object frequently in your code.
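As a rough illustration of what sys.intern() guarantees, here is a small sketch. The strings are assembled at runtime (the variable names are arbitrary) so that the compiler has no chance to merge them as constants:

```python
import sys

# Build the strings at runtime so the compiler cannot merge them as constants.
parts = ["a", ",", ","]
first = "".join(parts)
second = "".join(parts)

print(first == second)  # True: same characters

# sys.intern returns the canonical copy stored in the interned-string table,
# so both calls hand back the same object.
first_interned = sys.intern(first)
second_interned = sys.intern(second)
print(first_interned is second_interned)  # True
```

Whether first is second holds before interning is left unstated on purpose, since that part depends on the interpreter.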
Another point that you should not confuse with string interning is that when you do a = b, you're creating two references to the same object, so it is obvious that those names have the same id.
Regarding punctuation, it seems that single-character strings get interned, but strings longer than one character that contain punctuation do not. As mentioned in the comments, one reason for that might be that keywords and dictionary keys are less likely to contain punctuation.
In [28]: a = ','
In [29]: ',' is a
Out[29]: True
In [30]: a = 'abc,'
In [31]: 'abc,' is a
Out[31]: False
In [34]: a = ',,'
In [35]: ',,' is a
Out[35]: False
# Or
In [36]: a = '^'
In [37]: '^' is a
Out[37]: True
In [38]: a = '^%'
In [39]: '^%' is a
Out[39]: False
But still, these are just some speculations that you cannot rely on in your code.
Python: what difference between 'is' and '=='?
is checks for identity. a is b is True iff a and b are the same object (they are both stored at the same memory address).
== checks for equality, which is usually defined by the magic method __eq__, i.e., a == b is True if a.__eq__(b) is True.
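To make the distinction concrete, here is a small sketch with a user-defined class. Point is a hypothetical class invented for this example; it defines __eq__ so that two separate instances with the same coordinates compare equal while remaining different objects:

```python
class Point:
    """Hypothetical value class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


p = Point(1, 2)
q = Point(1, 2)

print(p == q)  # True: __eq__ compares the coordinates
print(p is q)  # False: two separate objects
print(p is p)  # True: an object is always identical to itself
```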
In your case specifically, Python optimizes the two hardcoded strings into the same object (since strings are immutable, there's no danger in that). Since input() creates a string at runtime, Python can't apply that optimization, so a new string object is created.
What determines which strings are interned and when?
String interning is implementation-specific and shouldn't be relied upon. Use equality testing (==) if you want to check whether two strings have the same value.
Confused about `is` operator with strings
I believe it has to do with string interning. In essence, the idea is to store only a single copy of each distinct string, to increase performance on some operations.
Basically, the reason why a is b works is because (as you may have guessed) there is a single immutable string that is referenced by Python in both cases. When a string is large (and some other factors that I don't understand, most likely), this isn't done, which is why your second example returns False.
EDIT: And in fact, the odd behavior seems to be a side-effect of the interactive environment. If you take your same code and place it into a Python script, both a is b and ktr is ptr return True.
a="poi"
b="poi"
print a is b # Prints 'True'
ktr = "today is a fine day"
ptr = "today is a fine day"
print ktr is ptr # Prints 'True'
This makes sense, since it'd be easy for Python to parse a source file and look for duplicate string literals within it. If you create the strings dynamically, then it behaves differently even in a script.
a="p" + "oi"
b="po" + "i"
print a is b # Oddly enough, prints 'True'
ktr = "today is" + " a fine day"
ptr = "today is a f" + "ine day"
print ktr is ptr # Prints 'False'
As for why a is b still results in True, perhaps the allocated string is small enough to warrant a quick search through the interned collection, whereas the other one is not?
Special characters in string in Python
You misunderstood what the is operator tests. It tests whether two variables point to the same object, not whether two variables have the same value.
From the documentation for the is operator:
The operators is and is not test for object identity: x is y is true if and only if x and y are the same object.
I suggest you use the equality operator (==).
Here is how to use the equality operator for your scenario:
a2="+abc123"
b2="+abc123"
print(a2 == b2)
Output:
True
You can check this with the id() function, which returns the "identity" of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
Test for a2's id
id(a2)
Output:
1600416721776
Test for b2's id
id(b2)
Output:
1600416732976
Using the id() function you can see that they do not point to the same object.
It is worth mentioning that it is a bad idea to use the is operator for strings, because the result depends on interning: alphanumeric strings always share memory, while non-alphanumeric strings share memory if and only if they share the same block (function, object, line, file):
Alphanumeric strings:
a='abc'
b='abc'
print('a is b:', a is b)
Output:
a is b: True
Non-alphanumeric strings:
A='+abc123'; B='+abc123';
C='+abc123';
print('A is B:', A is B)
print('A is C:', A is C)
Output:
A is B: True
A is C: False
Python 'is' vs JavaScript ===
Python Part
SO can you have two different instances of (say) a string of "Bob" and
have them not return true when compared using 'is'? Or is it infact
the same as ===?
a = "Bob"
b = "{}".format("Bob")
print a, b
print a is b, a == b
Output
Bob Bob
False True
Note: In most of the Python implementations, compile time Strings are interned.
Another example,
print 3 is 2+1
print 300 is 200+100
Output
True
False
This is because small ints (-5 to 256) in Python are cached internally. So, whenever they are used in a program, the cached integers are used, and is will return True for them. But if we choose bigger numbers, as in the second example (300 is 200+100), it is not True, because they are NOT cached.
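A small sketch for trying this out yourself. The helper compare_runtime_ints is a made-up name; it builds each value twice at runtime by parsing text, so that the compiler's own handling of constants does not interfere. Only the equality result is guaranteed by the language. The identity result depends on the interpreter, so no expected output is shown for it:

```python
def compare_runtime_ints(n):
    """Create the value n twice at runtime and compare the two results.

    Parsing from text avoids compile-time constants, which the compiler
    may merge on its own.
    """
    a = int(str(n))
    b = int(str(n))
    return a == b, a is b


for n in (256, 257):
    equal, identical = compare_runtime_ints(n)
    # `equal` is always True; `identical` is an implementation detail.
    print(n, equal, identical)
```

On a CPython build with the small-int cache described above, one would expect the identity column to differ between 256 and 257, but that is an assumption about the interpreter, not something to rely on.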
Conclusion:
is will return True only when the objects being compared are the same object, which means they point to the same location in memory. (Whether objects are cached/interned depends solely on the Python implementation. When they are, is will return True.)
Rule of thumb:
NEVER use the is operator to check if two objects have the same value.
JavaScript Part
The other part of your question is about the === operator. Let's see how that operator works.
Quoting from the ECMA 5.1 specs, the Strict Equality Comparison Algorithm is defined like this:
1. If Type(x) is different from Type(y), return false.
2. If Type(x) is Undefined, return true.
3. If Type(x) is Null, return true.
4. If Type(x) is Number, then
   a. If x is NaN, return false.
   b. If y is NaN, return false.
   c. If x is the same Number value as y, return true.
   d. If x is +0 and y is −0, return true.
   e. If x is −0 and y is +0, return true.
   f. Return false.
5. If Type(x) is String, then return true if x and y are exactly the same sequence of characters (same length and same characters in corresponding positions); otherwise, return false.
6. If Type(x) is Boolean, return true if x and y are both true or both false; otherwise, return false.
7. Return true if x and y refer to the same object. Otherwise, return false.
Final Conclusion
We can NOT compare Python's is operator and JavaScript's === operator, because Python's is operator does only the last item in the Strict Equality Comparison Algorithm.
7. Return true if x and y refer to the same object. Otherwise, return false.
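To see how much more === does than is, the algorithm quoted above can be modelled in Python. This is only a rough sketch under stated assumptions: UNDEFINED, js_type and strict_equals are names invented here, None stands in for null, Python ints and floats both stand in for Number, and every other value is treated as an Object:

```python
UNDEFINED = object()  # hypothetical stand-in for JavaScript's undefined


def js_type(value):
    """Map a Python value onto the ECMAScript type names used by the algorithm."""
    if value is UNDEFINED:
        return "Undefined"
    if value is None:
        return "Null"
    if isinstance(value, bool):  # checked before Number: bool subclasses int
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, str):
        return "String"
    return "Object"


def strict_equals(x, y):
    """Model of the Strict Equality Comparison Algorithm."""
    kind = js_type(x)
    if kind != js_type(y):
        return False
    if kind in ("Undefined", "Null"):
        return True
    if kind == "Number":
        if x != x or y != y:  # NaN is the only value unequal to itself
            return False
        return x == y         # also treats +0 and -0 as equal
    if kind in ("String", "Boolean"):
        return x == y         # compared by value
    return x is y             # objects: identity only, like Python's is


print(strict_equals("Bob", "{}".format("Bob")))        # True
print(strict_equals(float("nan"), float("nan")))       # False
print(strict_equals(0.0, -0.0))                        # True
print(strict_equals([], []))                           # False
```

Only the final return uses identity; strings, numbers and booleans are compared by value, which is why "Bob" === "Bob" style comparisons succeed in JavaScript regardless of how the strings were created.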