Does Python intern strings?
This is called interning, and yes, Python does do this to some extent, for shorter strings created as string literals. See About the changing id of an immutable string for some discussion.
Interning is runtime dependent; there is no standard for it. Interning is always a trade-off between memory use and the cost of checking whether you are creating the same string. There is the sys.intern() function to force the issue if you are so inclined, which documents some of the interning Python does for you automatically:
Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
Note that in Python 2 the intern() function used to be a built-in, no import necessary.
Python string interning
This is implementation-specific, but your interpreter is probably interning compile-time constants but not the results of run-time expressions.
In what follows CPython 3.9.0+ is used.
In the second example, the expression "strin" + "g" is evaluated at compile time and is replaced with "string". This makes the first two examples behave the same.
If we examine the bytecodes, we'll see that they are exactly the same:
# s1 = "string"
1 0 LOAD_CONST 0 ('string')
2 STORE_NAME 0 (s1)
# s2 = "strin" + "g"
2 4 LOAD_CONST 0 ('string')
6 STORE_NAME 1 (s2)
This bytecode was obtained with the following (which prints a few more lines after the above):
import dis
source = 's1 = "string"\ns2 = "strin" + "g"'
code = compile(source, '', 'exec')
dis.dis(code)  # dis.dis() prints the listing itself and returns None
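The exact bytecode layout differs between CPython versions, so a less version-sensitive way to see the folding is to look at the constants stored on the compiled code object. This is a minimal sketch that assumes CPython (other implementations may not fold the expression); the name folded is only illustrative.

```python
# Assumption: CPython, whose compiler folds constant expressions.
code = compile('s2 = "strin" + "g"', "<demo>", "exec")

# If the concatenation was folded at compile time, the finished
# string is already stored among the code object's constants.
folded = "string" in code.co_consts
print(folded)
```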
The third example involves a run-time concatenation, the result of which is not automatically interned:
# s3a = "strin"
3 8 LOAD_CONST 1 ('strin')
10 STORE_NAME 2 (s3a)
# s3 = s3a + "g"
4 12 LOAD_NAME 2 (s3a)
14 LOAD_CONST 2 ('g')
16 BINARY_ADD
18 STORE_NAME 3 (s3)
20 LOAD_CONST 3 (None)
22 RETURN_VALUE
This bytecode was obtained with the following (which prints a few more lines before the above; those lines are exactly the first block of bytecode given above):
import dis
source = (
    's1 = "string"\n'
    's2 = "strin" + "g"\n'
    's3a = "strin"\n'
    's3 = s3a + "g"')
code = compile(source, '', 'exec')
dis.dis(code)  # dis.dis() prints the listing itself and returns None
If you were to manually sys.intern() the result of the third expression, you'd get the same object as before:
>>> import sys
>>> s3a = "strin"
>>> s3 = s3a + "g"
>>> s3 is "string"
False
>>> sys.intern(s3) is "string"
True
Also, Python 3.9 prints a warning for the last two statements above:
SyntaxWarning: "is" with a literal. Did you mean "=="?
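The same comparison can be written as a script that avoids using "is" against a literal, and therefore avoids the warning. This is a small sketch; the variable names are illustrative, and whether the run-time result shares an object with the literal is an implementation detail, so no output is claimed for that line.

```python
import sys

literal = "string"
prefix = "strin"
built = prefix + "g"  # concatenated at run time, not folded by the compiler

same_value = built == literal     # equality compares contents
same_object = built is literal    # identity: implementation-dependent
# sys.intern returns the canonical object for a given value,
# so both calls must hand back the very same object.
interned_same = sys.intern(built) is sys.intern(literal)

print(same_value, same_object, interned_same)
```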
What does sys.intern() do and when should it be used?
From the Python 3 documentation:
sys.intern(string)
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
Clarification:
As the documentation suggests, the sys.intern function is intended to be used for performance optimization.
The sys.intern function maintains a table of interned strings. When you attempt to intern a string, the function looks it up in the table and:
If the string does not exist (hasn't been interned yet) the function saves it in the table and returns it from the interned strings table.
>>> import sys
>>> a = sys.intern('why do pangolins dream of quiche')
>>> a
'why do pangolins dream of quiche'
In the above example, a holds the interned string. Even though it is not visible, the sys.intern function has saved the 'why do pangolins dream of quiche' string object in the interned strings table.
If the string exists (has been interned) the function returns it from the interned strings table.
>>> b = sys.intern('why do pangolins dream of quiche')
>>> b
'why do pangolins dream of quiche'
Even though it is not immediately visible, because the string 'why do pangolins dream of quiche' has been interned before, b now holds the same string object as a.
>>> b is a
True
If we create the same string without using intern, we end up with two different string objects that have the same value.
>>> c = 'why do pangolins dream of quiche'
>>> c is a
False
>>> c is b
False
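The lookup-then-store behaviour described above can be imitated with an ordinary dictionary. This is only a toy model to make the idea concrete: toy_intern and _table are hypothetical names, and CPython's real interned-strings table is implemented in C and differs in detail.

```python
# Hypothetical helper: a toy intern table, not CPython's implementation.
_table = {}

def toy_intern(s):
    # First sighting: store s and return it.
    # Later sightings: return the object stored earlier.
    return _table.setdefault(s, s)

# Build two equal strings at run time so they start as separate objects.
first = "".join(["pango", "lin"])
second = "".join(["pango", "lin"])

canon_first = toy_intern(first)
canon_second = toy_intern(second)
print(canon_first is canon_second)
```

Both calls hand back the object that was stored first, which is the property that lets callers swap a content comparison for an identity comparison.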
By using sys.intern you ensure that you never create two string objects that have the same value: when you request a second string object with the same value as an existing interned string, you receive a reference to the pre-existing object. This saves memory. Comparing interned strings is also very efficient, because it can be carried out by comparing the memory addresses of the two string objects instead of their contents.
How python interns strings in interactive interpreter vs jupyter notebook
There are many things in Python (and other languages) which may seem to work but go against the definition of how they're supposed to work. Object identity is one of those things. The purpose of the is keyword is never to compare the value of something, but to test whether two variables refer to the same underlying object. If two names refer to the same object then their values are necessarily equal, but the reverse is not true at all: two equal values need not be the same object. Comparing values with is will sometimes work (as you have found) without throwing an exception, but it is not a defined feature of Python. Such results are "implementation dependent" and are never guaranteed to be correct or even stable.
Apparently ipython does not submit chunks of code to the cpython binary in the same way it is submitted via the built-in REPL: https://github.com/satwikkansal/wtfpython/issues/100#issuecomment-549171287
I would assume this is to reduce the number of messages the front-end has to send to the kernel when sending multiple lines of code. I would expect the behavior of executing a .py file from the command line would better match the results you get from ipython in this regard.
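One way to see why the size of the submitted chunk matters is to compile the same two assignments once as a single unit and once as two separate units. This is a rough sketch under the assumption that CPython is used; run is a hypothetical helper, and the result for separately compiled units varies between versions, so no output is claimed for it.

```python
def run(source):
    """Hypothetical helper: compile and execute source, return its namespace."""
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace

# Both literals live in one code object, where CPython stores equal
# constants only once.
one_unit = run("a = 'python!'\nb = 'python!'")
same_unit = one_unit["a"] is one_unit["b"]

# Each literal is compiled on its own, as separate REPL inputs would be.
x = run("a = 'python!'")["a"]
y = run("a = 'python!'")["a"]
separate_units = x is y  # implementation- and version-dependent

print(same_unit, separate_units)
```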
Along these lines, it is sometimes possible to recover objects after deletion but before garbage collection, because the implementation of the id function returns the memory location of the object, which can be used with ctypes to construct a new PyObject. This is very much a way to introduce bugs and instability into your code. If for some reason id were switched out for a simple counter for each allocated item (perhaps to protect against leaking any information about the process memory space), this would immediately break.
What determines which strings are interned and when?
String interning is implementation specific and shouldn't be relied upon. Use equality testing (==) if you want to check whether two strings have the same value.
Why and where python interned strings when executing `a = 'python'` while the source code does not show that?
The string literal is turned into a string object by the compiler. The function that does that is PyString_DecodeEscape, at least in Py2.7; you haven't said what version you are working with.
Update:
The compiler interns some strings during compilation, but it is very confusing when it happens. The string needs to have only identifier-ok characters:
>>> a = 'python'
>>> b = 'python'
>>> a is b
True
>>> a = 'python!'
>>> b = 'python!'
>>> a is b
False
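The "identifier-ok characters" condition can be approximated with a small check. This is only a sketch of the rule as described here: looks_identifier_like is a hypothetical helper, the character set (ASCII letters, digits and underscore) is an assumption, and the actual criteria inside CPython have changed between versions.

```python
import re

def looks_identifier_like(s):
    """Hypothetical helper: True if s contains only ASCII letters,
    digits and underscores (the assumed "identifier-ok" characters)."""
    return re.fullmatch(r"[A-Za-z0-9_]*", s) is not None

print(looks_identifier_like("python"))
print(looks_identifier_like("python!"))
```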
Even in functions, string literals can be interned:
>>> def f():
...     return 'python'
...
>>> def g():
...     return 'python'
...
>>> f() is g()
True
But not if they have funny characters:
>>> def f():
...     return 'python!'
...
>>> def g():
...     return 'python!'
...
>>> f() is g()
False
And if I return a pair of strings, none of them are interned; I don't know why:
>>> def f():
...     return 'python', 'python!'
...
>>> def g():
...     return 'python', 'python!'
...
>>> a, b = f()
>>> c, d = g()
>>> a is c
False
>>> a == c
True
>>> b is d
False
>>> b == d
True
Moral of the story: interning is an implementation-dependent optimization that depends on many factors. It can be interesting to understand how it works, but never depend on it working any particular way.
Why isn't `str(1) is '1'` `True` in Python?
is compares object identity (references), not content. Also, str(1) is not a literal, so it is not automatically interned. '1' is interned because it is written directly as a string literal, whereas str(1) builds its string at run time. As you can see:
>>> a = '1'
>>> b = str(1)
>>> a
'1'
>>> b
'1'
>>> a is b
False
>>> id(a)
1603954028784
>>> id(b)
1604083776304
>>>
So the way to make them both interned is with sys.intern:
>>> import sys
>>> a = '1'
>>> b = str(1)
>>> a is b
False
>>> a is sys.intern(b)
True
>>>
As mentioned in the docs:
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
Note that in Python 2 intern() was a built-in function; in Python 3 it was moved into the sys module as sys.intern.
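As for when this is worth doing, a typical case is a program that builds many equal strings at run time and keeps them all alive. The following is a small sketch with made-up data (the values 1000 to 1002 are arbitrary); it counts how many distinct objects remain once every string has gone through sys.intern.

```python
import sys

# Nine strings built at run time, but only three distinct values.
raw = [str(1000 + i % 3) for i in range(9)]

# Keep references to the interned results, as the documentation advises.
interned = [sys.intern(s) for s in raw]

distinct_values = len(set(raw))
distinct_objects = len({id(s) for s in interned})
print(distinct_values, distinct_objects)
```

After interning, equal values share one object each, so the number of distinct objects matches the number of distinct values.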