Does Python Intern Strings

Does Python intern strings?

This is called interning, and yes, Python does this to some extent for shorter strings created as string literals. See "About the changing id of an immutable string" for some discussion.

Interning is runtime dependent; there is no standard for it. Interning is always a trade-off between memory use and the cost of checking whether you are creating a string that already exists. If you want to force the issue, there is the sys.intern() function, whose documentation describes some of the interning Python does for you automatically:

Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Note that in Python 2 the intern() function was a built-in; no import was necessary.

Python string interning

This is implementation-specific, but your interpreter is probably interning compile-time constants but not the results of run-time expressions.

In what follows CPython 3.9.0+ is used.

In the second example, the expression "strin"+"g" is evaluated at compile time, and is replaced with "string". This makes the first two examples behave the same.

If we examine the bytecodes, we'll see that they are exactly the same:

  # s1 = "string"
  1           0 LOAD_CONST               0 ('string')
              2 STORE_NAME               0 (s1)

  # s2 = "strin" + "g"
  2           4 LOAD_CONST               0 ('string')
              6 STORE_NAME               1 (s2)

This bytecode was obtained with the following code, which prints a few more lines after the ones shown above:

import dis

source = 's1 = "string"\ns2 = "strin" + "g"'
code = compile(source, '', 'exec')
dis.dis(code)  # dis.dis() prints the listing itself and returns None
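Another way to confirm the folding, without reading bytecode, is to look at the constants stored on the compiled code object. The following is a minimal sketch that assumes a CPython version that folds constant expressions at compile time (such as the 3.9 used here); the '<demo>' file name is only a placeholder.

```python
# Compile the constant concatenation and inspect the code object's constants.
source = 's2 = "strin" + "g"'
code = compile(source, "<demo>", "exec")

# On CPython with constant folding, the already-joined string is stored as a
# constant. Other interpreters are free to behave differently.
folded = "string" in code.co_consts
print(code.co_consts)
print("folded at compile time:", folded)
```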

The third example involves a run-time concatenation, the result of which is not automatically interned:

  # s3a = "strin"
  3           8 LOAD_CONST               1 ('strin')
             10 STORE_NAME               2 (s3a)

  # s3 = s3a + "g"
  4          12 LOAD_NAME                2 (s3a)
             14 LOAD_CONST               2 ('g')
             16 BINARY_ADD
             18 STORE_NAME               3 (s3)
             20 LOAD_CONST               3 (None)
             22 RETURN_VALUE

This bytecode was obtained with the following code. It prints a few more lines before the ones shown above, and those lines are exactly the same as the first block of bytecode:

import dis

source = (
    's1 = "string"\n'
    's2 = "strin" + "g"\n'
    's3a = "strin"\n'
    's3 = s3a + "g"'
)
code = compile(source, '', 'exec')
dis.dis(code)  # dis.dis() prints the listing itself and returns None
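The difference between the folded and the run-time case can also be checked programmatically. This sketch uses a hypothetical helper, uses_runtime_binary_op, that looks for any BINARY_* opcode in the compiled code. The opcode is called BINARY_ADD in CPython 3.9 and BINARY_OP from 3.11 on, so the helper matches on the prefix only. It assumes CPython-style bytecode.

```python
import dis


def uses_runtime_binary_op(source):
    """Return True if the compiled source still contains a BINARY_* opcode."""
    code = compile(source, "<demo>", "exec")
    return any(
        ins.opname.startswith("BINARY") for ins in dis.get_instructions(code)
    )


# Constant expression: expected to be folded away by the compiler.
print(uses_runtime_binary_op('s2 = "strin" + "g"'))
# Concatenation involving a name: has to happen at run time.
print(uses_runtime_binary_op('s3a = "strin"\ns3 = s3a + "g"'))
```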

If you were to manually sys.intern() the result of the third expression, you'd get the same object as before:

>>> import sys
>>> s3a = "strin"
>>> s3 = s3a + "g"
>>> s3 is "string"
False
>>> sys.intern(s3) is "string"
True

Also, Python 3.8 and later print a warning for the two is comparisons above, because they compare against a literal:

SyntaxWarning: "is" with a literal. Did you mean "=="?
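To run the same experiment without triggering the warning, the literal can be bound to a name first. The sketch below also passes both sides through sys.intern(), so the identity check does not depend on whether the interpreter interned the literal on its own. The variable names are only illustrative.

```python
import sys

literal = "string"      # bound to a name so `is` is not applied to a literal
prefix = "strin"
built = prefix + "g"    # concatenated at run time

print(built == literal)                          # True: same value
print(sys.intern(built) is sys.intern(literal))  # True: one canonical object
```

Whether built is literal holds before interning is left out on purpose, since that result is the implementation-specific part.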

What does sys.intern() do and when should it be used?

From the Python 3 documentation:

sys.intern(string)

Enter string in the table of “interned” strings and return the
interned string – which is string itself or a copy. Interning strings
is useful to gain a little performance on dictionary lookup – if the
keys in a dictionary are interned, and the lookup key is interned, the
key comparisons (after hashing) can be done by a pointer compare
instead of a string compare. Normally, the names used in Python
programs are automatically interned, and the dictionaries used to hold
module, class or instance attributes have interned keys.

Interned strings are not immortal; you must keep a reference to the
return value of intern() around to benefit from it.

Clarification:

As the documentation suggests, the sys.intern function is intended to be used for performance optimization.

The sys.intern function maintains a table of interned strings. When you attempt to intern a string, the function looks it up in the table and:

  1. If the string does not exist (hasn't been interned yet), the function saves
    it in the table and returns it from the interned strings table.

    >>> import sys
    >>> a = sys.intern('why do pangolins dream of quiche')
    >>> a
    'why do pangolins dream of quiche'

    In the above example, a holds the interned string. Even though it is not visible, the sys.intern function has saved the 'why do pangolins dream of quiche' string object in the interned strings table.

  2. If the string exists (has been interned) the function returns it from the
    interned strings table.

    >>> b = sys.intern('why do pangolins dream of quiche')
    >>> b
    'why do pangolins dream of quiche'

    Even though it is not immediately visible, because the string 'why do pangolins dream of quiche' has been interned before, b now holds the same string object as a.

    >>> b is a
    True

    If we create the same string without using intern, we can end up with two different string objects that have the same value, as happens here in the interactive interpreter.

    >>> c = 'why do pangolins dream of quiche'
    >>> c is a
    False
    >>> c is b
    False

By using sys.intern consistently, you ensure that you never keep two string objects with the same value: when you intern a second string with the same value as an already interned one, you receive a reference to the pre-existing string object, and the duplicate can be discarded. This saves memory. Comparing interned strings is also very efficient, because two interned strings are equal exactly when they are the same object, so comparing their memory addresses is enough and their contents need not be examined.
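As a rough illustration of the memory argument, the following sketch builds many equal strings at run time and counts how many distinct objects remain after interning. The "status:ok" value and the variable names are made up for the example, and the count for the raw list depends on the interpreter.

```python
import sys

# Build 1000 equal strings at run time, so no compile-time sharing applies.
raw = ["".join(["status", ":", "ok"]) for _ in range(1000)]
interned = [sys.intern(s) for s in raw]

# Count distinct objects by identity in each list.
distinct_raw = len({id(s) for s in raw})
distinct_interned = len({id(s) for s in interned})
print(distinct_raw, distinct_interned)
```

After interning, every element refers to one shared object. Memory is only reclaimed once the original duplicates (here, the raw list) are no longer referenced.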

How Python interns strings in the interactive interpreter vs Jupyter notebook

There are many things in Python (and other languages) that may seem to work but go against the definition of how they are supposed to work. Object identity is one of those things. The purpose of the is keyword is never to compare values, but to test whether two variables refer to the same underlying object. It makes sense that two names referring to the same object also have equal values, but the reverse is not true at all: two equal values need not be the same object. Comparing strings with is will sometimes work (as you have found) without throwing an exception, but that is not a defined feature of Python. Such behaviors are "implementation dependent" and are never guaranteed to give correct or even stable results.
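The distinction is easiest to see with objects that are never shared implicitly. A small sketch with lists, where the language itself guarantees the outcome:

```python
first = [1, 2, 3]
second = [1, 2, 3]
alias = first

print(first == second)   # True: equal values
print(first is second)   # False: two separate list objects
print(first is alias)    # True: both names refer to one object
```

With strings, the == result is just as reliable, while the is result depends on what the interpreter chose to share.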

Apparently IPython does not submit chunks of code to the CPython binary in the same way the built-in REPL does: https://github.com/satwikkansal/wtfpython/issues/100#issuecomment-549171287

I would assume this is to reduce the number of messages the front-end has to send to the kernel when sending multiple lines of code. I would expect executing a .py file from the command line to match the results you get from IPython more closely in this regard.
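The effect of how code is chunked can be imitated directly with compile() and exec(). The sketch below is an assumption about the mechanism rather than a description of what IPython really does: the hypothetical helper same_object executes each chunk as its own compilation unit and reports whether a and b ended up as the same object. No expected output is given, because the result is exactly the implementation-dependent part.

```python
def same_object(*chunks):
    """Exec each chunk as its own compilation unit; report whether a is b."""
    namespace = {}
    for index, chunk in enumerate(chunks):
        exec(compile(chunk, f"<chunk-{index}>", "exec"), namespace)
    return namespace["a"] is namespace["b"]


# Both assignments compiled together, as when a whole cell or file is submitted.
one_unit = same_object("a = 'python!'\nb = 'python!'")
# Each assignment compiled on its own, as when lines are entered one by one.
two_units = same_object("a = 'python!'", "b = 'python!'")

print("compiled together:", one_unit)
print("compiled separately:", two_units)
```

On CPython, equal constants inside one compilation unit are typically merged, which is one plausible reason the two front-ends can disagree.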

Along these lines, it is sometimes possible to recover an object after deletion but before garbage collection, because the CPython implementation of the id function returns the memory location of the object, which can be used with ctypes to construct a new PyObject. This is very much a way to introduce bugs and instability into your code. If for some reason id were switched to a simple counter for each allocated item (perhaps to avoid leaking any information about the process memory space), this would immediately break.

What determines which strings are interned and when?

String interning is implementation specific and shouldn't be relied upon. Use equality testing (==) if you want to check whether two strings have the same value.

Why and where does Python intern strings when executing `a = 'python'`, when the source code does not show that?

The string literal is turned into a string object by the compiler. The function that does that is PyString_DecodeEscape, at least in Python 2.7; you haven't said which version you are working with.

Update:

The compiler interns some strings during compilation, but the rules for when it happens are confusing. The string needs to contain only characters that are valid in identifiers:

>>> a = 'python'
>>> b = 'python'
>>> a is b
True
>>> a = 'python!'
>>> b = 'python!'
>>> a is b
False

Even in functions, string literals can be interned:

>>> def f():
...     return 'python'
...
>>> def g():
...     return 'python'
...
>>> f() is g()
True

But not if they have funny characters:

>>> def f():
...     return 'python!'
...
>>> def g():
...     return 'python!'
...
>>> f() is g()
False

And if I return a pair of strings, neither of them is interned; I don't know why:

>>> def f():
...     return 'python', 'python!'
...
>>> def g():
...     return 'python', 'python!'
...
>>> a, b = f()
>>> c, d = g()
>>> a is c
False
>>> a == c
True
>>> b is d
False
>>> b == d
True

Moral of the story: interning is an implementation-dependent optimization that depends on many factors. It can be interesting to understand how it works, but never depend on it working any particular way.
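If you are curious how a particular interpreter treats a given literal, you can probe it rather than guess. This sketch uses a hypothetical helper, literal_is_shared, that compiles the same literal twice as separate units and checks whether both code objects hold one shared string object. The printed results are deliberately not annotated, since they may differ between versions and implementations.

```python
def literal_is_shared(text):
    """Compile the same literal twice and check whether both code objects
    ended up holding one shared string object."""
    first = compile(repr(text), "<first>", "eval")
    second = compile(repr(text), "<second>", "eval")
    # Pick the constant that carries the literal's value from each code object.
    const_a = next(c for c in first.co_consts if c == text)
    const_b = next(c for c in second.co_consts if c == text)
    return const_a is const_b


for sample in ("python", "python!"):
    print(repr(sample), literal_is_shared(sample))
```

Treat the output as an observation about the interpreter in front of you, not as a rule to build on.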

Why isn't `str(1) is '1'` `True` in Python?

is checks whether two references point to the same object, not whether their contents are equal. Also, str(1) is not a literal, so it is not interned.

'1', on the other hand, is interned because it is written directly as a string literal, whereas str(1) builds its string at run time. As you can see:

>>> a = '1'
>>> b = str(1)
>>> a
'1'
>>> b
'1'
>>> a is b
False
>>> id(a)
1603954028784
>>> id(b)
1604083776304
>>>

So the way to make them both interned is with sys.intern:

>>> import sys
>>> a = '1'
>>> b = str(1)
>>> a is b
False
>>> a is sys.intern(b)
True
>>>

As the sys.intern documentation quoted earlier notes, interning lets dictionary key comparisons (after hashing) be done by a pointer compare instead of a string compare, and interned strings are not immortal: you must keep a reference to the return value of intern() around to benefit from it.

Note that in Python 2 intern() was a built-in function (not a keyword); in Python 3 it was moved into the sys module as sys.intern().
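For code that still has to run on both major versions, a common pattern is to import the name where it exists and fall back to the built-in otherwise. A minimal sketch; the "user_id" key is only an example value:

```python
try:
    from sys import intern   # Python 3: intern lives in the sys module
except ImportError:
    pass                     # Python 2: intern is already a built-in

# Build the key at run time, then intern it.
key = intern("".join(["user", "_", "id"]))
print(key)
```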


