Python String Interning

Python string interning

This is implementation-specific, but your interpreter is probably interning compile-time constants and not the results of run-time expressions.

In what follows CPython 3.9.0+ is used.

In the second example, the expression "strin"+"g" is evaluated at compile time, and is replaced with "string". This makes the first two examples behave the same.

If we examine the bytecodes, we'll see that they are exactly the same:

# s1 = "string"
  1           0 LOAD_CONST               0 ('string')
              2 STORE_NAME               0 (s1)

# s2 = "strin" + "g"
  2           4 LOAD_CONST               0 ('string')
              6 STORE_NAME               1 (s2)

This bytecode was obtained with (which prints a few more lines after the above):

import dis

source = 's1 = "string"\ns2 = "strin" + "g"'
code = compile(source, '', 'exec')
dis.dis(code)  # dis.dis() prints the listing itself and returns None

The third example involves a run-time concatenation, the result of which is not automatically interned:

# s3a = "strin"
  3           8 LOAD_CONST               1 ('strin')
             10 STORE_NAME               2 (s3a)

# s3 = s3a + "g"
  4          12 LOAD_NAME                2 (s3a)
             14 LOAD_CONST               2 ('g')
             16 BINARY_ADD
             18 STORE_NAME               3 (s3)
             20 LOAD_CONST               3 (None)
             22 RETURN_VALUE

This bytecode was obtained with (which prints a few more lines before the above, and those lines are exactly as in the first block of bytecodes given above):

import dis

source = (
    's1 = "string"\n'
    's2 = "strin" + "g"\n'
    's3a = "strin"\n'
    's3 = s3a + "g"')
code = compile(source, '', 'exec')
dis.dis(code)  # dis.dis() prints the listing itself and returns None

If you were to manually sys.intern() the result of the third expression, you'd get the same object as before:

>>> import sys
>>> s3a = "strin"
>>> s3 = s3a + "g"
>>> s3 is "string"
False
>>> sys.intern(s3) is "string"
True

Also, Python 3.8 and later print a warning for each of the two "is" comparisons above:

SyntaxWarning: "is" with a literal. Did you mean "=="?
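The session above can also be written as a script that compares names instead of literals, which avoids the warning. This is only a minimal sketch: the variable names are mine, and the first two results are CPython implementation details that may differ on other interpreters or versions. Only the sys.intern() comparison is something I would rely on.

```python
import sys

s1 = "string"        # compile-time constant
s2 = "strin" + "g"   # folded by the compiler into the constant "string"
prefix = "strin"
s3 = prefix + "g"    # concatenated at run time, not automatically interned

# Implementation details: identity of constants and run-time results.
same_constant = s1 is s2
runtime_is_same = s1 is s3

# sys.intern() always returns the single canonical object for equal strings.
interned_is_same = sys.intern(s3) is sys.intern(s1)

print(same_constant, runtime_is_same, interned_is_same)
```

On the CPython version used above I would expect the first value to be True and the second False, matching the bytecode shown earlier.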

Simplest way to defeat string interning in Python

You could also subclass str.

>>> class UninternedStr(str):
...     pass
...
>>> s = UninternedStr('a string')
>>> s1 = UninternedStr('a string')
>>> s is s1
False

Does Python intern strings?

This is called interning, and yes, Python does do this to some extent, for shorter strings created as string literals. See About the changing id of an immutable string for some discussion.

Interning is runtime dependent; there is no standard for it. Interning is always a trade-off between memory use and the cost of checking whether you are creating the same string. There is the sys.intern() function to force the issue if you are so inclined, and its documentation describes some of the interning Python does for you automatically:

Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Note that in Python 2 the intern() function was a built-in, no import necessary.

manually implementing python string interning

I ran into this when I was converting some records from one format to another. The process involved many steps of string transformation and even after using this it still required 14GB of memory to represent all the strings and dictionaries.

What I found was that sys.intern didn't work on Unicode strings, and Unicode strings weren't weakly referenceable, so I couldn't use a weak key dictionary. So I used a regular dictionary and wrote:

_objs = {}
def intern(x):
    return _objs.setdefault(x, x)

This is better than sys.intern as it works on any hashable type and can be cleared. If you pre-populate the dictionary, you can use .get(x, x) instead if you would prefer the dictionary not to accumulate values.

If needed, one could even use a dictionary with a limited size and an eviction policy.
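A size-limited variant might look like the following. This is only a sketch of one possible eviction policy (least recently used, built on collections.OrderedDict). The class name BoundedInterner and the max_size parameter are made up for illustration and are not part of the original code.

```python
from collections import OrderedDict


class BoundedInterner:
    """Hypothetical intern table that keeps at most max_size canonical objects."""

    def __init__(self, max_size=1024):
        self._objs = OrderedDict()
        self._max_size = max_size

    def intern(self, x):
        try:
            canonical = self._objs[x]
        except KeyError:
            # First time we see this value: it becomes the canonical object.
            self._objs[x] = canonical = x
            if len(self._objs) > self._max_size:
                # Evict the least recently used entry (the oldest one).
                self._objs.popitem(last=False)
        else:
            # Mark the existing entry as recently used.
            self._objs.move_to_end(x)
        return canonical

    def clear(self):
        self._objs.clear()


interner = BoundedInterner(max_size=2)
a = "".join(["ab", "c"])
b = "".join(["a", "bc"])
first = interner.intern(a)   # a becomes the canonical "abc"
second = interner.intern(b)  # returns a, because an equal value is cached
```

Note that once an entry is evicted, a later equal value becomes a new canonical object, so older references are no longer shared with newer ones. Whether that matters depends on how long the strings live.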

What are the rules for cpython's string interning?

You think there are rules?

The only rule for interning is that the return value of intern is interned. Everything else is up to the whims of whoever decided some piece of code should or shouldn't do interning. For example, "left" gets interned by PyCode_New:

/* Intern selected string constants */
for (i = PyTuple_GET_SIZE(consts); --i >= 0; ) {
    PyObject *v = PyTuple_GetItem(consts, i);
    if (!all_name_chars(v))
        continue;
    PyUnicode_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}

The "rule" here is that a string object in the co_consts of a Python code object gets interned if it consists purely of ASCII characters that are legal in a Python identifier. "left" gets interned, but "as,df" wouldn't be, and "1234" would be interned even though an identifier can't start with a digit. While identifiers can contain non-ASCII characters, such characters are still rejected by this check. Actual identifiers don't ever pass through this code; they get unconditionally interned a few lines up, ASCII or not. This code is subject to change, and there's plenty of other code that does interning or interning-like things.
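To make the check concrete, here is a rough Python rendering of the test described above. It is a sketch that mirrors the description (ASCII letters, digits and underscore only), not CPython's actual code, and the function name is simply borrowed from the C snippet.

```python
import string

# Characters the described check accepts: ASCII letters, digits, underscore.
NAME_CHARS = frozenset(string.ascii_letters + string.digits + "_")


def all_name_chars(s):
    """Return True if every character of s is an ASCII identifier character."""
    return all(ch in NAME_CHARS for ch in s)


results = {text: all_name_chars(text) for text in ("left", "as,df", "1234", "naïve")}
print(results)
```

This only predicts what that one loop would do with a constant. As noted above, other code paths intern strings too, so a False here does not mean a string is never interned.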

Asking us for the "rules" for string interning is like asking a meteorologist what the rules are for whether it rains on your wedding. We can tell you quite a lot about how it works, but it won't be much use to you, and you'll always get surprises.

Python not interning strings when in interactive mode?

This is caused by string interning. See this question for another example.

In your example, CPython interns the string constants in the module but doesn't in the REPL.
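One way to get a feel for this is to compile the same two assignments once as a single unit (like a module) and once as two separate units (roughly how the REPL treats each statement). This is a sketch only: the string literal and the pseudo file names are placeholders, and the result for separately compiled units is an implementation detail that can vary between CPython versions and builds.

```python
# Compile both assignments together, as they would be in one module.
module_like = {}
exec(compile('a = "hello, world!"\nb = "hello, world!"', "<module-like>", "exec"),
     module_like)
same_unit = module_like["a"] is module_like["b"]

# Compile each assignment on its own, roughly as the REPL does.
first_line = {}
second_line = {}
exec(compile('a = "hello, world!"', "<repl-line-1>", "exec"), first_line)
exec(compile('b = "hello, world!"', "<repl-line-2>", "exec"), second_line)
separate_units = first_line["a"] is second_line["b"]

print(same_unit, separate_units)
```

Within one compiled unit CPython stores equal constants once, so the two names end up bound to the same object. Across separately compiled units that sharing only happens if the string is interned.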

String interning in dictionary keys

Yes, strings and integers are interned, but only small integers and short, identifier-like string constants, and this is an implementation detail rather than a guarantee.

>>> list1 = ['a', 'b', 'c', 'longer and more complicated string']
>>> list2 = ['a', 'b', 'c', 'longer and more complicated string']
>>> list1[0] is list2[0]
True
>>> list1[1] is list2[1]
True
>>> list1[2] is list2[2]
True
>>> list1[3] is list2[3]
False

Two dicts with the same keys are allowed to have completely different values, however: the key-value mapping is tied to the dict's instance (and also to the hashes of the keys, more so than the keys themselves), not to the key's instance, and dicts are not interned at all.

>>> dict1 = {'a': 1, 'b': 2}
>>> dict2 = {'a': 3, 'b': 4}
>>> for (key1, key2) in zip(dict1.keys(), dict2.keys()):
...     print(key1 is key2, end="; ")
...     print(dict1[key1] is dict2[key2])
...
True; False
True; False

If you wish to save memory by having only one key-value mapping, have you considered making the dictionary values be tuples? e.g.

# instead of
dict1[key] -> value1
dict2[key] -> value2

# do
dictx[key] -> (value1, value2)
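As a concrete sketch of that suggestion, reusing dict1 and dict2 from the example above and assuming both dicts have exactly the same keys:

```python
dict1 = {'a': 1, 'b': 2}
dict2 = {'a': 3, 'b': 4}

# One mapping whose values are (value1, value2) tuples.
dictx = {key: (dict1[key], dict2[key]) for key in dict1}

value1, value2 = dictx['a']
print(dictx)
```

If the key sets can differ, the comprehension would need a default for missing keys. That case is not covered here.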

Attempting to replicate Python's string interning functionality for non-strings

Yes, implementing a __new__ method that returns a cached object is the appropriate way of creating a limited number of instances. If you don't expect to be creating a lot of instances, you could just implement __eq__ and compare by value rather than identity, but it doesn't hurt to do it this way instead.

Note that an immutable object should generally do all its initialization in __new__, rather than __init__, since the latter is called after the object has been created. Further, __init__ will be called on any instance of the class that is returned from __new__, so when you're caching, it will be called again each time a cached object is returned.

Also, the first argument to __new__ is the class object, not an instance, so you probably should name it cls rather than self (you can use self instead of instance later in the method if you want though!).
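Putting those points together, a caching __new__ could be sketched like this. The class name Interned and its value attribute are hypothetical, and the sketch assumes the constructor argument is hashable.

```python
class Interned:
    """Hypothetical immutable value type with one instance per distinct value."""

    _instances = {}

    def __new__(cls, value):
        try:
            # Return the cached instance if one exists for this value.
            return cls._instances[value]
        except KeyError:
            instance = super().__new__(cls)
            # All initialization happens here; there is deliberately no
            # __init__, because it would run again on every cache hit.
            instance._value = value
            cls._instances[value] = instance
            return instance

    @property
    def value(self):
        return self._value


p = Interned("left")
q = Interned("left")
r = Interned("right")
```

With this in place, equal arguments yield the very same object, so comparing by identity and comparing by value give the same answer. The cache is shared by subclasses and never shrinks, which may or may not suit a given use.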


