Python string interning
This is implementation-specific, but your interpreter is probably interning compile-time constants but not the results of run-time expressions.
In what follows CPython 3.9.0+ is used.
In the second example, the expression "strin"+"g"
is evaluated at compile time, and is replaced with "string"
. This makes the first two examples behave the same.
If we examine the bytecodes, we'll see that they are exactly the same:
# s1 = "string"
1 0 LOAD_CONST 0 ('string')
2 STORE_NAME 0 (s1)
# s2 = "strin" + "g"
2 4 LOAD_CONST 0 ('string')
6 STORE_NAME 1 (s2)
This bytecode was obtained with (which prints a few more lines after the above):
import dis
source = 's1 = "string"\ns2 = "strin" + "g"'
code = compile(source, '', 'exec')
print(dis.dis(code))
The third example involves a run-time concatenation, the result of which is not automatically interned:
# s3a = "strin"
3 8 LOAD_CONST 1 ('strin')
10 STORE_NAME 2 (s3a)
# s3 = s3a + "g"
4 12 LOAD_NAME 2 (s3a)
14 LOAD_CONST 2 ('g')
16 BINARY_ADD
18 STORE_NAME 3 (s3)
20 LOAD_CONST 3 (None)
22 RETURN_VALUE
This bytecode was obtained with (which prints a few more lines before the above, and those lines are exactly as in the first block of bytecodes given above):
import dis
source = (
's1 = "string"\n'
's2 = "strin" + "g"\n'
's3a = "strin"\n'
's3 = s3a + "g"')
code = compile(source, '', 'exec')
print(dis.dis(code))
If you were to manually sys.intern()
the result of the third expression, you'd get the same object as before:
>>> import sys
>>> s3a = "strin"
>>> s3 = s3a + "g"
>>> s3 is "string"
False
>>> sys.intern(s3) is "string"
True
Also, Python 3.9 prints a warning for the last two statements above:
SyntaxWarning: "is" with a literal. Did you mean "=="?
Simplest way to defeat string interning in Python
You could also subclass str
.
>>> class UninternedStr(str):
... pass
...
>>> s = UninternedStr('a string')
>>> s1 = UninternedStr('a string')
>>> s is s1
False
Does Python intern strings?
This is called interning, and yes, Python does do this to some extent, for shorter strings created as string literals. See About the changing id of an immutable string for some discussion.
Interning is runtime dependent, there is no standard for it. Interning is always a trade-off between memory use and the cost of checking if you are creating the same string. There is the sys.intern()
function to force the issue if you are so inclined, which documents some of the interning Python does for you automatically:
Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
Note that Python 2 the intern()
function used to be a built-in, no import necessary.
manually implementing python string interning
I ran into this when I was converting some records from one format to another. The process involved many steps of string transformation and even after using this it still required 14GB of memory to represent all the strings and dictionaries.
What I found was that sys.intern
didn't work on Unicode strings and Unicode strings weren't weak referencable so I couldn't use a weak key dictionary. So I used a regular dictionary and wrote:
_objs = {}
def intern(x):
return _objs.setdefault(x, x)
This is better than sys.intern
as it works on any hashable type. And can be cleared. If you pre-populate the dictionary, you can use .get(x, x)
instead if you would prefer the dictionary not to accumulate values.
If needed, one could even use a dictionary with a limited size and an eviction policy.
What are the rules for cpython's string interning?
You think there are rules?
The only rule for interning is that the return value of intern
is interned. Everything else is up to the whims of whoever decided some piece of code should or shouldn't do interning. For example, "left"
gets interned by PyCodeNew
:
/* Intern selected string constants */
for (i = PyTuple_GET_SIZE(consts); --i >= 0; ) {
PyObject *v = PyTuple_GetItem(consts, i);
if (!all_name_chars(v))
continue;
PyUnicode_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}
The "rule" here is that a string object in the co_consts
of a Python code object gets interned if it consists purely of ASCII characters that are legal in a Python identifier. "left"
gets interned, but "as,df"
wouldn't be, and "1234"
would be interned even though an identifier can't start with a digit. While identifiers can contain non-ASCII characters, such characters are still rejected by this check. Actual identifiers don't ever pass through this code; they get unconditionally interned a few lines up, ASCII or not. This code is subject to change, and there's plenty of other code that does interning or interning-like things.
Asking us for the "rules" for string interning is like asking a meteorologist what the rules are for whether it rains on your wedding. We can tell you quite a lot about how it works, but it won't be much use to you, and you'll always get surprises.
Python not interning strings when in interactive mode?
This is caused by string interning. See this question for another example.
In your example, CPython interns the string constants in the module but doesn't in the REPL.
String interning in dictionary keys
Yes, strings and integers are interned, but only small strings and integers.
>>> list1 = ['a', 'b', 'c', 'longer and more complicated string']
>>> list2 = ['a', 'b', 'c', 'longer and more complicated string']
>>> list1[0] is list2[0]
True
>>> list1[1] is list2[1]
True
>>> list1[2] is list2[2]
True
>>> list1[3] is list2[3]
False
Two dicts with the same keys are allowed to have completely different values, however - the key-value mapping is tied to the dict's instance (and also to the hashes of the keys, moreso than the keys themselves), not to the key's instance, and dicts are not interned at all.
>>> dict1 = {'a': 1, 'b': 2}
>>> dict2 = {'a': 3, 'b': 4}
>>> for (key1, key2) in zip(dict1.keys(), dict2.keys()):
... print(key1 is key2, end="; ")
... print(dict1[key1] is dict2[key2])
...
True; False
True; False
If you wish to save memory by having only one key-value mapping, have you considered making the dictionary values be tuples? e.g.
# instead of
dict1[key] -> value1
dict2[key] -> value2
# do
dictx[key] -> (value1, value2)
Attempting to replicate Python's string interning functionality for non-strings
Yes, implementing a __new__
method that returns a cached object is the appropriate way of creating a limited number of instances. If you don't expect to be creating a lot of instances, you could just implement __eq__
and compare by value rather than identity, but it doesn't hurt to do it this way instead.
Note that an immutable object should generally do all its initialization in __new__
, rather than __init__
, since the latter is called after the object has been created. Further, __init__
will be called on any instance of the class that is returned from __new__
, so with when you're caching, it will be called again each time a cached object is returned.
Also, the first argument to __new__
is the class object not an instance, so you probably should name it cls
rather than self
(you can use self
instead of instance
later in the method if you want though!).
Related Topics
How to Sort a List of Objects Based on an Attribute of the Objects
How to Print a Number Using Commas as Thousands Separators
How to Properly Determine the Current Script Directory
Should I Put #! (Shebang) in Python Scripts, and What Form Should It Take
How to Prevent Tensorflow from Allocating the Totality of a Gpu Memory
Python Function Global Variables
List Comprehension Vs. Lambda + Filter
Random String Generation With Upper Case Letters and Digits
How to Create a Daemon in Python
Convert List of Dictionaries to a Pandas Dataframe
Wait Until Page Is Loaded With Selenium Webdriver For Python
How to Find the Duplicates in a List and Create Another List With Them
What's the U Prefix in a Python String
Finding Local Ip Addresses Using Python'S Stdlib
Python Multiprocessing Picklingerror: Can't Pickle ≪Type 'Function'≫
How to Pass a String into Subprocess.Popen (Using the Stdin Argument)
How to Resize an Image Using Pil and Maintain Its Aspect Ratio