When Does Python Allocate New Memory for Identical Strings

When does Python allocate new memory for identical strings?

Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings): making a new one, or finding an existing equal one and using one more reference to it, are both fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: use one more reference to a suitable existing object when locating such an object is cheap and easy, and just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could take a long time of searching.

So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because it's pretty fast and easy to avoid duplicates when building that function's constants pool. Doing so across separate functions, however, could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or do it only in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
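As a quick CPython-oriented sketch (the function name is illustrative), two occurrences of one literal inside a single function come from the same constants-pool entry, so they are the very same object:

```python
def same_literals():
    # Both occurrences of "spam" below are loaded from the same entry
    # in this function's constants pool, so in CPython `x is y` holds.
    x = "spam"
    y = "spam"
    return x is y
```

Across two separately compiled functions or modules, the same `is` check may or may not hold, depending on the implementation's heuristics.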

I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
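A hand-rolled "constants pool" along those lines can be as simple as a dict keyed by the value itself; all names below are illustrative, not a standard API:

```python
# A minimal "constants pool" for immutable, hashable values (tuples of
# immutable items, big integers, ...), analogous to intern() for strings.
_pool = {}

def pooled(value):
    """Return the canonical shared instance equal to `value`."""
    # setdefault stores `value` on first sight, and returns the stored
    # (shared) instance on every later call with an equal value.
    return _pool.setdefault(value, value)

t1 = pooled((1, 2, "three"))
t2 = pooled((1, 2, "three"))
# t1 and t2 now reference the same tuple object
```

Whether this pays off depends on how large the objects are and how often duplicates occur, exactly the application-level considerations mentioned above.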

How does Python determine if two strings are identical

From the link you posted:

Avoiding large .pyc files

So why does 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa' not evaluate to True? Do you remember the .pyc files you encounter in all your packages? Well, Python bytecode is stored in these files. What would happen if someone wrote something like ['foo!'] * 10**9? The resulting .pyc file would be huge! To avoid this phenomenon, sequences generated through peephole optimization are discarded if their length is greater than 20.

If you have the string "HelloHelloHelloHelloHello", Python will necessarily have to store it as it is (asking the interpreter to detect repeating patterns in a string to save space would be too much). However, for string values that can be computed at parsing time, such as "Hello" * 5, Python evaluates those as part of this so-called "peephole optimization", which decides whether or not it is worth precomputing the string. Since len("Hello" * 5) > 20, the interpreter leaves the expression as it is to avoid storing too many long strings.
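You can observe the folding (or its absence) by inspecting a function's constants pool. A sketch, assuming CPython 3.7 or later, where the limit was raised to 4096 so "Hello" * 5 is precomputed; on older versions the constant would be absent:

```python
def folded():
    return "Hello" * 5  # constant expression, candidate for folding

# On CPython 3.7+, the precomputed string appears among the code
# object's constants instead of being multiplied at call time.
consts = folded.__code__.co_consts
```

Running `dis.dis(folded)` would likewise show a single LOAD_CONST of the full 25-character string when folding has happened.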

EDIT:

As indicated in this question, you can check this on the source code in peephole.c, function fold_binops_on_constants, near the end you will see:

// ...
} else if (size > 20) {
    Py_DECREF(newconst);
    return -1;
}

EDIT 2:

Actually, that optimization code was moved to the AST optimizer in Python 3.7, so now you have to look into ast_opt.c, function fold_binop, which now calls safe_multiply, which checks that the string is no longer than MAX_STR_SIZE, newly defined as 4096. So the limit has been significantly bumped up for newer releases.

How do I make Python make all identical strings use the same memory?

Just use intern(), which tells Python to store the string in, and fetch it from, its table of interned strings:

a = [intern("foo".replace("o","1")) for a in range(0,1000000)]

This also results in around 18 MB, the same as in the first example.

Note that on Python 3, intern() has been moved to sys.intern(). Thanks @Abe Karplus.
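For example, on Python 3 (where the function lives in the sys module):

```python
import sys

# .replace() builds a fresh "f11" string at runtime on each call;
# sys.intern() maps both results onto one shared object.
a = sys.intern("foo".replace("o", "1"))
b = sys.intern("foo".replace("o", "1"))
# a and b are now the very same object, not merely equal
```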

Why Python allocates new id to list, tuples, dict even though having same values?

Because otherwise this would happen:

x3 = [1,2,3]
y3 = [1,2,3]

x3[0] = "foo"
x3[0] == y3[0] # would be True if x3 and y3 shared one object -- it is NOT

In fact,

x3[0] != y3[0]

which is a Good Thing™. If x3 and y3 were identical (the same object), changing one would change the other, too. That's generally not expected.

See also when does Python allocate new memory for identical strings? for why the behaviour is different for strings.

Also, use == if you want to compare values.
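A short sketch of the difference between the two comparisons:

```python
x3 = [1, 2, 3]
y3 = [1, 2, 3]

assert x3 == y3        # equal values
assert x3 is not y3    # but distinct objects, with distinct ids

z3 = x3                # a second reference to the SAME list
z3[0] = "foo"
assert x3[0] == "foo"  # mutating via one name is visible via the other
assert y3[0] == 1      # y3, a separate object, is untouched
```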

What is the total size of two variables that have a shared object reference?

Neither, although 1x is closer. The total size is the sum of the target object (your string) plus two references (one address reference -- typically one word of storage -- each). You have a, b, and the string to which they refer.
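A rough way to see this with sys.getsizeof, which reports only the object's own size, not the references (sizes below are approximate and build-dependent):

```python
import sys

s = "an example string shared by two names"
a = b = s  # one string object, two references to it

# getsizeof measures the single string object; each reference adds
# roughly one machine word (8 bytes on a 64-bit build) on top of that.
object_size = sys.getsizeof(s)
```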

Why are memory addresses not the same for identical values

list1=[1,2,3,4]
list2=[1,2,3,4]

creates 2 objects that happen to hold the same values. Since the objects are different, they have different ids. (In this case you can modify either of them independently.)

list1=list2=[1,2,3,4]

creates 2 references to the same object. Since the object is the same, the ids are identical. (In this case you cannot modify list1 without changing list2.)
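The two cases side by side (identity is shown via assertions rather than literal id values, since those vary from run to run):

```python
list1 = [1, 2, 3, 4]
list2 = [1, 2, 3, 4]
assert list1 == list2            # equal values...
assert id(list1) != id(list2)    # ...but two distinct objects

list1 = list2 = [1, 2, 3, 4]
assert id(list1) == id(list2)    # one object, two names
list1[0] = 99
assert list2[0] == 99            # so modifying one "modifies" the other
```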

For strings it is a bit more subtle: Python creates only one object "hello" even if you do

a = "hello"
b = "hello"

BTW you may as well call id("hello") directly and find the same result.


