About the Changing Id of an Immutable String

CPython does not promise to intern all strings by default, but in practice, many places in the Python codebase do reuse already-created string objects. Many Python internals use (the C equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will normally produce distinct string objects.
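As a minimal sketch of explicit interning, sys.intern() returns the canonical copy of a string, so equal strings end up sharing one object (the strings here are built at run time so the compiler cannot merge them):

```python
import sys

# Build two equal strings at run time so the compiler cannot merge them
a = ''.join(['py', 'thon rocks'])
b = ''.join(['python', ' rocks'])
print(a == b)      # True: equal contents
print(a is b)      # False: two distinct, non-interned objects

# sys.intern() returns the canonical copy, so both names share one object
a = sys.intern(a)
b = sys.intern(b)
print(a is b)      # True: both refer to the interned string
```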

Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) additionally stores the most recent expression result in the _ name, which muddies the waters further.

As such, you will see the same id crop up from time to time.

Running just the line id(<string literal>) in the REPL goes through several steps:

  1. The line is compiled, which includes creating a constant for the string object:

    >>> compile("id('foo')", '<stdin>', 'single').co_consts
    ('foo', None)

    This shows the stored constants with the compiled bytecode; in this case a string 'foo' and the None singleton. Simple expressions consisting of constants that produce an immutable value may be optimised at this stage; see the note on optimizers, below.

  2. On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:

    >>> import dis
    >>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
      1           0 LOAD_NAME                0 (id)
                  3 LOAD_CONST               0 ('foo')
                  6 CALL_FUNCTION            1
                  9 PRINT_EXPR
                 10 LOAD_CONST               1 (None)
                 13 RETURN_VALUE
  3. The code object is not referenced by anything, so its reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.

Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.

ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
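This can be mimicked outside the REPL with eval(): each call compiles, runs, and discards its own code object, so the string constant dies as soon as id() returns its address. The equality below is a hypothetical outcome that often holds on CPython, never a guarantee:

```python
# Each eval() compiles, runs, and throws away its code object, so the
# string constant is freed as soon as id() returns its address.
a = eval("id('temporary string one!')")
b = eval("id('temporary string two!')")
# On CPython, the freed block is often handed straight back for the
# equally-sized second string, so a == b frequently holds, but nothing
# guarantees it; a garbage collection run can shift addresses.
print(a == b)
```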

Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:

if (all_name_chars(v)) {
    PyObject *w = v;
    PyUnicode_InternInPlace(&v);
    if (w != v) {
        PyTuple_SET_ITEM(tuple, i, v);
        modified = 1;
    }
}

where all_name_chars() is documented as

/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */

Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
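You can watch this happen by compiling the same name-like literal in two independent code blocks; the shared identity below is a CPython implementation detail, not a language guarantee:

```python
# Two separately compiled code objects, each with a 'so' constant
c1 = compile("x = 'so'", '<first>', 'exec')
c2 = compile("y = 'so'", '<second>', 'exec')

# Because 'so' looks like an identifier, it was interned when each code
# object was created, so both constants are the very same string object.
s1 = c1.co_consts[c1.co_consts.index('so')]
s2 = c2.co_consts[c2.co_consts.index('so')]
print(s1 is s2)   # True on CPython
```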

Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:

>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True

If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:

>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True

The Python compiler uses either the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peephole optimizer limits its output to sequences of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit of 4096 characters for strings. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings, provided the resulting string fits within the optimizer limits of your current Python version.

E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:

>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
  1           0 LOAD_CONST               0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
              2 RETURN_VALUE

and

>>> dis.dis("'foo' * 6")
  1           0 LOAD_CONST               2 ('foofoofoofoofoofoo')
              2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
  1           0 LOAD_CONST               0 ('foo')
              2 LOAD_CONST               1 (7)
              4 BINARY_MULTIPLY
              6 RETURN_VALUE
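The two mechanisms combine: when constant folding produces a name-like string within the size limit, the folded result is itself a candidate for interning. A small sketch (CPython-specific behavior, not a language guarantee):

```python
x = 'foo_' + 'bar'   # folded to the constant 'foo_bar' at compile time
y = 'foo_bar'
# Both are name-like constants of this code object, so CPython interns
# them and both names end up bound to the same string object.
print(x is y)        # True on CPython
```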

Why does the second evaluation of += on an immutable string not change the id in Python 3?

This is only possible due to a weird, slightly-sketchy optimization for string concatenation in the bytecode evaluation loop. The INPLACE_ADD implementation special-cases two string objects:

case TARGET(INPLACE_ADD): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(tstate, left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    else {
    ...

and calls a unicode_concatenate helper that delegates to PyUnicode_Append, which tries to mutate the original string in-place:

void
PyUnicode_Append(PyObject **p_left, PyObject *right)
{
    ...
    if (unicode_modifiable(left)
        && PyUnicode_CheckExact(right)
        && PyUnicode_KIND(right) <= PyUnicode_KIND(left)
        /* Don't resize for ascii += latin1. Convert ascii to latin1 requires
           to change the structure size, but characters are stored just after
           the structure, and so it requires to move all characters which is
           not so different than duplicating the string. */
        && !(PyUnicode_IS_ASCII(left) && !PyUnicode_IS_ASCII(right)))
    {
        /* append inplace */
        if (unicode_resize(p_left, new_len) != 0)
            goto error;

        /* copy 'right' into the newly allocated area of 'left' */
        _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
    }
    ...

The optimization only happens if unicode_concatenate can guarantee there are no other references to the LHS. Your initial a = "d" had other references, since Python keeps a cache of 1-character strings in the Latin-1 range, so the optimization didn't trigger. The optimization can also fail to trigger in a few other cases, such as when the LHS has a cached hash, or when realloc needs to move the string (in which case most of the optimization's code path executes, but it doesn't succeed in performing the operation in place).


This optimization violates the normal rules for id and +=. Normally, += on immutable objects is supposed to create a new object before clearing the reference to the old object, so the new and old objects should have overlapping lifetimes, forbidding equal id values. With the optimization in place, the string after the += has the same ID as the string before the +=.

The language developers decided they cared more about people who would put string concatenation in a loop, see bad performance, and assume Python sucks, than they cared about this obscure technical point.
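The effect shows up in a couple of lines. Note the string is built at run time so it is neither interned nor a shared constant; whether the id actually survives the += is a CPython implementation detail that may or may not trigger:

```python
n = 21
s = 'ab' * n           # built at run time: not interned, single reference
before = id(s)
s += 'cd'
# On CPython, the concatenation usually resizes the string in place here,
# so the id tends to survive the += (but this is never guaranteed).
print(id(s) == before)
```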

CPython: Why does += for strings change the id of the string variable?

The optimization you are trying to trigger is an implementation detail of CPython, and quite a subtle one: there are many details (e.g. the one you are experiencing) that can prevent it.

For a detailed explanation, one needs to dive into CPython's implementation, so first I will give a hand-waving explanation, which should convey at least the gist of what is going on. The gory details are in the second part, which highlights the important code parts.


Let's take a look at this function, which exhibits the desired/optimized behavior:

def add_str(str1, str2, n):
    for i in range(n):
        str1 += str2
        print(id(str1))
    return str1

Calling it, leads to the following output:

>>> add_str("1","2",100)
2660336425032
... 4 times
2660336425032
2660336418608
... 6 times
2660336418608
2660336361520
... 6 times
2660336361520
2660336281800
and so on

I.e. a new string is created only every 8th addition; otherwise the old string (or, as we will see, its memory) is reused. The first id is printed only 6 times because it starts printing when the size of the unicode object is 2 modulo 8 (and not 0, as in the later cases).

The first question is: if a string is immutable in CPython, how (or better, when) can it be changed? Obviously, we can't change the string if it is bound to different variables, but we could change it if the current variable holds the only reference, which can be checked quite cheaply thanks to CPython's reference counting (and that is the reason why this optimization isn't available on implementations that don't use reference counting).

Let's change the function above by adding an additional reference:

def add_str2(str1, str2, n):
    for i in range(n):
        ref = str1
        str1 += str2
        print(id(str1))
    return str1

Calling it leads to:

>>> add_str2("1","2",20)
2660336437656
2660337149168
2660337149296
2660337149168
2660337149296
... every time a different string - there is copying!

This actually explains your observation:

import sys
s = 'ab'
print(sys.getrefcount(s))
# 9
print(id(s))
# 2660273077752
s+='a'
print(id(s))
# 2660337158664 Different

Your string s is interned (see for example this SO-answer for more information about string interning and the integer pool), so s is not the only one "using" this string, and thus the string cannot be changed in place.

If we avoid the interning, we can see, that the string is reused:

import sys
s = 'ab'*21 # will not be interned
print(sys.getrefcount(s))
# 2, that means really not interned
print(id(s))
# 2660336107312
s+='a'
print(id(s))
# 2660336107312 the same id!

But how does this optimization work?

CPython uses its own memory management: the pymalloc allocator, which is optimized for small objects with short lifetimes. The memory blocks it hands out are multiples of 8 bytes, so if the allocator is asked for only 1 byte, 8 bytes are still marked as used (more precisely, because the returned pointers are 8-byte aligned, the remaining 7 bytes cannot be used for other objects).

There is, however, the function PyMem_Realloc: if the allocator is asked to reallocate a 1-byte block as a 2-byte block, there is nothing to do, since some reserved bytes are available anyway.

This way, if there is only one reference to the string, CPython can ask the allocator to reallocate the string, requesting one byte more. In 7 cases out of 8 there is nothing for the allocator to do, and the additional byte becomes available almost for free.
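This growth pattern can be made visible by recording the id after every one-character append. The sketch below assumes CPython; the exact run lengths depend on the version and the allocator's state:

```python
def append_ids(n):
    k = 21
    s = 'ab' * k            # built at run time, so not a shared constant
    ids = []
    for _ in range(n):
        s += 'x'
        ids.append(id(s))
    return ids

ids = append_ids(16)
# On CPython, consecutive appends typically reuse the same address until
# the string outgrows its 8-byte-aligned block; then a new id appears.
print(len(set(ids)), 'distinct addresses over', len(ids), 'appends')
```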

However, if the size of the string changes by more than 7 bytes, the copying becomes mandatory:

>>> add_str("1", "1"*8, 20)  # size change of 8
2660337148912
2660336695312
2660336517728
... every time another id

Furthermore, for requests larger than 512 bytes pymalloc falls back to PyMem_RawMalloc, which is usually the memory manager of the C runtime, and the above optimization for strings is no longer possible:

>>> add_str("1"*512, "1", 20)  # str1 is larger than 512 bytes
2660318800256
2660318791040
2660318788736
2660318807744
2660318800256
2660318796224
... every time another id

Actually, whether the addresses differ after each reallocation depends on the C runtime's memory allocator and its state. If memory isn't fragmented, chances are high that realloc manages to extend the memory without copying (though that was not the case on my machine when I ran these experiments); see also this SO-post.


For the curious, here is the whole call chain of the str1 += str2 operation, which can easily be followed in a debugger.

This is what is going on:

The += is compiled to the BINARY_ADD opcode, and when it is evaluated in ceval.c, there is a hook/special handling for unicode objects (see PyUnicode_CheckExact):

case TARGET(BINARY_ADD): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    ...
    if (PyUnicode_CheckExact(left) &&
             PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    ...

unicode_concatenate ends up calling PyUnicode_Append, which checks whether the left operand is modifiable (which basically checks that there is only one reference, that the string isn't interned, and some further conditions) and either resizes it in place or creates a new unicode object otherwise:

if (unicode_modifiable(left)
    && ...)
{
    /* append inplace */
    if (unicode_resize(p_left, new_len) != 0)
        goto error;

    /* copy 'right' into the newly allocated area of 'left' */
    _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
}
else {
    ...
    /* Concat the two Unicode strings */
    res = PyUnicode_New(new_len, maxchar);
    if (res == NULL)
        goto error;
    _PyUnicode_FastCopyCharacters(res, 0, left, 0, left_len);
    _PyUnicode_FastCopyCharacters(res, left_len, right, 0, right_len);
    Py_DECREF(left);
    ...
}

unicode_resize ends up calling resize_compact (mostly because in our case we have only ASCII characters), which in turn calls PyObject_REALLOC:

...
new_unicode = (PyObject *)PyObject_REALLOC(unicode, new_size);
...

which basically will be calling pymalloc_realloc:

static int
pymalloc_realloc(void *ctx, void **newptr_p, void *p, size_t nbytes)
{
    ...
    /* pymalloc is in charge of this block */
    size = INDEX2SIZE(pool->szidx);
    if (nbytes <= size) {
        /* The block is staying the same or shrinking.
        ....
        *newptr_p = p;
        return 1; // 1 means success!
        ...
    }
    ...
}

Where INDEX2SIZE just rounds up to the nearest multiple of 8:

#define ALIGNMENT               8               /* must be 2^N */
#define ALIGNMENT_SHIFT 3

/* Return the number of bytes in size class I, as a uint. */
#define INDEX2SIZE(I) (((uint)(I) + 1) << ALIGNMENT_SHIFT)
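In Python terms, the macro is a direct transcription of the C above: size class i holds blocks of (i + 1) * 8 bytes.

```python
ALIGNMENT = 8           # must be a power of two
ALIGNMENT_SHIFT = 3

def index2size(i):
    """Python rendering of CPython's INDEX2SIZE: bytes in size class i."""
    return (i + 1) << ALIGNMENT_SHIFT

# Size class 0 holds up to 8 bytes, class 1 up to 16, and so on.
print([index2size(i) for i in range(4)])   # [8, 16, 24, 32]
```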

qed.

How is a string immutable if I can replace it in Python?

The fact that the ids were the same implies that they were pointing to the same object.

Strings in Python are immutable: you can't assign values to their indexes.

Immutability means that we cannot change the original object:

# List is mutable
l = [1, 2, 3]
# That's OK
l[0] = 5
# String is NOT mutable
s = '123'
# That's NOT OK - it raises a TypeError!
s[0] = '5'
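You can confirm the failure directly by catching the exception:

```python
s = '123'
try:
    s[0] = '5'              # strings do not support item assignment
except TypeError as exc:
    print('failed as expected:', exc)
```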

The following assignment means that we create another pointer to the same string:

S = "Taxi"
T = "Zaxi"
# The following line makes 'S' point to the same string as 'T'.
# The string 'S' referenced before is unchanged (just no longer reachable via S).
S = T
# The assertion passes: S and T now point to the same string
assert id(S) == id(T)

The fact that strings are immutable makes it memory-inefficient to generate new strings based on existing strings.

In C#, where strings are immutable too, there is a class called StringBuilder that mitigates this inefficiency.

For alternatives to StringBuilder in Python, read more in this great thread.
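The standard-library substitutes are str.join() over a list of parts, or io.StringIO as an explicit buffer; both avoid the quadratic cost of repeated concatenation. A short sketch:

```python
import io

parts = ['spam', 'ham', 'eggs']

# str.join builds the result in one pass over the collected parts
joined = ''.join(parts)

# io.StringIO acts like a StringBuilder: write pieces, then extract once
buf = io.StringIO()
for p in parts:
    buf.write(p)

print(buf.getvalue() == joined)   # True: both produce 'spamhameggs'
```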

Why do mutable objects having the same value have different ids in Python?

Your interpretation is incorrect.

When you assign a new list to a, you change its reference.

On the other hand you could do:

a[:] = [3,2,1]

and then the reference would not change.
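The difference is easy to check with id(); this is standard list semantics, not implementation-specific behavior:

```python
a = [1, 2, 3]
before = id(a)
a = [3, 2, 1]            # rebinding: a now refers to a brand-new list
print(id(a) == before)   # False: a different object

b = [1, 2, 3]
before = id(b)
b[:] = [3, 2, 1]         # slice assignment mutates the existing list in place
print(id(b) == before)   # True: same object, new contents
```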

Aren't Python strings immutable? Then why does a + " " + b work?

First, a pointed to the string "Dog". Then you changed the variable a to point at a new string, "Dog eats treats". You didn't actually mutate the string "Dog". Strings are immutable; variables can point at whatever they want.
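A sketch of what the question's code does (the variable names are assumed from the question):

```python
a = 'Dog'
b = 'eats treats'
old_id = id(a)
a = a + ' ' + b          # builds a brand-new string object
print(a)                 # Dog eats treats
print(id(a) == old_id)   # False: a now names a different object
```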


