The 'Is' Operator Behaves Unexpectedly with Non-Cached Integers

tl;dr:

As the reference manual states:

A block is a piece of Python program text that is executed as a unit.
The following are blocks: a module, a function body, and a class definition.
Each command typed interactively is a block.

This is why, in the case of a function, you have a single code block which contains a single object for the numeric literal 1000, so id(a) == id(b) will yield True.

In the second case, you have two distinct code objects each with their own different object for the literal 1000 so id(a) != id(b).

Note that this behavior isn't limited to int literals; you'll get similar results with, for example, float literals (see here).

Of course, comparing objects (except for explicit is None tests) should always be done with the equality operator == and not is.

Everything stated here applies to the most popular implementation of Python, CPython. Other implementations might differ so no assumptions should be made when using them.
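
The function case can be reproduced with a short script. This is a minimal sketch, assuming CPython; the body of `func` is not shown in this answer, so the body below is an assumption made for illustration.

```python
def func():
    # Both names are bound to the single constant stored in func's code object.
    a = 1000
    b = 1000
    return a is b


same_in_function = func()
print(same_in_function)  # True on CPython: one code block, one constant object
```

On another implementation the result may differ, which is exactly why `==` is the right tool for comparing values.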



Longer Answer:

To get a little clearer view and additionally verify this seemingly odd behaviour we can look directly in the code objects for each of these cases using the dis module.

For the function func:

Along with all other attributes, function objects also have a __code__ attribute that allows you to peek into the compiled bytecode for that function. Using dis.code_info we can get a nice pretty view of all stored attributes in a code object for a given function:

>>> print(dis.code_info(func))
Name: func
Filename: <stdin>
Argument count: 0
Kw-only arguments: 0
Number of locals: 2
Stack size: 2
Flags: OPTIMIZED, NEWLOCALS, NOFREE
Constants:
0: None
1: 1000
Variable names:
0: a
1: b

We're only interested in the Constants entry for function func. In it, we can see that we have two values, None (always present) and 1000. We only have a single int instance that represents the constant 1000. This is the value that a and b are going to be assigned to when the function is invoked.

Accessing this value is easy via func.__code__.co_consts[1] and so, another way to view our a is b evaluation in the function would be like so:

>>> id(func.__code__.co_consts[1]) == id(func.__code__.co_consts[1]) 

Which, of course, will evaluate to True because we're referring to the same object.
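
The position of the constant inside `co_consts` can differ between CPython versions, so hard-coding index 1 is fragile. A hedged sketch that looks the constant up by value instead; `func` is redefined here with an assumed body so the example is self-contained.

```python
def func():
    a = 1000
    b = 1000
    return a, b


# Collect every int constant equal to 1000 stored in the code object.
matches = [c for c in func.__code__.co_consts if type(c) is int and c == 1000]
a, b = func()

print(len(matches))     # how many separate 1000 constants were stored
print(a is matches[0])  # whether the returned value is that stored constant
print(a is b)           # whether both names ended up on the same object
```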

For each interactive command:

As noted previously, each interactive command is interpreted as a single code block: parsed, compiled and evaluated independently.

We can get the code objects for each command via the compile built-in:

>>> com1 = compile("a=1000", filename="", mode="single")
>>> com2 = compile("b=1000", filename="", mode="single")

For each assignment statement, we will get a similar looking code object which looks like the following:

>>> print(dis.code_info(com1))
Name: <module>
Filename:
Argument count: 0
Kw-only arguments: 0
Number of locals: 0
Stack size: 1
Flags: NOFREE
Constants:
0: 1000
1: None
Names:
0: a

The output for com2 looks the same but has a fundamental difference: each of the code objects com1 and com2 has its own int instance representing the literal 1000. This is why, in this case, when we do a is b via the co_consts attribute, we actually get:

>>> id(com1.co_consts[0]) == id(com2.co_consts[0])
False

Which agrees with what we actually got.

Different code objects, different contents.


Note: I was somewhat curious as to how exactly this happens in the source code and after digging through it I believe I finally found it.

During the compilation phase, the co_consts attribute is represented by a dictionary object. In compile.c we can actually see the initialization:

/* snipped for brevity */

u->u_lineno = 0;
u->u_col_offset = 0;
u->u_lineno_set = 0;
u->u_consts = PyDict_New();

/* snipped for brevity */

During compilation this is checked for already existing constants. See @Raymond Hettinger's answer below for a bit more on this.
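
The check for existing constants can be observed from Python. The sketch below assumes CPython and uses an illustrative source string: a literal used twice is stored once, while an equal-valued literal of a different type is kept apart.

```python
code = compile("a = 1000; b = 1000.0; c = 1000", "<demo>", "exec")

# Split the stored constants by exact type (None is ignored by both filters).
ints = [c for c in code.co_consts if type(c) is int]
floats = [c for c in code.co_consts if type(c) is float]

print(ints)    # the int literal appears once although it is used twice
print(floats)  # the equal-valued float literal is stored separately
```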



Caveats:

  • Chained statements yield True for the identity check:

    It should be more clear now why exactly the following evaluates to True:

    >>> a = 1000; b = 1000;
    >>> a is b

    In this case, by chaining the two assignment commands together we tell the interpreter to compile these together. As in the case for the function object, only one object for the literal 1000 will be created resulting in a True value when evaluated.

  • Execution on a module level yields True again:

    As previously mentioned, the reference manual states that:

    ... The following are blocks: a module ...

    So the same premise applies: we will have a single code object (for the module) and so, as a result, single values stored for each different literal.

  • The same doesn't apply for mutable objects:

    Meaning that unless we explicitly initialize to the same mutable object (for example with a = b = []), the identity of the objects will never be equal, for example:

    a = []; b = []
    a is b # always evaluates to False

    Again, in the documentation, this is specified:

    after a = 1; b = 1, a and b may or may not refer to the same object with the value one, depending on the implementation, but after c = []; d = [], c and d are guaranteed to refer to two different, unique, newly created empty lists.
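
The guarantee for mutable objects is easy to check, and unlike the integer cases it does not depend on the implementation. A short sketch with illustrative names:

```python
a = []
b = []      # a second, independent list
c = d = []  # one list bound to two names

a.append(1)
d.append(2)

print(a is b, b)  # False []
print(c is d, c)  # True [2]
```

Appending through `d` is visible through `c` because both names refer to one list, while `b` is untouched by the change to `a`.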

is operator behaves unexpectedly with integers

Take a look at this:

>>> a = 256
>>> b = 256
>>> id(a)
9987148
>>> id(b)
9987148
>>> a = 257
>>> b = 257
>>> id(a)
11662816
>>> id(b)
11662828

Here's what I found in the documentation for "Plain Integer Objects":

The current implementation keeps an array of integer objects for all integers between -5 and 256. When you create an int in that range you actually just get back a reference to the existing object.
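
To observe the cache without compile-time constant sharing getting in the way, the values can be built at run time. This is a sketch that assumes CPython; `same_object` is a hypothetical helper, not part of any library.

```python
def same_object(text):
    # int(text) builds the value at run time, so no compiled constant is shared.
    return int(text) is int(text)


big = "1" + "0" * 30

print(same_object("256"))  # True on CPython: inside the cached range
print(same_object("257"))  # False on CPython builds that cache only -5..256
print(same_object(big))    # False on CPython: far outside any small-int cache
```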

Why does the 'is' operator behave unexpectedly with arithmetically equal expressions

When you do something like :

(case-1)

a = 1000
b = a

or (case-2)

a = 1000
b = 1000

Python can work out ahead of time that no new object is needed.

In the first case, b simply becomes another name for the object that a already refers to.

The second case is a bit different.
Python is a true object-oriented language, and the literal 1000 is treated as an object. (Intuitively, you can think of 1000 as the name of a constant object.)

So in the second case, a and b technically both become aliases of that one 1000 object, provided the two statements are compiled together as part of the same code block.

Now in your example:

a = 1000
b = 1000 + a - a
print (a == b)
print (a is b)

While assigning b, Python doesn't know beforehand what the value of a is going to be. By beforehand I mean before any calculation has started. So Python reserves a new memory location for b and then saves the output of the operation in this new memory location.

It is also worth noting this:

4-1 is 3
True

In this case, Python doesn't store this line as 4-1; it evaluates the expression to 3 at compile time, as a run-time optimisation.
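
The difference between the two expressions can be seen in the compiled bytecode. The sketch below assumes CPython; `has_binary_op` is a hypothetical helper that reports whether any arithmetic instruction is left after compilation.

```python
import dis


def has_binary_op(source):
    # Compile the statement and look for a surviving arithmetic instruction.
    code = compile(source, "<demo>", "exec")
    return any("BINARY" in ins.opname for ins in dis.get_instructions(code))


folded = has_binary_op("x = 4 - 1")          # literals only
runtime = has_binary_op("b = 1000 + a - a")  # depends on the name a

print(folded)   # False on CPython: the compiler stored 3 directly
print(runtime)  # True: the arithmetic is carried out when the line runs
```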

'is' operator behaves unexpectedly with floats

This has to do with how is works. It compares references instead of values. It returns True only if both operands refer to the same object.

In this case, they are different instances; float(0) and float(0) have the same value (==) but are distinct objects as far as Python is concerned. The CPython implementation also caches integers as singleton objects in the range [x | x ∈ ℤ ∧ -5 ≤ x ≤ 256]:

>>> 0.0 is 0.0
True
>>> float(0) is float(0) # Not the same reference, unique instances.
False

In this example we can demonstrate the integer caching principle:

>>> a = 256
>>> b = 256
>>> a is b
True
>>> a = 257
>>> b = 257
>>> a is b
False

Now, if floats are passed to float(), the float literal is simply returned (short-circuited), as in the same reference is used, as there's no need to instantiate a new float from an existing float:

>>> 0.0 is 0.0
True
>>> float(0.0) is float(0.0)
True
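
The short-circuit can be separated from the literal sharing by building the floats at run time. A minimal sketch, assuming CPython, with illustrative names:

```python
x = float("1.5")  # built at run time, not a shared literal
y = float("1.5")  # a second object with the same value

print(x == y)         # True: equal values
print(x is y)         # False on CPython: two separate objects
print(float(x) is x)  # True on CPython: an exact float is returned as-is
```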

This can be demonstrated further by using int() also:

>>> int(256.0) is int(256.0)  # Same reference, cached.
True
>>> int(257.0) is int(257.0) # Different references are returned, not cached.
False
>>> 257 is 257 # Same reference.
True
>>> 257.0 is 257.0 # Same reference. As @Martijn Pieters pointed out.
True

However, the results of is also depend on the scope in which it is executed (beyond the scope of this question/explanation); please refer to @Jim's fantastic explanation of code objects. Even Python's documentation includes a section on this behavior:

  • 5.9 Comparisons

[7]
Due to automatic garbage-collection, free lists, and the dynamic nature of descriptors, you may notice seemingly unusual behaviour in certain uses of the is operator, like those involving comparisons between instance methods, or constants. Check their documentation for more info.

Why are the addresses of the items of a list are same?

In the usual case in object-oriented programming, when you create a new object the language/computer reserves a new memory space and address for it.

There is a special case in Python: numbers are immutable and will never change, so Python may optimize by pointing to the same memory address when the same number is reused. This applies to numbers between -5 and 256 (it is a kind of Singleton pattern).

Warning
This is an implementation detail and not guaranteed behavior, so you should not rely on it.

This optimization was chosen because creating a new number object each time would cost too much time and memory, and would cause too many memory allocations and deallocations.

So the "is" operator may return True if the two numbers are equal and between -5 and 256.

Best Practice
Do not rely on this implementation detail, which may change.
There is also no point in using the "is" operator here; use "==" to compare contents rather than addresses.

is operator in Python

is is not an equality operator. It checks to see if two variables refer to the same object. If you were to do this:

a = "1234"
b = a

then a is b would be True, since they refer to the same object.

The cases you present are due to implementation details of the Python interpreter. Since strings are immutable, in some cases, creating two of the same string will yield references to the same object -- i.e., in your first case, the Python interpreter only creates a single copy of "1234", and a and b both refer to that object. In the second case, the interpreter creates two copies. This is due to the way the interpreter creates and handles strings, and, as an implementation detail, should not be relied upon.
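
When identity between equal strings is really needed, it can be requested explicitly with `sys.intern` instead of being left to the interpreter. A small sketch with illustrative values; whether `a is b` holds before interning is the implementation detail discussed above, so it is not printed here.

```python
import sys

a = "1234"
b = "".join(["12", "34"])  # same characters, assembled at run time

print(a == b)                          # True: equal contents
print(sys.intern(a) is sys.intern(b))  # True: interning maps equal strings to one object
```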

Why does the float object behave differently with the is operator?

Mutable objects always create a new object, otherwise the data would be shared. There's not much to explain here, as if you append an item to an empty list, you don't want all of the empty lists to have that item.

Immutable objects behave in a completely different manner:

  • Strings get interned. If they are shorter than 20 alphanumeric characters and are static (constants in the code, function names, etc.), they get cached and are accessed from a special mapping reserved for them; the exact rules are a CPython implementation detail and vary between versions. This saves memory but, more importantly, makes comparison faster. Python uses a lot of dictionary access operations under the hood, which require string comparison. Being able to compare two strings, such as attribute or function names, by comparing their memory addresses instead of their actual values is a significant runtime improvement.

  • Booleans simply return the same object. Considering there are only 2 available, it makes no sense creating them again and again.

  • Small integers (from -5 to 256) by default, are also cached. These are used quite often, just about everywhere. Every time an integer is in that range, CPython simply returns the same object.

Floats however are not cached. Unlike integers, where the numbers 0-10 are extremely common, 1.0 isn't guaranteed to be more used than 2.0 or 0.1. That's why float() simply returns a new float. We could have optimized the empty float(), and we can check for speed benefits but it might not have made such a difference.

The confusion starts to arise when float(0.0) is float(0.0) evaluates to True. Python has numerous optimizations built in:

  • First of all, consts are saved in each function's code object. 0.0 is 0.0 simply refers to the same object. It is a compile-time optimization.

  • Second of all, float(0.0) takes the 0.0 object, and since it's a float (which is immutable), it simply returns it. No need to create a new object if it's already a float.

  • Lastly, 1.0 + 1.0 is 2.0 will also work. The reason is that 1.0 + 1.0 is calculated at compile time and then references the same 2.0 object:

    import dis

    def test():
        return 1.0 + 1.0 is 2.0

    dis.dis(test)
      2           0 LOAD_CONST               1 (2.0)
                  2 LOAD_CONST               1 (2.0)
                  4 IS_OP                    0
                  6 RETURN_VALUE

    As you can see, there is no addition operation. The function was compiled with the result pointing to the exact same constant object.

So while there is no float-specific optimization, three different generic optimizations come into play. Together they ultimately decide whether it will be the same object or not.

What is the difference between destructured assignment and normal assignment?

I know nothing about Python but I was curious.

First, this happens when assigning an array too:

x = [-10,-10]
x[0] is x[1] # True

It also happens with strings, which are immutable.

x = ['foo', 'foo']
x[0] is x[1] # True

Disassembly of the first function:

         0 LOAD_CONST               1 (-10)
         3 LOAD_CONST               1 (-10)
         6 BUILD_LIST               2
         9 STORE_FAST               0 (x)

The LOAD_CONST (consti) op pushes constant co_consts[consti] onto the stack. But both ops here have consti=1, so the same object is being pushed to the stack twice. If the numbers in the array were different, it would disassemble to this:

         0 LOAD_CONST               1 (-10)
         3 LOAD_CONST               2 (-20)
         6 BUILD_LIST               2
         9 STORE_FAST               0 (x)

Here, constants of index 1 and 2 are pushed.

co_consts is a tuple of constants used by a Python script. Evidently literals with the same value are only stored once.

As for why 'normal' assignment works - you're using the REPL so I assume each line is compiled separately. If you put

x = -10
y = -10
print(x is y)

into a test script, you'll get True. So normal assignment and destructured assignment both work the same in this regard :)
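
The REPL-versus-script difference can be simulated without leaving a single file. This is a sketch that assumes CPython; `run_together` and `run_separately` are hypothetical helpers, and 1000 stands in for -10 as another value outside the small-integer cache.

```python
def run_together(lines):
    # One compile call: a single code object, as for a script file.
    namespace = {}
    exec(compile("\n".join(lines), "<script>", "exec"), namespace)
    return namespace


def run_separately(lines):
    # One compile call per line: separate code objects, as at the REPL.
    namespace = {}
    for line in lines:
        exec(compile(line, "<repl>", "exec"), namespace)
    return namespace


lines = ["x = 1000", "y = 1000"]
script = run_together(lines)
repl = run_separately(lines)

print(script["x"] is script["y"])  # True on CPython: the constant is shared
print(repl["x"] is repl["y"])      # typically False on CPython
```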

Odd Python ID assignment for Int values == Inconsistent 'is' operation

This is documented behavior of Python:

The current implementation keeps an array of integer objects for all integers between -5 and 256, when you create an int in that range you actually just get back a reference to the existing object.
source

It helps to save memory and to make operations a bit faster.
It is implementation-specific. For example, IronPython has a range between -1000 and 1000 in which it re-uses integers.

Deeper understanding of Python object mechanisms

Immutable atomic objects are singletons

Nope, some are and some aren't, this is a detail of the CPython implementation.

  • Integers in the range [-5, 256] are cached, and when a new request for one of these is made the already existing object is returned. Numbers outside that range are subject to constant folding, where the interpreter re-uses constants during compilation as a slight optimization. This is documented in the section on creating new PyLong objects.

    Also, see the following for a discussion on these:

    • 'is' operator behaves unexpectedly with non-cached integers
    • "is" operator behaves unexpectedly with integers
  • String literals are subject to interning during compilation to bytecode, as are ints. The rules governing this are not as simple as for ints, though: only strings under a certain size composed of certain characters are considered. I am not aware of any section in the docs specifying this; you could take a look at the behavior by reading here.

  • For floats, for example, which could be considered "atomic" (even though in Python that term doesn't have the meaning you think), there are no singletons:

    i = 1.0
    j = 1.0
    i is j # False

    They are, of course, still subject to constant folding, as you can see by reading: 'is' operator behaves unexpectedly with floats

Immutable and Iterable objects might be singletons but not exactly instead they hash equally

Empty immutable collections are singletons; this is again an implementation detail that can't be found in the Python Reference and is only discovered if you look at the source.

See here for a look at the implementation: Why does '() is ()' return True when '[] is []' and '{} is {}' return False?
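
The contrast between the empty tuple and the empty list can be checked directly. A minimal sketch, assuming CPython for the tuple result; the constructor calls are used instead of literals so that the comparison happens between run-time values.

```python
empty_tuples_shared = tuple() is tuple()
empty_lists_shared = list() is list()

print(empty_tuples_shared)  # True on CPython: the empty tuple is a singleton
print(empty_lists_shared)   # False: every call creates a new list
```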

Passing dictionary by double dereferencing works as shallow copy of it.

Yes. Though the term isn't double dereferencing, it is unpacking.
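
What "shallow copy" means for unpacking can be shown in a few lines. In this sketch `receive` is a hypothetical function: the dictionary it gets is new, but the values inside it are the same objects as in the original.

```python
def receive(**kwargs):
    # kwargs is a new dict built from the unpacked items.
    return kwargs


original = {"items": [1, 2]}
received = receive(**original)

print(received is original)                    # False: a new dict
print(received["items"] is original["items"])  # True: the values are shared
received["items"].append(3)
print(original)                                # {'items': [1, 2, 3]}
```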

Are those behaviours well documented somewhere?

Those that are considered an implementation detail needn't be documented in the way you'd find documentation for the max function for example. These are specific things that might easily change if the decision is made so.


