NumPy Array Dtype Is Coming as int32 by Default on a Windows 10 64-Bit Machine

The original poster, Prana, asked a very good question: "Why is the integer default set to 32-bit on a 64-bit machine?"

As near as I can tell, the short answer is: "Because it was designed wrong".
It seems obvious that a 64-bit machine should define the default integer in any associated interpreter as 64-bit, but the two existing answers explain why this is not the case. Things are now different, and so I offer this update.

What I notice is that on both CentOS 7.4 Linux and macOS 10.10.5 (the new and the old), running Python 2.7.14 with NumPy 1.14.0 (as of January 2018), the default integer is now defined as 64-bit. (The my_array.dtype in the initial example would now report dtype('int64') on both platforms.)

Using 32-bit integers as the default in any interpreter can produce very squirrelly results if you are doing integer math, as this question pointed out:

Using numpy to square value gives negative number

It now appears that Python and NumPy have been updated and revised (corrected, one might argue), so that to replicate the problem described in the question above, you have to explicitly define the NumPy array as int32.

In Python on both platforms, the default integer now appears to be int64. This code runs the same on both platforms (CentOS 7.4 and macOS 10.10.5):

>>> import numpy as np
>>> tlist = [1, 2, 47852]
>>> t_array = np.asarray(tlist)
>>> t_array.dtype

dtype('int64')

>>> print t_array ** 2

[ 1 4 2289813904]

But if we make t_array a 32-bit integer, we get the following, because the calculation rolls over the sign bit in the 32-bit word.

>>> t_array32 = np.asarray(tlist, dtype=np.int32)
>>> t_array32.dtype

dtype('int32')

>>> print t_array32 ** 2

[ 1 4 -2005153392]

The reason for using int32 is, of course, efficiency. There are some situations (such as using TensorFlow or other neural-network machine-learning tools) where you want 32-bit representations (mostly float, of course), because the speed gains over 64-bit floats can be quite significant.
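As an aside, here is a minimal sketch (my own, not part of the original answer) of the memory side of that trade-off, using explicit float32 arrays:

import numpy as np

# float64 is the default on most platforms; a float32 array uses half the
# memory per element, which is part of why the 32-bit path can be faster.
a64 = np.ones((1000, 1000))                    # float64 by default
a32 = np.ones((1000, 1000), dtype=np.float32)  # explicitly 32-bit

print(a64.dtype, a64.nbytes)  # float64 8000000
print(a32.dtype, a32.nbytes)  # float32 4000000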

Numpy returns different results on windows and unix

I have enough comments that I think an "answer" is warranted.

What happened?

Not sure how it got a negative from all those positives

As @rafaelc points out, you ran into an integer overflow. You can read more details at the wikipedia link that was provided.

What caused the overflow?

According to this thread, numpy uses the operating system's C long type as the default dtype for integers. So when you write this line of code:

c = np.array(c)

The dtype defaults to numpy's default integer data type, which is the operating system's C long. The size of a long in Microsoft's C implementation for Windows is 4 bytes (4 × 8 bits/byte = 32 bits), so your dtype defaults to a 32-bit integer.
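As a quick sanity check (a sketch added here, not part of the original answer), you can compare the platform's C long size against the item size numpy actually picks for a default integer array:

import ctypes
import numpy as np

# Size of the C long versus numpy's default integer width, both in bytes.
# On 64-bit Windows (with the numpy versions discussed here) both print 4;
# on 64-bit Linux/macOS both print 8.
print(ctypes.sizeof(ctypes.c_long))
print(np.array([1, 2, 3]).dtype.itemsize)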

Why did this calculation overflow?

In [1]: import numpy as np

In [2]: np.iinfo(np.int32)
Out[2]: iinfo(min=-2147483648, max=2147483647, dtype=int32)

The largest number a 32-bit, signed integer data type can represent is 2147483647. If you take a look at your product across just one axis:

In [5]: c * c.T
Out[5]:
array([[ 1,  8, 21],
       [ 8, 25, 48],
       [21, 48, 81]])

In [6]: (c * c.T).prod(axis=0)
Out[6]: array([ 168, 9600, 81648])

In [7]: 168 * 9600 * 81648
Out[7]: 131681894400

You can see that 131681894400 >> 2147483647 (in mathematics, the notation >> means "is much, much larger"). Since 131681894400 is much larger than the maximum integer the 32-bit long can represent, an overflow occurs.
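To make the wraparound concrete (my own sketch, reusing the product above), the stored result is the true product reduced modulo 2**32 and reinterpreted as a signed 32-bit integer:

import numpy as np

# The true product, computed with Python's unbounded ints, then cast down
# to 32 bits the way numpy's integer loop truncates it.
true_value = 168 * 9600 * 81648                 # 131681894400
wrapped = np.int64(true_value).astype(np.int32)
print(wrapped)                                  # -1462091776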

But it's fine in Linux

In Linux, a long is 8 bytes (8 × 8 bits/byte = 64 bits). Why? Here's an SO thread that discusses this in the comments.

"Is it a bug?"

No, although it's pretty annoying, I'll admit.

For what it's worth, it's usually a good idea to be explicit about your data types, so next time:

c = np.array(c, dtype='int64')

# or
c = np.array(c, dtype=np.int64)

Who do I report a bug to?

Again, this isn't a bug, but if it were, you'd open an issue on the numpy github (where you can also peruse the source code). Somewhere in there is proof of how numpy uses the operating system's default C long, but I don't have it in me to go digging around to find it.

I've discovered something strange

You are generating integers that are larger than your platform's default numpy int can handle, so your numbers overflow.

25 to the seventh power requires 33 bits:

>>> (25 ** 7).bit_length()
33

Python's built-in integer type is unbounded, but numpy uses bounded, fixed-size integers, and for your numpy setup the default signed integer type is int32, which fails to fit this value.

I can reproduce the exact same output if I tell numpy to convert the output to int32:

>>> np.sum(np.power(range(26), 7)).astype(np.int32)
792709145

but because I am running macOS on a 64-bit CPU, numpy uses int64 as the default integer type and so produces the correct result:

>>> np.sum(np.power(range(26), 7))
22267545625
>>> np.sum(np.power(range(26), 7)).dtype
dtype('int64')

792709145 is the integer value represented by the bottom 31 bits:

>>> print(int(format(22267545625, 'b')[-31:], 2))
792709145

On Windows however, the default numpy integer type is int32 because even on 64-bit hardware Windows defines C long int as a 32-bit value.

You should tell numpy to create an array of int64 values here:

np.sum(np.power(np.arange(26, dtype=np.int64), 7))

Pandas Dataframe: Why is astype method producing int32 results with an argument of int

Pandas uses numpy datatypes under the hood. From the numpy documentation,

The default NumPy behavior is to create arrays in either 32 or 64-bit
signed integers (platform dependent and matches C int size) or double
precision floating point numbers, int32/int64 and float, respectively.
If you expect your integer arrays to be a specific type, then you need
to specify the dtype while you create the array.

It is not a bug, and you should specify dtypes if you have a specific use or want to be platform-agnostic. To rephrase your question: what is np.dtype(int) on my platform?
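For reference (my own quick check, not part of the original answer), you can ask numpy directly:

import numpy as np

# Prints int32 on the Windows setups discussed here and int64 on
# 64-bit Linux/macOS.
print(np.dtype(int))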

On Windows, as some of the comments suggest, it appears to be a C signed long (32 bits). You can even get numpy to throw an overflow error to confirm this:

>>> import numpy as np
>>> np.array([2_147_483_648], dtype=int)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long
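
The platform-agnostic fix is the same one the other answers recommend: request the width explicitly (a minimal sketch):

>>> np.array([2_147_483_648], dtype=np.int64).dtype
dtype('int64')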

Why is the dtype shown (even if it's the native one) when using floor division with NumPy?

You actually have multiple distinct 32-bit integer dtypes here. This is probably a bug.

NumPy has (accidentally?) created multiple distinct signed 32-bit integer types, probably corresponding to C int and long. Both of them display as numpy.int32, but they're actually different objects. At C level, I believe the type objects are PyIntArrType_Type and PyLongArrType_Type, generated here.

dtype objects have a type attribute corresponding to the type object of scalars of that dtype. It is this type attribute that NumPy inspects when deciding whether to print dtype information in an array's repr:

_typelessdata = [int_, float_, complex_]
if issubclass(intc, int):
    _typelessdata.append(intc)

if issubclass(longlong, int):
    _typelessdata.append(longlong)

...

def array_repr(arr, max_line_width=None, precision=None, suppress_small=None):
    ...
    skipdtype = (arr.dtype.type in _typelessdata) and arr.size > 0

    if skipdtype:
        return "%s(%s)" % (class_name, lst)
    else:
        ...
        return "%s(%s,%sdtype=%s)" % (class_name, lst, lf, typename)

On numpy.arange(5) and numpy.arange(5) + 3, .dtype.type is numpy.int_; on numpy.arange(5) // 3 or numpy.arange(5) % 3, .dtype.type is the other 32-bit signed integer type.
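Here is a small demonstration of the distinction (my own sketch; the exact output depends on an affected NumPy build, e.g. on Windows):

import numpy as np

# On an affected build the two arrays report the same dtype name, but the
# underlying scalar type objects differ, and that is what decides whether
# the repr prints dtype=...
a = np.arange(5) + 3
b = np.arange(5) // 3
print(a.dtype, b.dtype)              # both display as int32
print(a.dtype.type is b.dtype.type)  # False on affected builds
print(repr(a))                       # dtype omitted
print(repr(b))                       # dtype=int32 shown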

As for why + and // have different output dtypes, they use different type resolution routines. Here's the one for //, and here's the one for +. //'s type resolution looks for a ufunc inner loop that takes types the inputs can be safely cast to, while +'s type resolution applies NumPy type promotion to the arguments and picks the loop matching the resulting type.

A bug of Python's power operator **?

These values are too big to store in the 32-bit int that numpy uses by default on your platform. If you set the datatype to float (or 64-bit int), you get the proper results:

import numpy as np

print 2 ** np.array([32, 33], dtype=np.float)
# [ 4.29496730e+09 8.58993459e+09 ]

print 2 ** np.array([32, 33], dtype=np.int64) # 64-bit int as suggested by PM 2Ring
# [ 4294967296 8589934592]

jax linear solve issues

See the bug report for the root cause.

It is a long-standing platform-specific behaviour, and it has just bitten one more time; see: numpy array dtype is coming as int32 by default in a windows 10 64 bit machine


