numpy array dtype is coming as int32 by default in a windows 10 64 bit machine
The original poster, Prana, asked a very good question: "Why is the integer default set to 32-bit, on a 64-bit machine?"
As near as I can tell, the short answer is: "Because it was designed wrong."
It seems obvious that a 64-bit machine should default-define an integer in any associated interpreter as 64-bit. But of course, the two answers explain why this is not the case. Things are now different, and so I offer this update.
What I notice is that for both CentOS-7.4 Linux and MacOS 10.10.5 (the new and the old), running Python 2.7.14 with Numpy 1.14.0 (as of January 2018), the default integer is now defined as 64-bit. (The "my_array.dtype" in the initial example would now report "dtype('int64')" on both platforms.)
Using 32-bit integers as the default integer in any interpreter can result in very squirrelly results if you are doing integer math, as this question pointed out:
Using numpy to square value gives negative number
It appears now that Python and Numpy have been updated and revised (corrected, one might argue), so that to replicate the problem described in the question above, you have to explicitly define the Numpy array as int32.
In Python, on both platforms now, default integer looks to be int64. This code runs the same on both platforms (CentOS-7.4 and MacOSX 10.10.5):
>>> import numpy as np
>>> tlist = [1, 2, 47852]
>>> t_array = np.asarray(tlist)
>>> t_array.dtype
dtype('int64')
>>> print t_array ** 2
[ 1 4 2289813904]
But if we make t_array a 32-bit integer, one gets the following, because the integer calculation rolls over the sign bit in the 32-bit word.
>>> t_array32 = np.asarray(tlist, dtype=np.int32)
>>> t_array32.dtype
dtype('int32')
>>> print t_array32 ** 2
[ 1 4 -2005153392]
The reason for using int32 is of course, efficiency. There are some situations (such as using TensorFlow or other neural-network machine learning tools), where you want to use 32-bit representations (mostly float, of course), as the speed gains versus using 64-bit floats, can be quite significant.
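The memory half of that trade-off is easy to quantify with an array's itemsize and nbytes attributes (a minimal sketch; the array contents are just an illustration):

```python
import numpy as np

# Same values, two widths: int64 uses twice the memory of int32.
a64 = np.arange(1_000_000, dtype=np.int64)
a32 = a64.astype(np.int32)

print(a64.itemsize, a64.nbytes)  # 8 bytes per element, 8000000 bytes total
print(a32.itemsize, a32.nbytes)  # 4 bytes per element, 4000000 bytes total
```

Halving the element width also halves the memory bandwidth needed per operation, which is where much of the speed gain comes from.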
Numpy returns different results on windows and unix
I have enough comments that I think an "answer" is warranted.
What happened?
"Not sure how it got a negative from all those positives" - as @rafaelc points out, you ran into an integer overflow. You can read more details at the Wikipedia link that was provided.
What caused the overflow?
According to this thread, numpy uses the operating system's C long type as the default dtype for integers. So when you write this line of code:
c = np.array(c)
the dtype defaults to numpy's default integer data type, which is the operating system's C long. The size of a long in Microsoft's C implementation for Windows is 4 bytes (x 8 bits/byte = 32 bits), so your dtype defaults to a 32-bit integer.
Why did this calculation overflow?
In [1]: import numpy as np
In [2]: np.iinfo(np.int32)
Out[2]: iinfo(min=-2147483648, max=2147483647, dtype=int32)
The largest number a 32-bit, signed integer data type can represent is 2147483647
. If you take a look at your product across just one axis:
In [5]: c * c.T
Out[5]:
array([[ 1, 8, 21],
[ 8, 25, 48],
[21, 48, 81]])
In [6]: (c * c.T).prod(axis=0)
Out[6]: array([ 168, 9600, 81648])
In [7]: 168 * 9600 * 81648
Out[7]: 131681894400
You can see that 131681894400 >> 2147483647 (in mathematics, the notation >> means "is much, much larger"). Since 131681894400 is much larger than the maximum integer the 32-bit long can represent, an overflow occurs.
But it's fine in Linux
In Linux, a long is 8 bytes (x 8 bits/byte = 64 bits). Why? Here's an SO thread that discusses this in the comments.
"Is it a bug?"
No, although it's pretty annoying, I'll admit.
For what it's worth, it's usually a good idea to be explicit about your data types, so next time:
c = np.array(c, dtype='int64')
# or
c = np.array(c, dtype=np.int64)
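For example, assuming the original c was the 3x3 array [[1, 2, 3], [4, 5, 6], [7, 8, 9]] (an assumption, but it reproduces the intermediate products shown above), an explicit 64-bit dtype gives the exact result:

```python
import numpy as np

# Explicit int64 avoids the 32-bit wraparound regardless of platform.
c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.int64)
products = (c * c.T).prod(axis=0)
print(products)          # [  168  9600 81648]
print(products.prod())   # 131681894400, well past the int32 maximum
```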
Who do I report a bug to?
Again, this isn't a bug, but if it were, you'd open an issue on the numpy GitHub (where you can also peruse the source code). Somewhere in there is proof of how numpy uses the operating system's default C long, but I don't have it in me to go digging around to find it.
I've discovered something strange
You are generating integers that are larger than what your platform's default numpy integer type can handle, so you are overflowing your numbers.
25 to the seventh power requires 33 bits:
>>> (25 ** 7).bit_length()
33
Python's built-in integer type is unbounded, but numpy uses bounded, fixed-size integers, and for your numpy setup the default signed integer type is int32, so it fails to fit this value.
I can reproduce the exact same output if I tell numpy to convert the output to int32:
>>> np.sum(np.power(range(26), 7)).astype(np.int32)
792709145
but because I am running MacOS on a 64-bit CPU, numpy uses int64 as the default integer type and so produces the correct result:
>>> np.sum(np.power(range(26), 7))
22267545625
>>> np.sum(np.power(range(26), 7)).dtype
dtype('int64')
792709145 is the integer value represented by the bottom 31 bits:
>>> print(int(format(22267545625, 'b')[-31:], 2))
792709145
On Windows, however, the default numpy integer type is int32 because, even on 64-bit hardware, Windows defines the C long int as a 32-bit value. You should tell numpy to create an array of int64 values here:
np.sum(np.power(np.arange(26, dtype=np.int64), 7))
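The wraparound is easy to reproduce on any platform by computing the exact sum with Python's unbounded ints and then casting down to int32 (a sketch; astype performs a C-style truncating cast that keeps only the low 32 bits):

```python
import numpy as np

true_value = sum(i ** 7 for i in range(26))      # exact: Python ints never overflow
wrapped = np.int64(true_value).astype(np.int32)  # keep only the low 32 bits

print(true_value)    # 22267545625
print(int(wrapped))  # 792709145
```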
Pandas Dataframe: Why is astype method producing int32 results with an argument of int
Pandas uses numpy datatypes under the hood. From the numpy documentation:
The default NumPy behavior is to create arrays in either 32 or 64-bit signed integers (platform dependent and matches C int size) or double precision floating point numbers, int32/int64 and float, respectively. If you expect your integer arrays to be a specific type, then you need to specify the dtype while you create the array.
It is not a bug, and you should be specifying dtypes if you have a specific use or want to be platform agnostic. To rephrase your question: what is np.dtype(int) on my platform? On Windows, as some of the comments suggest, it appears to be a C signed long (32 bits). You can even get numpy to throw an overflow error to confirm this.
>>> import numpy as np
>>> np.array([2_147_483_648], dtype=int)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long
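The boundary is exact: the int32 maximum reported by np.iinfo is accepted, while one past it is not (a sketch; recent NumPy raises OverflowError for the out-of-range value, whereas very old releases silently wrapped, and the exact exception text varies by version):

```python
import numpy as np

info = np.iinfo(np.int32)
ok = np.array([info.max], dtype=np.int32)  # 2147483647 fits exactly in 32 bits
print(ok)

try:
    np.array([info.max + 1], dtype=np.int32)  # 2147483648 needs a 32nd magnitude bit
except OverflowError as exc:
    print('overflowed:', exc)
```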
Why is the dtype shown (even if it's the native one) when using floor division with NumPy?
You actually have multiple distinct 32-bit integer dtypes here. This is probably a bug.
NumPy has (accidentally?) created multiple distinct signed 32-bit integer types, probably corresponding to C int and long. Both of them display as numpy.int32, but they're actually different objects. At C level, I believe the type objects are PyIntArrType_Type and PyLongArrType_Type, generated here.
dtype objects have a type attribute corresponding to the type object of scalars of that dtype. It is this type attribute that NumPy inspects when deciding whether to print dtype information in an array's repr:
_typelessdata = [int_, float_, complex_]
if issubclass(intc, int):
_typelessdata.append(intc)
if issubclass(longlong, int):
_typelessdata.append(longlong)
...
def array_repr(arr, max_line_width=None, precision=None, suppress_small=None):
...
skipdtype = (arr.dtype.type in _typelessdata) and arr.size > 0
if skipdtype:
return "%s(%s)" % (class_name, lst)
else:
...
return "%s(%s,%sdtype=%s)" % (class_name, lst, lf, typename)
On numpy.arange(5) and numpy.arange(5) + 3, .dtype.type is numpy.int_; on numpy.arange(5) // 3 or numpy.arange(5) % 3, .dtype.type is the other 32-bit signed integer type.
As for why + and // have different output dtypes, they use different type resolution routines. Here's the one for //, and here's the one for +. //'s type resolution looks for a ufunc inner loop that takes types the inputs can be safely cast to, while +'s type resolution applies NumPy type promotion to the arguments and picks the loop matching the resulting type.
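You can inspect the scalar type object behind each result to check whether your build exhibits the mismatch (a sketch; on current NumPy releases the two operations resolve to the same type object, unlike the build discussed above):

```python
import numpy as np

a = np.arange(5)
add_type = (a + 3).dtype.type   # scalar type chosen by +'s type resolution
div_type = (a // 3).dtype.type  # scalar type chosen by //'s type resolution

# Compare identity of the two scalar type objects, not just dtype equality.
print(add_type, div_type)
print(add_type is div_type)
```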
A bug of Python's power operator **?
These values are too big to store in a 32-bit int, which numpy uses by default. If you set the datatype to float (or 64-bit int) you get the proper results:
import numpy as np
print 2 ** np.array([32, 33], dtype=np.float)
# [ 4.29496730e+09  8.58993459e+09]
print 2 ** np.array([32, 33], dtype=np.int64) # 64-bit int as suggested by PM 2Ring
# [ 4294967296 8589934592]
jax linear solve issues
See the bug report for the root cause.
It is a long-standing platform-specific behaviour, and it just bites one more time; see: numpy array dtype is coming as int32 by default in a windows 10 64 bit machine