How to create a numpy array of arbitrary length strings?
You can do so by creating an array with dtype=object. If you assign a string to a normal numpy string array and the string is longer than the array's fixed element width, numpy truncates it:
>>> a = numpy.array(['apples', 'foobar', 'cowboy'])
>>> a[2] = 'bananas'
>>> a
array(['apples', 'foobar', 'banana'],
dtype='|S6')
But when you use dtype=object, you get an array of Python object references, so you have all the behaviors of Python strings:
>>> a = numpy.array(['apples', 'foobar', 'cowboy'], dtype=object)
>>> a
array([apples, foobar, cowboy], dtype=object)
>>> a[2] = 'bananas'
>>> a
array([apples, foobar, bananas], dtype=object)
Indeed, because it's an array of objects, you can assign any kind of python object to the array:
>>> a[2] = {1:2, 3:4}
>>> a
array([apples, foobar, {1: 2, 3: 4}], dtype=object)
However, this undoes a lot of the benefits of using numpy, which is so fast because it works on large contiguous blocks of raw memory. Working with python objects adds a lot of overhead. A simple example:
>>> a = numpy.array(['abba' for _ in range(10000)])
>>> b = numpy.array(['abba' for _ in range(10000)], dtype=object)
>>> %timeit a.copy()
100000 loops, best of 3: 2.51 us per loop
>>> %timeit b.copy()
10000 loops, best of 3: 48.4 us per loop
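As a minimal sketch of the difference on Python 3 (where the default string dtype is a fixed-width unicode type rather than '|S6'), the two kinds of array can be compared side by side. The variable names `fixed` and `flexible` are only illustrative:

```python
import numpy as np

# Fixed-width array: the element width is inferred from the longest input (6 chars).
fixed = np.array(['apples', 'foobar', 'cowboy'])
# Object array: each element is a reference to an ordinary Python str.
flexible = np.array(['apples', 'foobar', 'cowboy'], dtype=object)

fixed[2] = 'bananas'      # 7 characters: silently cut down to the 6-character width
flexible[2] = 'bananas'   # stored in full

print(fixed.dtype, fixed[2])
print(flexible.dtype, flexible[2])
```

The fixed-width array never raises on the oversized assignment, which is why the truncation is easy to miss.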
Numpy array of strings, value assignation
You can see it as a form of overflow.
Have a look at the exact types of your arrays:
>>> a.dtype
dtype('<U1') # Array of 1 unicode char
>>> b.dtype
dtype('<U4') # array of 4 unicode chars
When you define an array of strings, numpy tries to infer the smallest string size that can contain all the elements you defined.
- for a, 1 character is enough
- for b, TEST is 4 chars long
Then, when you assign a new value to any element of an array of strings, numpy will truncate the new value to the capacity of the array. Here, it keeps only the first letter of TEST, T.
Your slicing operation has nothing to do with it:
a = np.zeros(1, dtype=str)
a[0] = 'hello world'
print(a[0])
# h
How to overcome it
- define a with a dtype of object: numpy will not try to optimize its storage space anymore, and you'll get a predictable behaviour
- Increase the size of the char array: a = np.zeros(10, dtype='U256') will increase the capacity of each cell to 256 characters
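The two workarounds can be sketched next to the failing case. This is only an illustration; the names `narrow`, `wide` and `as_objects` are made up for the example:

```python
import numpy as np

# dtype=str on an empty/zero array gives a 1-character element width.
narrow = np.zeros(1, dtype=str)
narrow[0] = 'hello world'          # only 'h' survives

# Workaround 1: reserve an explicit width (here up to 256 characters per cell).
wide = np.zeros(1, dtype='U256')
wide[0] = 'hello world'

# Workaround 2: store references to Python str objects instead.
as_objects = np.zeros(1, dtype=object)
as_objects[0] = 'hello world'

print(repr(narrow[0]), repr(wide[0]), repr(as_objects[0]))
```

A wide fixed-width dtype reserves the full width for every cell, so the object dtype may be the more economical choice when a few strings are much longer than the rest.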
Initialise numpy array of unknown length
Build a Python list and convert that to a Numpy array. That takes amortized O(1) time per append + O(n) for the conversion to array, for a total of O(n).
import numpy as np

a = []
for x in y:  # y is any iterable whose length is not known in advance
    a.append(x)
a = np.array(a)
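For string data this pattern has a convenient side effect: the element width is inferred only at conversion time, from the longest string collected. A small sketch, where `collect` is a hypothetical helper and not a NumPy function:

```python
import numpy as np

def collect(items):
    """Accumulate items from any iterable in a list, then convert once."""
    buffer = []
    for item in items:
        buffer.append(item)
    return np.array(buffer)

# A generator stands in for a source whose length is not known up front.
words = collect(w for w in ['a', 'bcd', 'ef'])
print(words.dtype, words)
```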
Weird behaviour initializing a numpy array of string data
Numpy requires string arrays to have a fixed maximum length. When you create an empty array with dtype=str, it sets this maximum length to 1 by default. You can see this if you inspect my_array.dtype; it will show "|S1" (or "<U1" on Python 3), meaning "one-character string". Subsequent assignments into the array are truncated to fit this structure.
You can pass an explicit datatype with your maximum length by doing, e.g.:
my_array = numpy.empty([1, 2], dtype="S10")
The "S10" creates an array of strings of at most 10 characters each (on Python 3, "S10" means 10-byte byte strings; use "U10" for str). You have to decide what length is big enough to hold all the data you want to store.
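When the data is already at hand, the width does not have to be guessed. One possible approach, sketched here with a made-up `data` list and a unicode dtype for Python 3, is to derive it from the longest string:

```python
import numpy as np

data = ['foobarasdf.tif', 'arghtidlsarbda.jpg']   # example values only

# Use the longest string to pick the element width.
width = max(len(s) for s in data)
sized = np.empty(len(data), dtype=f'U{width}')
sized[:] = data

print(sized.dtype, sized)
```

Any later assignment longer than `width` would still be truncated, so this only helps when the maximum length is known before the array is filled.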
numpy recarray strings of variable length
Instead of using the STRING dtype, one can always use object as dtype. That will allow any object to be assigned to an array element, including Python variable length strings. For example:
>>> import numpy as np
>>> mydf = np.empty( (2,), dtype=[('file_name',object),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('foobarasdf.tif', 0.0), ('arghtidlsarbda.jpg', 0.0)],
dtype=[('file_name', '|O8'), ('file_size_mb', '<f8')])
It is against the spirit of the array concept to have variable length elements, but this is as close as one can get. The idea of an array is that elements are stored in memory at well-defined and regularly spaced memory addresses, which prohibits variable length elements. By storing pointers to strings in the array, one can circumvent this limitation. (This is basically what the above example does.)
Length of each string in a NumPy array
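You can use numpy's vectorize to apply len to every element. It is concise, though vectorize is essentially a Python-level loop, so the gain is mostly convenience rather than raw speed.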
mylen = np.vectorize(len)
print(mylen(arr))
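A short sketch comparing this with `np.char.str_len`, which performs the same computation for fixed-width string arrays. The sample arrays are invented for illustration:

```python
import numpy as np

arr = np.array(['apples', 'foobar', 'kiwi'])

# np.vectorize wraps len in an element-by-element loop.
mylen = np.vectorize(len)
lengths_vec = mylen(arr)

# np.char.str_len works directly on fixed-width string arrays.
lengths_char = np.char.str_len(arr)

# For object arrays of Python strings, the vectorize approach still applies.
obj = np.array(['apples', 'kiwi'], dtype=object)
lengths_obj = mylen(obj)

print(lengths_vec, lengths_char, lengths_obj)
```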
How to assign a string value to an array in numpy?
You get the error because NumPy's array is homogeneous, meaning it is a multidimensional table of elements all of the same type. This is different from a multidimensional list-of-lists in "regular" Python, where you can have objects of different type in a list.
Regular Python:
>>> CoverageACol = [[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]]
>>> CoverageACol[0][0] = "hello"
>>> CoverageACol
[['hello', 1, 2, 3, 4],
[5, 6, 7, 8, 9]]
NumPy:
>>> from numpy import *
>>> CoverageACol = arange(10).reshape(2,5)
>>> CoverageACol
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
>>> CoverageACol[0,0] = "Hello"
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/biogeek/<ipython console> in <module>()
ValueError: setting an array element with a sequence.
So it depends on what you want to achieve: why do you want to store a string in an array that is otherwise filled with numbers? If that really is what you want, you can set the datatype of the NumPy array to string:
>>> CoverageACol = array(range(10), dtype=str).reshape(2,5)
>>> CoverageACol
array([['0', '1', '2', '3', '4'],
['5', '6', '7', '8', '9']],
dtype='|S1')
>>> CoverageACol[0,0] = "Hello"
>>> CoverageACol
array([['H', '1', '2', '3', '4'],
['5', '6', '7', '8', '9']],
dtype='|S1')
Notice that only the first letter of Hello gets assigned. If you want the whole word to get assigned, you need to set an array-protocol type string:
>>> CoverageACol = array(range(10), dtype='S5').reshape(2,5)
>>> CoverageACol
array([['0', '1', '2', '3', '4'],
['5', '6', '7', '8', '9']],
dtype='|S5')
>>> CoverageACol[0,0] = "Hello"
>>> CoverageACol
array([['Hello', '1', '2', '3', '4'],
['5', '6', '7', '8', '9']],
dtype='|S5')
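If the array already exists with a width that is too small, it does not have to be rebuilt from scratch: `astype` can produce a copy with a wider string dtype. A minimal sketch on Python 3, using unicode dtypes and the illustrative names `coverage` and `wider`:

```python
import numpy as np

# Start from one-character strings '0'..'9', as in the example above.
coverage = np.arange(10).reshape(2, 5).astype('U1')
coverage[0, 0] = 'Hello'          # truncated to 'H'

# Copy into a wider dtype, then assign again.
wider = coverage.astype('U5')
wider[0, 0] = 'Hello'

print(coverage[0, 0], wider[0, 0])
```

Note that `astype` returns a new array, so the original `coverage` keeps its one-character width.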