How to Create a Numpy Array of Arbitrary Length Strings

How to create a numpy array of arbitrary length strings?

You can do so by creating an array of dtype=object. If you try to assign a long string to a normal numpy array, it truncates the string:

>>> a = numpy.array(['apples', 'foobar', 'cowboy'])
>>> a[2] = 'bananas'
>>> a
array(['apples', 'foobar', 'banana'],
dtype='|S6')

But when you use dtype=object, you get an array of python object references. So you can have all the behaviors of python strings:

>>> a = numpy.array(['apples', 'foobar', 'cowboy'], dtype=object)
>>> a
array([apples, foobar, cowboy], dtype=object)
>>> a[2] = 'bananas'
>>> a
array([apples, foobar, bananas], dtype=object)

Indeed, because it's an array of objects, you can assign any kind of python object to the array:

>>> a[2] = {1:2, 3:4}
>>> a
array([apples, foobar, {1: 2, 3: 4}], dtype=object)
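The two behaviours above can be compared side by side in a short script. This is a minimal sketch assuming Python 3, where NumPy stores plain strings as fixed-width unicode rather than the byte strings shown in the older output; the variable names fixed and flexible are only illustrative.

```python
import numpy as np

# Fixed-width array: NumPy sizes each slot for the longest initial string (6 chars).
fixed = np.array(['apples', 'foobar', 'cowboy'])
fixed[2] = 'bananas'   # 7 characters, silently cut to fit the 6-character slot
print(fixed[2])        # banana

# Object array: each slot holds a reference to an ordinary Python str.
flexible = np.array(['apples', 'foobar', 'cowboy'], dtype=object)
flexible[2] = 'bananas'
print(flexible[2])     # bananas
```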

However, this undoes a lot of the benefits of using numpy, which is so fast because it works on large contiguous blocks of raw memory. Working with python objects adds a lot of overhead. A simple example:

>>> a = numpy.array(['abba' for _ in range(10000)])
>>> b = numpy.array(['abba' for _ in range(10000)], dtype=object)
>>> %timeit a.copy()
100000 loops, best of 3: 2.51 us per loop
>>> %timeit b.copy()
10000 loops, best of 3: 48.4 us per loop
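Outside IPython, the same comparison can be approximated with the standard timeit module. This is a rough sketch; the repetition count is an arbitrary choice and the absolute numbers will differ from machine to machine, so no expected timings are given.

```python
import timeit
import numpy as np

a = np.array(['abba'] * 10000)                # fixed-width string array
b = np.array(['abba'] * 10000, dtype=object)  # array of Python object references

# Time 200 copies of each array; the repetition count is arbitrary.
t_fixed = timeit.timeit(a.copy, number=200)
t_object = timeit.timeit(b.copy, number=200)

print(f"fixed-width copy: {t_fixed:.5f} s")
print(f"object copy:      {t_object:.5f} s")
```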

Numpy array of strings, value assignation

You can see it as a form of overflow.

Have a look at the exact types of your arrays:

>>> a.dtype
dtype('<U1') # Array of 1 unicode char
>>> b.dtype
dtype('<U4') # array of 4 unicode chars

When you define an array of strings, numpy infers the smallest string size that can contain all the elements you defined.

  • for a, 1 character is enough
  • for b, TEST is 4 chars long

Then, when you assign a new value to any element of an array of strings, numpy truncates the new value to the capacity of the array. Here, it keeps only the first letter of TEST, T.

Your slicing operation has nothing to do with it:

import numpy as np

a = np.zeros(1, dtype=str)
a[0] = 'hello world'
print(a[0])
# h
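The size inference can be checked directly. The arrays from the original question are not shown here, so the contents below are assumed purely for illustration; only the lengths of the longest elements matter.

```python
import numpy as np

a = np.array(['x', 'y'])       # assumed contents: longest element has 1 character
b = np.array(['TEST', 'ab'])   # assumed contents: longest element has 4 characters

print(a.dtype == np.dtype('U1'))  # True
print(b.dtype == np.dtype('U4'))  # True

# Assigning a longer string into the 1-character array keeps only the first letter.
a[0] = 'TEST'
print(a[0])  # T
```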

How to overcome it

  1. define a with a dtype of object: numpy will not try to optimize its storage space anymore, and you'll get a predictable behaviour
  2. Increase the size of the char array: a = np.zeros(10, dtype='U256') will increase the capacity of each cell to 256 characters
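Both options can be sketched as follows. The names as_object and wide are illustrative only. Note the trade-off: the wide array reserves room for 256 characters in every cell, whether or not they are used.

```python
import numpy as np

# Option 1: object dtype, each cell references a Python str of any length.
as_object = np.zeros(1, dtype=object)
as_object[0] = 'hello world'
print(as_object[0])  # hello world

# Option 2: fixed-width unicode with room for 256 characters per cell.
wide = np.zeros(1, dtype='U256')
wide[0] = 'hello world'
print(wide[0])              # hello world
print(wide.dtype.itemsize)  # 1024 (4 bytes per unicode character)
```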

Initialise numpy array of unknown length

Build a Python list and convert that to a Numpy array. That takes amortized O(1) time per append + O(n) for the conversion to array, for a total of O(n).

    a = []
    for x in y:
        a.append(x)
    a = np.array(a)
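A self-contained version of this pattern might look like the following. The helper name collect and the generator feeding it are hypothetical stand-ins for whatever produces the values of unknown length.

```python
import numpy as np

def collect(values):
    """Accumulate items in a Python list, then convert once at the end."""
    items = []
    for x in values:
        items.append(x)      # amortized O(1) per append
    return np.array(items)   # single O(n) conversion

# The generator's length is not known in advance by collect().
result = collect(x * x for x in range(5))
print(result)  # [ 0  1  4  9 16]
```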

Weird behaviour initializing a numpy array of string data

Numpy requires string arrays to have a fixed maximum length. When you create an empty array with dtype=str, it sets this maximum length to 1 by default. You can see this by checking my_array.dtype; it will show "|S1" (or "<U1" on Python 3), meaning "one-character string". Subsequent assignments into the array are truncated to fit this structure.

You can pass an explicit datatype with your maximum length by doing, e.g.:

my_array = numpy.empty([1, 2], dtype="S10")

The "S10" will create an array of length-10 strings. You have to decide in advance what length is large enough to hold all the data you want to store.
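If the data is already available, the maximum length can be derived from it instead of guessed. This is a minimal sketch assuming Python 3, so it uses the unicode code "U" rather than the byte-string code "S"; the sample words are made up.

```python
import numpy as np

words = ['hi', 'hello world', 'numpy']   # made-up sample data

# Size each cell for the longest string actually present.
width = max(len(w) for w in words)
my_array = np.empty(len(words), dtype=f'U{width}')
my_array[:] = words

print(width)     # 11
print(my_array)  # ['hi' 'hello world' 'numpy']
```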

numpy recarray strings of variable length

Instead of using the STRING dtype, one can always use object as dtype. That will allow any object to be assigned to an array element, including Python variable length strings. For example:

>>> import numpy as np
>>> mydf = np.empty( (2,), dtype=[('file_name',object),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('foobarasdf.tif', 0.0), ('arghtidlsarbda.jpg', 0.0)],
dtype=[('file_name', '|O8'), ('file_size_mb', '<f8')])

It is against the spirit of the array concept to have variable length elements, but this is as close as one can get. The idea of an array is that elements are stored in memory at well-defined and regularly spaced memory addresses, which prohibits variable length elements. By storing the pointers to a string in an array, one can circumvent this limitation. (This is basically what the above example does.)

Length of each string in a NumPy array

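You can use numpy's vectorize to apply len to every element. It is convenient and concise, although internally it is essentially a Python-level loop, so do not expect a large speed-up over a list comprehension.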

import numpy as np

mylen = np.vectorize(len)
print(mylen(arr))
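For fixed-width string arrays, NumPy also provides np.char.str_len, which computes the lengths without wrapping a Python function. The sketch below compares the two approaches on a made-up sample array.

```python
import numpy as np

arr = np.array(['apples', 'foobar', 'fig'])   # made-up sample data

# Approach 1: wrap Python's len so it is applied element by element.
lengths_vectorize = np.vectorize(len)(arr)

# Approach 2: NumPy's own string-length routine for string arrays.
lengths_char = np.char.str_len(arr)

print(lengths_vectorize)  # [6 6 3]
print(lengths_char)       # [6 6 3]
```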

How to assign a string value to an array in numpy?

You get the error because NumPy's array is homogeneous, meaning it is a multidimensional table of elements all of the same type. This is different from a multidimensional list-of-lists in "regular" Python, where you can have objects of different type in a list.

Regular Python:

>>> CoverageACol = [[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]]

>>> CoverageACol[0][0] = "hello"

>>> CoverageACol
[['hello', 1, 2, 3, 4],
[5, 6, 7, 8, 9]]

NumPy:

>>> from numpy import *

>>> CoverageACol = arange(10).reshape(2,5)

>>> CoverageACol
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])

>>> CoverageACol[0,0] = "Hello"
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)

/home/biogeek/<ipython console> in <module>()

ValueError: setting an array element with a sequence.

So it depends on what you want to achieve: why do you want to store a string in an array that is otherwise filled with numbers? If that really is what you want, you can set the datatype of the NumPy array to string:

>>> CoverageACol = array(range(10), dtype=str).reshape(2,5)

>>> CoverageACol
array([['0', '1', '2', '3', '4'],
['5', '6', '7', '8', '9']],
dtype='|S1')

>>> CoverageACol[0,0] = "Hello"

>>> CoverageACol
array([['H', '1', '2', '3', '4'],
['5', '6', '7', '8', '9']],
dtype='|S1')

Notice that only the first letter of Hello gets assigned. If you want the whole word to get assigned, you need to set an array-protocol type string with an explicit length ('a5' is an older alias for 'S5', a 5-byte string):

>>> CoverageACol = array(range(10), dtype='a5').reshape(2,5)

>>> CoverageACol
array([['0', '1', '2', '3', '4'],
['5', '6', '7', '8', '9']],
dtype='|S5')

>>> CoverageACol[0,0] = "Hello"

>>> CoverageACol
array([['Hello', '1', '2', '3', '4'],
['5', '6', '7', '8', '9']],
dtype='|S5')
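On Python 3 the same idea can be sketched with a unicode dtype, or with an object array if the remaining cells should stay numeric. The variable names coverage and mixed are illustrative, and the width of 5 is chosen only to fit "Hello".

```python
import numpy as np

# Fixed-width unicode cells, 5 characters each: every value becomes a string.
coverage = np.arange(10).reshape(2, 5).astype('U5')
coverage[0, 0] = 'Hello'
print(coverage[0, 0], coverage[0, 1])  # Hello 1

# Object cells: the string and the integers coexist with their own types.
mixed = np.arange(10).reshape(2, 5).astype(object)
mixed[0, 0] = 'Hello'
print(type(mixed[0, 0]).__name__, type(mixed[0, 1]).__name__)  # str int
```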

