How Does _Contains_ Work for Ndarrays

How does contains work for ndarrays?

I found the source for ndarray.__contains__, in numpy/core/src/multiarray/sequence.c. As a comment in the source states,

thing in x

is equivalent to

(x == thing).any()

for an ndarray x, regardless of the dimensions of x and thing. This only makes sense when thing is a scalar; the results of broadcasting when thing isn't a scalar cause the weird results I observed, as well as oddities like array([1, 2, 3]) in array(1) that I didn't think to try. The exact source is

static int
array_contains(PyArrayObject *self, PyObject *el)
{
    /* equivalent to (self == el).any() */

    int ret;
    PyObject *res, *any;

    res = PyArray_EnsureAnyArray(PyObject_RichCompare((PyObject *)self,
                                                      el, Py_EQ));
    if (res == NULL) {
        return -1;
    }
    any = PyArray_Any((PyArrayObject *)res, NPY_MAXDIMS, NULL);
    Py_DECREF(res);
    ret = PyObject_IsTrue(any);
    Py_DECREF(any);
    return ret;
}

Concatenate a NumPy array to another NumPy array

In [1]: import numpy as np

In [2]: a = np.array([[1, 2, 3], [4, 5, 6]])

In [3]: b = np.array([[9, 8, 7], [6, 5, 4]])

In [4]: np.concatenate((a, b))
Out[4]: 
array([[1, 2, 3],
       [4, 5, 6],
       [9, 8, 7],
       [6, 5, 4]])

or this:

In [1]: a = np.array([1, 2, 3])

In [2]: b = np.array([4, 5, 6])

In [3]: np.vstack((a, b))
Out[3]: 
array([[1, 2, 3],
       [4, 5, 6]])

How do I count the occurrence of a certain item in an ndarray?

Using numpy.unique:

import numpy
a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
unique, counts = numpy.unique(a, return_counts=True)

>>> dict(zip(unique, counts))
{0: 7, 1: 4, 2: 1, 3: 2, 4: 1}

Non-numpy method using collections.Counter;

import collections, numpy
a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
counter = collections.Counter(a)

>>> counter
Counter({0: 7, 1: 4, 3: 2, 2: 1, 4: 1})

Why does numpy have a corresponding function for many ndarray methods?

As others have noted, the identically-named NumPy functions and array methods are often equivalent (they end up calling the same underlying code). One might be preferred over the other if it makes for easier reading.

However, in some instances the two behave different slightly differently. In particular, using the ndarray method sometimes emphasises the fact that the method is modifying the array in-place.

For example, np.resize returns a new array with the specified shape. On the other hand, ndarray.resize changes the shape of the array in-place. The fill values used in each case are also different.

Similarly, a.sort() sorts the array a in-place, while np.sort(a) returns a sorted copy.

Python numpy array of numpy arrays

Never append to numpy arrays in a loop: it is the one operation that NumPy is very bad at compared with basic Python. This is because you are making a full copy of the data each append, which will cost you quadratic time.

Instead, just append your arrays to a Python list and convert it at the end; the result is simpler and faster:

a = []

while ...:
    b = ... # NumPy array
    a.append(b)
a = np.asarray(a)

As for why your code doesn't work: np.append doesn't behave like list.append at all. In particular, it won't create new dimensions when appending. You would have to create the initial array with two dimensions, then append with an explicit axis argument.

How to convert list of numpy arrays into single numpy array?

In general you can concatenate a whole sequence of arrays along any axis:

numpy.concatenate( LIST, axis=0 )

but you do have to worry about the shape and dimensionality of each array in the list (for a 2-dimensional 3x5 output, you need to ensure that they are all 2-dimensional n-by-5 arrays already). If you want to concatenate 1-dimensional arrays as the rows of a 2-dimensional output, you need to expand their dimensionality.

As Jorge's answer points out, there is also the function stack, introduced in numpy 1.10:

numpy.stack( LIST, axis=0 )

This takes the complementary approach: it creates a new view of each input array and adds an extra dimension (in this case, on the left, so each n-element 1D array becomes a 1-by-n 2D array) before concatenating. It will only work if all the input arrays have the same shape.

vstack (or equivalently row_stack) is often an easier-to-use solution because it will take a sequence of 1- and/or 2-dimensional arrays and expand the dimensionality automatically where necessary and only where necessary, before concatenating the whole list together. Where a new dimension is required, it is added on the left. Again, you can concatenate a whole list at once without needing to iterate:

numpy.vstack( LIST )

This flexible behavior is also exhibited by the syntactic shortcut numpy.r_[ array1, ...., arrayN ] (note the square brackets). This is good for concatenating a few explicitly-named arrays but is no good for your situation because this syntax will not accept a sequence of arrays, like your LIST.

There is also an analogous function column_stack and shortcut c_[...], for horizontal (column-wise) stacking, as well as an almost-analogous function hstack—although for some reason the latter is less flexible (it is stricter about input arrays' dimensionality, and tries to concatenate 1-D arrays end-to-end instead of treating them as columns).

Finally, in the specific case of vertical stacking of 1-D arrays, the following also works:

numpy.array( LIST )

...because arrays can be constructed out of a sequence of other arrays, adding a new dimension to the beginning.

What is the fastest way to stack numpy arrays in a loop?

What @hpaulj was trying to say with

Stick with list append when doing loops.

#use a normal list
result_arr = []

for label in labels_set:

    data_transform = pca.fit_transform(data_sub_tfidf) 

    # append the data_transform object to that list
    # Note: this is not np.append(), which is slow here
    result_arr.append(data_transform)

# and stack it after the loop
# This prevents slow memory allocation in the loop. 
# So only one large chunk of memory is allocated since
# the final size of the concatenated array is known.

result_arr = np.concatenate(result_arr)

# or 
result_arr = np.stack(result_arr, axis=0)

# or
result_arr = np.vstack(result_arr)

Your arrays don't really have different dimensions. They have one different dimension, the other one is identical. And in that case you can always stack along the "different" dimension.

How Does _Contains_ Work for Ndarrays