What Are the Advantages of NumPy Over Regular Python Lists?

What are the advantages of NumPy over regular Python lists?

NumPy's arrays are more compact than Python lists -- a 3D list of lists holding a million numeric cells would take at least 20 MB or so, while a NumPy 3D array of single-precision floats holding the same data fits in 4 MB. Reading and writing items is also faster with NumPy.

Maybe you don't care that much for just a million cells, but you definitely would for a billion cells -- neither approach would fit in a 32-bit architecture, but with 64-bit builds NumPy would get away with 4 GB or so, while Python alone would need at least about 12 GB (lots of pointers, which double in size) -- a much costlier piece of hardware!

The difference is mostly due to "indirectness" -- a Python list is an array of pointers to Python objects: at least 4 bytes per pointer, plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value -- and the memory allocator rounds up to 16). A NumPy array is an array of uniform values -- single-precision numbers take 4 bytes each, double-precision ones 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
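
If you want to check the arithmetic yourself, here is a rough sketch (the numbers assume a 64-bit CPython 3 build; sys.getsizeof ignores some allocator overhead, so treat them as approximate lower bounds):

import sys
import numpy as np

n = 1_000_000
lst = list(range(n))                    # one pointer per cell, plus an int object behind each pointer
arr = np.arange(n, dtype=np.float32)    # 4 bytes per cell, stored inline

list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
print(f"list:  ~{list_bytes / 1e6:.0f} MB")    # roughly 36 MB on a typical 64-bit CPython 3
print(f"array: ~{arr.nbytes / 1e6:.0f} MB")    # exactly 4 MB of payload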

What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regards to SPEED?

The biggest usual benefits of numpy, as far as speed goes, come from being able to vectorize operations, which means you replace a Python loop around a Python function call with a C loop around some inlined C (or even custom SIMD assembly) code. There are probably no built-in vectorized operations for arrays of mpfr objects, so that main benefit vanishes.
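
For concreteness, this is the contrast being described, sketched with a plain float dtype (where vectorized kernels do exist, unlike the mpfr-object case above):

import numpy as np

a = np.random.rand(1_000_000)

# Vectorized: one C loop over the raw float64 buffer.
roots = np.sqrt(a)

# Pure-Python equivalent: bytecode dispatch plus a boxed float per element.
roots_slow = [x ** 0.5 for x in a]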

However, there are some places where you'll still benefit:

  • Some operations that would require a copy in pure Python are essentially free in numpy—transposing a 2D array, slicing a column or a row, even reshaping are all done by wrapping a pointer to the same underlying data with different striding information (see the sketch after this list). Since your initial question specifically asked about A.T, yes, this is essentially free.
  • Many operations can be performed in-place more easily in numpy than in Python, which can save you some more copies.
  • Even when a copy is needed, it's faster to bulk-copy a big array of memory and then refcount all of the objects than to iterate through nested lists deep-copying them all the way down.
  • It's a lot easier to write your own custom Cython code to vectorize an arbitrary operation with numpy than with Python.
  • You can still get some benefit from using np.vectorize around a normal Python function (also sketched below), pretty much on the same order as the benefit you get from a list comprehension over a for statement.
  • Within certain size ranges, if you're careful to use the appropriate striding, numpy can allow you to optimize cache locality (or VM swapping, at larger sizes) relatively easily, while there's really no way to do that at all with lists of lists. This is much less of a win when you're dealing with an array of pointers to objects that could be scattered all over memory than when dealing with values that can be embedded directly in the array, but it's still something.
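
A minimal sketch of the "free view" point above: each of these operations reuses the original buffer and only changes shape/stride metadata, which you can confirm via the .base attribute.

import numpy as np

a = np.zeros((3, 4))     # an array that owns its buffer

t = a.T                  # transpose: a view, no copy
col = a[:, 1]            # one column: a view, no copy
r = a.reshape(4, 3)      # contiguous reshape: also a view

print(t.base is a, col.base is a, r.base is a)   # True True True

t[0, 0] = 99.0           # writing through the view...
print(a[0, 0])           # ...changes the original: 99.0

And the np.vectorize point: it does not push the loop into C (your Python function is still called once per element), but it handles broadcasting and output allocation for you:

double_plus_one = np.vectorize(lambda x: x * 2 + 1)   # still a per-element Python call
print(double_plus_one(np.arange(5)))                  # [1 3 5 7 9]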

As for disadvantages… well, one obvious one is that using numpy restricts you to CPython or sometimes PyPy (hopefully in the future that "sometimes" will become "almost always", but it's not quite there as of 2014); if your code would run faster in Jython or IronPython or non-NumPyPy PyPy, that could be a good reason to stick with lists.

NumPy Array vs. a regular Python list


The main advantage of numpy arrays is that they are much, much faster than Python lists when performing most numerical operations. For instance, multiplying every element of a sequence by a constant, or multiplying every element of one sequence by the corresponding element of another sequence, is much faster in numpy. In addition, for multidimensional structures, numpy arrays support more powerful indexing, for instance allowing you to slice by both rows and columns.
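
A few illustrative lines (toy data, purely for demonstration):

import numpy as np

a = np.arange(12).reshape(3, 4)
b = np.ones((3, 4))

a * 10        # every element times a constant, no explicit loop
a * b         # elementwise product of two sequences
a[1:, 2:]     # slice by rows and columns in one expression
a[a > 5]      # boolean indexing, awkward to express with nested lists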

Because of this fundamental advantage, numpy arrays have become the de facto standard for basically every Python project that does heavy number crunching. This means that many other tools are built on numpy arrays (for instance, graphing with matplotlib, machine learning with scikit-learn, etc.).

Are numpy arrays faster than lists?

Introducing NumPy into your code means introducing a different way of thinking about problems.

NumPy arrays are generally more expensive to create, but blazing fast for vectorized computation.

I think this is more or less what you are looking for:

from timeit import timeit

timeit(stmt='list(map(math.sqrt, a_list))', setup='import math; a_list = list(range(1, 100000))', number=1000)
# 8.64

vs:

timeit(stmt='numpy.sqrt(a_list)', setup='import numpy; a_list = numpy.arange(100000)', number=1000)
# 0.4

These are the timeit results for the computation alone: the setup string (importing the libraries and building the input) is excluded from the timed statement.
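
To see the creation-phase cost mentioned above, you can time the list-to-array conversion together with the computation; the exact figure depends on your machine, so none is quoted here:

from timeit import timeit

# Include the list-to-array conversion inside the timed statement:
timeit(stmt='numpy.sqrt(numpy.array(a_list))',
       setup='import numpy; a_list = list(range(1, 100000))',
       number=1000)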

NumPy's ndarrays vs Python's lists

Python lists are bulkier: they're basically arrays of pointers, which take up far more memory than NumPy's ndarrays. As a result, ndarrays are the better option for mathematical operations involving matrices and heavy calculation, most such operations have been optimized for NumPy, and there are far more mathematically useful functions available for ndarrays.

Python lists are much more flexible, though. They can hold heterogeneous, arbitrary data, and appending (or removing from the end) is very efficient. If you'd like to add and remove many different kinds of objects, Python lists are the way to go.
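
A small illustration of that flexibility gap (note the type coercion NumPy applies):

import numpy as np

mixed = [42, "text", [1.5, None]]   # heterogeneous contents are fine in a list
mixed.append(object())              # appending is cheap (amortized O(1))

np.array([1, "a"])                  # NumPy coerces everything to one dtype:
                                    # array(['1', 'a'], dtype='<U21')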

For the purposes of machine learning, ndarrays are definitely your best bet. TensorFlow and Keras, two of the most popular machine learning libraries, work best with NumPy's memory-efficient arrays because they deal with large amounts of homogeneous data.

Performance of NumPy Array vs Python List over a 1D Matrix (Vector)

TL;DR

Use np.sum() like this:

np.sum(tr_y.labels.to_numpy()==5)/len(tr_y)

Testing

Now let's do some experiments. We will use the following setup:

import pandas as pd
import numpy as np
tr_y = pd.DataFrame({'labels': [6, 5, 6, 5, 6]*200000})

We use a larger dataset to see whether the methods scale well to larger inputs. Here we will have a dataset with 1,000,000 rows. We will try a couple of different methods and see how they perform.

The worst performer is:

sum(tr_y.labels.to_numpy()==5)/len(tr_y) 

1.91 s ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The next option is on average 14 times faster:

y_list = tr_y.labels.to_numpy().tolist()  # take the column so we get a flat list, not a list of one-element lists
sum([1 if y == 5 else 0 for y in y_list]) / len(y_list)

132 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The next option is another 1.7 times faster:

sum(tr_y.labels==5)/len(tr_y)

79.3 ms ± 796 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

None of these methods, however, is optimized with NumPy. They use NumPy arrays but are bogged down by Python's sum(). If we use the optimized NumPy version, we get:

np.sum(tr_y.labels.to_numpy()==5)/len(tr_y)

1.36 ms ± 6.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This operation was on average 58 times faster than our previous best. This is more like the power of NumPy that we were promised. By using np.sum() instead of Python's standard sum(), we are able to do the same operation about 1,400 times faster (1.9 s vs 1.4 ms).
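
The reason for the gap is where the loop runs: Python's built-in sum() pulls each element out as a boxed Python object, while np.sum() reduces the buffer in a single C loop. Schematically, using the setup from above:

mask = tr_y.labels.to_numpy() == 5   # a plain boolean ndarray

sum(mask)      # Python-level iteration, one boxed object per element
np.sum(mask)   # one C reduction over the raw buffer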

Closing Notes

Since pandas Series are built on NumPy arrays, the following code gives very similar performance to our optimal setup:

np.sum(tr_y.labels==5)/len(tr_y)

1.83 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Unless optimizing your code is essential, I would personally go for this option as it is the clearest to read without losing much performance.


