Python List VS. Array - When to Use

Python list vs. array – when to use?

Basically, Python lists are very flexible and can hold completely heterogeneous, arbitrary data, and they can be appended to very efficiently, in amortized constant time. If you need to shrink and grow your list time-efficiently and without hassle, they are the way to go. But they use a lot more space than C arrays, in part because each item in the list requires the construction of an individual Python object, even for data that could be represented with simple C types (e.g. float or uint64_t).

The array.array type, on the other hand, is just a thin wrapper around C arrays. It can hold only homogeneous data (that is to say, all of the same type), and so it uses only sizeof(one object) * length bytes of memory. Mostly, you should use it when you need to expose a C array to an extension or a system call (for example, ioctl or fcntl).
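A minimal sketch of that homogeneity constraint: an array of C doubles accepts only floats, and each item occupies a fixed number of bytes.

```python
from array import array

# An array of C doubles ('d'): homogeneous, fixed-width storage.
a = array('d', [1.0, 2.0, 3.0])
print(a.typecode)   # 'd'
print(a.itemsize)   # 8 bytes per item (C double)
a.append(4.5)       # fine: a float

try:
    a.append('x')   # not a number: the array rejects it
except TypeError as e:
    print('rejected:', e)
```

A list, by contrast, would happily accept the string; the array's type code is what buys the compact storage.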

array.array is also a reasonable way to represent a mutable string in Python 2.x (array('B', bytes)). However, Python 2.6+ and 3.x offer a mutable byte string as bytearray.
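A quick sketch of bytearray as that mutable byte string:

```python
# bytearray: a mutable byte string (Python 2.6+ and 3.x).
ba = bytearray(b'hello')
ba[0] = ord('H')        # in-place mutation; a bytes object would raise TypeError
ba.extend(b', world')
print(ba)               # bytearray(b'Hello, world')
```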

However, if you want to do math on a homogeneous array of numeric data, then you're much better off using NumPy, which can automatically vectorize operations on complex multi-dimensional arrays.
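A minimal sketch of the kind of vectorized math NumPy enables; arithmetic applies elementwise across a whole multi-dimensional array with no Python loop:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = a * 2 + 1          # vectorized over the entire 2-D array
print(b)               # [[3. 5.] [7. 9.]]
print(a.sum(axis=0))   # column sums: [4. 6.]
```

Neither operation is expressible this directly with array.array, which has no mathematical operations of its own.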

To make a long story short: array.array is useful when you need a homogeneous C array of data for reasons other than doing math.

Python List vs Array performance and profiles

Since your main concern is performance and you are dealing with numbers, Python's array module will be your answer. From the official Python 3 docs:

This module defines an object type which can compactly represent an array of basic values: characters, integers, floating point numbers. Arrays are sequence types and behave very much like lists, except that the type of objects stored in them is constrained. The type is specified at object creation time by using a type code, which is a single character. [The defined type codes are listed in a table in the official docs.]

This type constraint exists to allow an efficient array implementation on the interpreter side (CPython, for example). The type codes are a bridge between the dynamically typed side (Python) and the statically typed side (C, in the case of CPython).
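A small sketch of that bridge: each type code selects a fixed-size C type, and itemsize reports its width in bytes (some widths, such as 'i', are platform-dependent).

```python
from array import array

# Each type code maps to a C type; itemsize is its width in bytes.
for code in ('b', 'h', 'i', 'q', 'f', 'd'):
    print(code, array(code).itemsize)
# e.g. 'b' -> 1 (signed char), 'f' -> 4 (float), 'd' -> 8 (double)
```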

Using a list instead, you will usually take some performance loss, since a list can handle all types. One caveat: the performance loss is negligible for small data sets and low operation rates.

Python Arrays vs Lists

The arrays will take less space.

I've never used the array module; NumPy provides the same benefits plus many, many more.

what does python array module do and why should i use it instead of a list?

The difference between an array and a list is that the type of object stored in the array container is constrained. For example:

import array
a = array.array('i')

This initializes an array of signed ints.

A list allows you to have a combination of varying datatypes (both custom and basic) if desired. For example:

l = [13, 'hello']

You would choose an array over a list for efficiency purposes if you can ensure that the values it stores are all of the same, basic type.
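A sketch of that efficiency gain: for a million small ints, the list's container alone spends a pointer (typically 8 bytes) per slot, while array('i') stores raw C ints (typically 4 bytes each).

```python
import sys
from array import array

# One million small ints: list stores pointers to int objects,
# array('i') stores raw C ints.
n = 1_000_000
l = list(range(n))
a = array('i', range(n))
print(sys.getsizeof(l))   # container alone: ~8 bytes per slot
print(sys.getsizeof(a))   # ~4 bytes per item plus a small header
```

Note that sys.getsizeof(l) does not even count the int objects the list points to; the true gap is larger than these two numbers suggest.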

More information can be found here: Description of Array Module Usage

Python Array Memory Footprint versus List

You just picked the wrong example. The point of using array is when you need to store items whose native representation is smaller than that of a Python object reference. (Which seems to be 8 bytes here.) E.g. if you do:

import sys
from array import array
from os import urandom
a = array('B', urandom(1024))
l = list(a)
sys.getsizeof(a)  # => 1155
sys.getsizeof(l)  # => 9328

Since doubles are also 8 bytes wide, there really isn't a more compact way to store one than as its own 8 bytes.
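The same check with doubles, as a sketch: the two containers come out comparably sized, because a pointer and a double are both 8 bytes, but the list additionally keeps a full Python float object per value elsewhere in memory.

```python
import sys
from array import array

vals = [float(i) for i in range(1000)]
a = array('d', vals)
l = list(vals)
print(sys.getsizeof(a))  # ~8 bytes per double plus a header
print(sys.getsizeof(l))  # ~8 bytes per pointer: about the same size...

# ...but the list's float objects cost extra memory on top of that.
per_object = sum(sys.getsizeof(v) for v in vals)
print(per_object)        # the hidden per-object cost
```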


As for the rest of the claims in the book, take them with a grain of salt: you can't run Python code (that is, have operations executed by the Python interpreter) and be as fast as C. You still incur overhead when writing Python objects to the array or reading them from it; what would be faster is doing some sort of big operation over the entire array in a native function.
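One way to get such a big native operation, sketched here with NumPy: np.frombuffer wraps the array's memory without copying (array.array supports the buffer protocol), so the summation runs in native code rather than element by element in the interpreter.

```python
import numpy as np
from array import array

a = array('d', range(100_000))
v = np.frombuffer(a, dtype=np.float64)  # shares a's buffer, no copy
print(v.sum())                          # one native pass: 4999950000.0
```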

Python Array and a List

I think this explains it well. https://learnpython.com/blog/python-array-vs-list/

Some key reasons to use arrays instead of lists:

  1. Arrays store information more compactly, making them more efficient and powerful.
  2. Arrays are good for numerical operations.

Some key reasons to use lists instead of arrays:

  1. Lists are good for grouping together multiple data types.
  2. Lists are flexible to modify, growing and shrinking easily (note that array.array is mutable as well).

Though lists and arrays might have similar features, they are used for different cases.

Performance of Numpy Array vs Python List over 1D matrix(Vector)

TL;DR

Use np.sum() like this:

np.sum(tr_y.labels.to_numpy()==5)/len(tr_y)

Testing

Now let's do some experiments. We will use the following setup:

import pandas as pd
import numpy as np
tr_y = pd.DataFrame({'labels': [6, 5, 6, 5, 6]*200000})

We use a larger dataset to see whether the methods scale well to larger inputs. Here we will have a dataset with 1,000,000 rows. We will try a couple of different methods and see how they perform.

The worst performer is:

sum(tr_y.labels.to_numpy()==5)/len(tr_y) 

1.91 s ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The next option is on average 14 times faster:

y_list = tr_y.to_numpy().tolist()
sum([1 if y == 5 else 0 for y in y_list]) / len(y_list)

132 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

After that we get a further 1.6 times speedup with:

sum(tr_y.labels==5)/len(tr_y)

79.3 ms ± 796 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

None of these methods, however, is optimised with NumPy. They use NumPy arrays but are bogged down by Python's built-in sum(). If we use the optimised NumPy version we get:

np.sum(tr_y.labels.to_numpy()==5)/len(tr_y)

1.36 ms ± 6.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This operation was on average 58 times faster than our previous best. This is more like the power of NumPy that we were promised. By using np.sum() instead of Python's standard sum(), we are able to do the same operation about 1,400 times faster (1.9 s vs 1.4 ms).

Closing Notes

Since Pandas Series are built on NumPy arrays, the following code gives very similar performance to our optimal setup:

np.sum(tr_y.labels==5)/len(tr_y)

1.83 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Unless optimizing your code is essential, I would personally go for this option as it is the clearest to read without losing much performance.

Python: list vs. np.array: switching to use certain attributes

A numpy array:

>>> A=np.array([1,4,9,2,7])

delete:

>>> A=np.delete(A, [2,3])
>>> A
array([1, 4, 7])

append (beware: it's O(n), unlike list.append, which is amortized O(1)):

>>> A=np.append(A, [5,0])
>>> A
array([1, 4, 7, 5, 0])

sort:

>>> np.sort(A)
array([0, 1, 4, 5, 7])

index:

>>> A
array([1, 4, 7, 5, 0])
>>> np.where(A==7)
(array([2]),)

Numpy array vs python list with time measurement

It's misleading to say that code with arrays simply runs faster. In general, when people say code gets faster with NumPy arrays, they mean the benefit of vectorizing the code, which can then run faster with NumPy's performant implementations of functions for array operations and manipulation. I wrote a script that compares the second code you provided (slightly modified to save values in a list, so the results can be compared) against a vectorized version. You can see that vectorizing leads to a significant reduction in run time, and the results given by the two methods are equal:

from time import time
import numpy as np

# We use a list and apply the operation to every element in it => returns a list
def stresslist(STEEL_E_MODULE, strain_in_rebars_list, STEEL_STRENGTH):
    stress_list = []
    for strain_in_rebars in strain_in_rebars_list:
        STRESS = STEEL_E_MODULE * strain_in_rebars
        if STRESS <= -STEEL_STRENGTH:
            stress = -STEEL_STRENGTH
        elif STRESS >= -STEEL_STRENGTH and STRESS < 0:
            stress = STRESS
        else:
            stress = min(STRESS, STEEL_STRENGTH)
        stress_list.append(stress)
    return stress_list

# We use a NumPy array and apply the operation to every element in it => returns an array
def stressnumpy(STEEL_E_MODULE, strain_in_rebars_array, STEEL_STRENGTH):
    STRESS = STEEL_E_MODULE * strain_in_rebars_array
    stress_array = np.where(
        STRESS <= -STEEL_STRENGTH, -STEEL_STRENGTH,
        np.where(
            np.logical_and(STRESS >= -STEEL_STRENGTH, STRESS < 0), STRESS,
            np.minimum(STRESS, STEEL_STRENGTH)
        )
    )
    return stress_array

t_start = time()
x = stresslist(2, list(np.arange(-1000, 1000, 0.01)), 20)
print(f'time with list: {time()-t_start}s')

t_start = time()
y = stressnumpy(2, np.arange(-1000, 1000, 0.01), 20)
print(f'time with numpy: {time()-t_start}s')

print('Results are equal!' if np.allclose(y, np.array(x)) else 'Results differ!')

Output:

% python3 script.py
time with list: 0.24164390563964844s
time with numpy: 0.003011941909790039s
Results are equal!

Please do not hesitate to ask if you have questions about the code. You can also refer to the official documentation for numpy.where, numpy.logical_and and numpy.minimum.

Python numpy array vs list

Your first example could be sped up. Python loops and access to individual items of a NumPy array are slow. Use vectorized operations instead:

import numpy as np
x = np.arange(1000000).cumsum()
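A sketch contrasting the explicit per-element loop with the vectorized call; both compute the same running totals, but the loop pays Python-level overhead on every element access:

```python
import numpy as np

n = 10_000
loop = np.zeros(n)
total = 0
for i in range(n):
    total += i                # Python-level work on every iteration
    loop[i] = total
vec = np.arange(n).cumsum()   # one native pass over the array

print(np.array_equal(loop, vec))  # True
```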

You can put unbounded Python integers into a NumPy array:

a = np.array([0], dtype=object)
a[0] += 1232234234234324353453453

Arithmetic operations would be slower in this case than with fixed-size C integers.


