Python: Fastest Way to Create a List of N Lists

Python: fastest way to create a list of n lists

Probably the only way that is marginally faster than

d = [[] for x in xrange(n)]

is

from itertools import repeat
d = [[] for i in repeat(None, n)]

It does not have to create a new int object on each iteration and is about 15% faster on my machine.

Edit: Using NumPy, you can avoid the Python loop using

import numpy
d = numpy.empty((n, 0)).tolist()

but this is actually 2.5 times slower than the list comprehension.

Best and/or fastest way to create lists in Python

Let's run some time tests* with timeit.timeit:

>>> from timeit import timeit
>>>
>>> # Test 1
>>> test = """
... my_list = []
... for i in xrange(50):
...     my_list.append(0)
... """
>>> timeit(test)
22.384258893239178
>>>
>>> # Test 2
>>> test = """
... my_list = []
... for i in xrange(50):
...     my_list += [0]
... """
>>> timeit(test)
34.494779364416445
>>>
>>> # Test 3
>>> test = "my_list = [0 for i in xrange(50)]"
>>> timeit(test)
9.490926919482774
>>>
>>> # Test 4
>>> test = "my_list = [0] * 50"
>>> timeit(test)
1.5340533503559755
>>>

As you can see above, the last method is the fastest by far.


However, it should only be used with immutable items (such as integers). This is because it will create a list with references to the same item.

Below is a demonstration:

>>> lst = [[]] * 3
>>> lst
[[], [], []]
>>> # The ids of the items in `lst` are the same
>>> id(lst[0])
28734408
>>> id(lst[1])
28734408
>>> id(lst[2])
28734408
>>>

This behavior is very often undesirable and can lead to bugs in the code.
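For example, appending to one of the inner lists above appears to change all three, because they are all the same object:

>>> lst[0].append(1)
>>> lst
[[1], [1], [1]]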

If you have mutable items (such as lists), then you should use the still very fast list comprehension:

>>> lst = [[] for _ in xrange(3)]
>>> lst
[[], [], []]
>>> # The ids of the items in `lst` are different
>>> id(lst[0])
28796688
>>> id(lst[1])
28796648
>>> id(lst[2])
28736168
>>>

*Note: In all of the tests, I replaced range with xrange. Since the latter produces its values lazily instead of building a full list, it should generally be faster than the former.

How can I make multiple empty lists in Python?

A list comprehension is easiest here:

>>> n = 5
>>> lists = [[] for _ in range(n)]
>>> lists
[[], [], [], [], []]

Be wary of falling into this trap:

>>> lists = [[]] * 5
>>> lists
[[], [], [], [], []]
>>> lists[0].append(1)
>>> lists
[[1], [1], [1], [1], [1]]

How to create a list of empty lists

To manually create a small, fixed number of lists, writing them out is fine:

empty_list = [[], [], ...]

If you want to generate a larger number of lists, use a list comprehension instead:

empty_lists = [[] for _ in range(n)]

Fastest way to compare list of lists against list of sets

Original Answer

I won't go into much detail, since the other answer is already very complete, but I got even better performance by using set difference.

Let:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

For comparison, here is JJ Hassan's second example run on my machine (note that, for the timing, I used the not in from the original question):

>>> filtered_list_of_lists_of_tuples = [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]
>>> filtered_list_of_lists_of_tuples
[[(1, 2, 3), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r7 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
930 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now, using the difference between sets:

>>> filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
>>> filtered_list_of_sets_of_tuples
[{(4, 6, 5)}, set()]
>>> %timeit -n1000000 -r7 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
864 ns ± 63.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Try this option as well, as I believe that the difference in speed may become more significant with bigger lists.

The idea behind this is that you have a superlist (a list of lists), where each inner list contains sublists, in this case tuples. But, per your requirements, the intermediate lists do not need to preserve order (only the superlist and the sublists do), and we want to keep the elements that are not found in set_of_tuples. Consequently, the intermediate lists can be treated as sets, and taking the elements that do not belong to set_of_tuples is simply a set difference.

Edit

I just came up with a slightly faster solution by using operator and itertools. This new solution, however, pays off only when we have enough data.

Let us start with the previous solution:

filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

Now, by a simple application of map, this becomes:

filtered_list_of_sets_of_tuples = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]

Then we can use operator.sub to rewrite this as:

from operator import sub
filtered_list_of_sets_of_tuples = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]

or, passing a generator expression to list() instead:

from operator import sub
filtered_list_of_sets_of_tuples = list(sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples))

Finally, we use map once more, this time bringing itertools.repeat to the game:

from itertools import repeat
from operator import sub

filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))

This new method is actually the slowest for small lists:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
903 ns ± 168 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
789 ns ± 70 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
1.28 µs ± 299 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)

But now let us define bigger lists. I used approximately the sizes you mentioned in a comment:

>>> from random import randint
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
>>> set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}

Using this new data, here are the results I got on my machine:

>>> %timeit -n10000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
65 µs ± 7.05 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
58.1 µs ± 6.67 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
54.1 µs ± 5.34 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)

Fastest way to iterate over multiple lists

Some options that you could consider for this task are:

  • using Python multiprocessing, which lets you run the tasks in parallel (see the sketch after this list): https://docs.python.org/3/library/multiprocessing.html
  • using Dask dataframes: https://www.dask.org/ It is a very easy-to-use library with very good performance.
  • using Polars dataframes: https://www.pola.rs/ Polars has fewer features than Dask and sparser documentation, but should offer better performance.
  • using PySpark to run your task on a cluster: https://spark.apache.org/docs/latest/api/python/
  • using a faster programming language, for example Rust, C, or C++.
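As a minimal sketch of the multiprocessing option, assuming the work is a CPU-bound function applied to each list independently (process and lists here are made-up names, not from the original question):

from multiprocessing import Pool

def process(items):
    # Hypothetical per-list work; replace with your actual computation.
    return sum(x * x for x in items)

if __name__ == "__main__":
    lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Each inner list is handled by a worker process from the pool.
    with Pool() as pool:
        results = pool.map(process, lists)
    print(results)  # [14, 77, 194]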

Efficient way to create a list from a class variable (multiple lists)

The first one is better:

  • the complexity of the first one is O(n): a single pass over the data
  • the complexity of the second one is O(3n): three passes, which is still O(n) asymptotically but roughly three times the work (both patterns are illustrated below)
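The original code isn't shown here, but a hedged illustration of the two patterns (the Point class and attribute names are assumptions) would be:

class Point:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

points = [Point(i, i + 1, i + 2) for i in range(5)]

# First approach: a single pass over the data, O(n).
xs, ys, zs = [], [], []
for p in points:
    xs.append(p.x)
    ys.append(p.y)
    zs.append(p.z)

# Second approach: three separate passes, about 3n operations.
xs = [p.x for p in points]
ys = [p.y for p in points]
zs = [p.z for p in points]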

Create list of single item repeated N times

You can also write:

[e] * n

Note that if e is, for example, an empty list, you get a list with n references to the same list, not n independent empty lists.

Performance testing

At first glance it seems that repeat is the fastest way to create a list with n identical elements:

>>> timeit.timeit('itertools.repeat(0, 10)', 'import itertools', number = 1000000)
0.37095273281943264
>>> timeit.timeit('[0] * 10', 'import itertools', number = 1000000)
0.5577236771712819

But wait - it's not a fair test...

>>> itertools.repeat(0, 10)
repeat(0, 10) # Not a list!!!

The function itertools.repeat doesn't actually create the list; it just creates an object that can be used to create a list if you wish! Let's try that again, but this time converting to a list:

>>> timeit.timeit('list(itertools.repeat(0, 10))', 'import itertools', number = 1000000)
1.7508119747063233

So if you want a list, use [e] * n. If you want to generate the elements lazily, use repeat.
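For instance, a small sketch of consuming repeat lazily, without ever building the full list in memory:

from itertools import repeat

# repeat yields the same object n times; sum consumes it one element at a time.
total = sum(repeat(5, 3))  # 15

for dash in repeat("-", 3):
    print(dash, end="")    # prints ---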

Python: Fastest way to find all elements in one large list but not in another

I really like set operations here, where you can do:

set(list2) - set(list1)

Putting list items in a set removes all duplicates and ordering. Set operations then let us remove one set of items from another, with just the - operator.

If the lists are enormous, NumPy is a bit faster:

import numpy as np
np.setdiff1d(list2, list1)
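A minimal self-contained sketch of both approaches, with made-up sample data:

import numpy as np

list1 = [1, 2, 3, 4]
list2 = [3, 4, 5, 6, 6]

# Pure-Python set difference: unique elements of list2 not in list1.
print(set(list2) - set(list1))     # {5, 6}

# NumPy equivalent: a sorted array of unique values in the first
# argument that are not present in the second.
print(np.setdiff1d(list2, list1))  # [5 6]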

