In Python, What Is the Fastest Algorithm for Removing Duplicates from a List So That All Elements Are Unique *While Preserving Order*

How do I remove duplicates from a list, while preserving order?

Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark

Fastest one:

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.
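
To see how much the local binding actually buys you, a rough timeit comparison along these lines can be run (the function names and test data here are just illustrative, and exact numbers will vary by machine and Python version):

# Rough micro-benchmark of the local-binding trick; treat the numbers as illustrative only.
from timeit import timeit

def without_binding(seq):
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]

def with_binding(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

data = list(range(10000)) * 10

print(timeit(lambda: without_binding(data), number=100))
print(timeit(lambda: with_binding(data), number=100))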

If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/

O(1) insertion, deletion and member-check per operation.
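
If you would rather not pull in the full recipe, a minimal order-preserving set can be sketched on top of a plain dict (insertion-ordered since CPython 3.6 / Python 3.7). This is my own illustration, not the recipe's API:

# Minimal order-preserving set sketch; not the ActiveState recipe's implementation.
class SimpleOrderedSet:
    def __init__(self, iterable=()):
        self._items = dict.fromkeys(iterable)

    def add(self, item):           # average O(1)
        self._items[item] = None

    def discard(self, item):       # average O(1)
        self._items.pop(item, None)

    def __contains__(self, item):  # average O(1)
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

print(list(SimpleOrderedSet(['Herb', 'Alec', 'Herb', 'Don'])))  # ['Herb', 'Alec', 'Don']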

(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
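
A quick interactive check of that behaviour:

>>> seen = set()
>>> seen_add = seen.add
>>> print(seen_add(1))              # add() always returns None
None
>>> 1 in seen or seen_add(1)        # already seen: short-circuits, no second add
True
>>> not (2 in seen or seen_add(2))  # not seen yet: add runs, whole test is True
True
>>> seen
{1, 2}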

One-liner to remove duplicates, keep ordering of list

You could use an OrderedDict, but I suggest sticking with your for-loop.

>>> from collections import OrderedDict
>>> data = ['Herb', 'Alec', 'Herb', 'Don']
>>> list(OrderedDict.fromkeys(data))
['Herb', 'Alec', 'Don']

Just to reiterate: I seriously suggest sticking with your for-loop approach, using a set to keep track of already-seen items:

>>> data = ['Herb', 'Alec', 'Herb', 'Don']
>>> seen = set()
>>> unique_data = []
>>> for x in data:
...     if x not in seen:
...         unique_data.append(x)
...         seen.add(x)
...
>>> unique_data
['Herb', 'Alec', 'Don']

And in case you just want to be wacky (seriously don't do this):

>>> [t[0] for t in sorted(dict(zip(reversed(data), range(len(data), -1, -1))).items(), key=lambda t:t[1])]
['Herb', 'Alec', 'Don']

Remove duplicates from list (algorithm speed)

An extra variable isn't going to be as slow as you might imagine.

In fact, the key is here:

if e not in results:

Because results is a plain list, this membership test has to scan it linearly; when the element is not present, the whole list is traversed. That means the time to run e not in results grows with the list: it is trivial for a list of 10 elements but expensive for a list of 100,000.

With remove_adjacent_duplicates, you're only looking at the last item, so this comparison takes a constant amount of time and does not vary by list length.
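
For context, the two functions being compared presumably look something like the sketch below; remove_adjacent_duplicates is the name used above, and the bodies are my reconstruction rather than the original code:

def remove_duplicates(seq):
    results = []
    for e in seq:
        if e not in results:        # scans the whole results list: O(len(results))
            results.append(e)
    return results

def remove_adjacent_duplicates(seq):
    results = []
    for e in seq:
        if not results or results[-1] != e:   # only compares against the last item: O(1)
            results.append(e)
    return results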

Removing duplicates in lists

The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do:

>>> t = [1, 2, 3, 1, 2, 3, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.

Maintaining order

If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on OrderedDict to keep the order of keys during insertion:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):

>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Note that this may have some overhead of creating a dictionary first, and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with. Check out this question for more details and alternative ways to preserve the order when removing duplicates.


Finally note that both the set as well as the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to use a slow approach in which you will basically have to compare every item with every other item in a nested loop.
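
A minimal sketch of that quadratic fallback for unhashable items (for example, a list of lists) could look like this:

def unique_unhashable(items):
    """Order-preserving dedup that relies on == only, so it works for unhashable items."""
    result = []
    for item in items:
        if item not in result:   # linear scan of result, O(n**2) overall
            result.append(item)
    return result

print(unique_unhashable([[1, 2], [3], [1, 2], [3], [4]]))
# [[1, 2], [3], [4]]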

How do you remove duplicates from a list in Python whilst preserving order and length?

Edit: reversed the logic to make the meaning clearer.

Another alternative would be to do something like this:

seen = dict()
seen_setdefault = seen.setdefault
new_row = ["" if cell in seen else seen_setdefault(cell, cell) for cell in row]

To give an example:

>>> row = ["to", "be", "or", "not", "to", "be"]
>>> seen = dict()
>>> seen_setdefault = seen.setdefault
>>> new_row = ["" if cell in seen else seen_setdefault(cell, cell) for cell in row]
>>> new_row
['to', 'be', 'or', 'not', '', '']

Edit 2: Out of curiosity I ran a quick test to see which approach was fastest:

>>> from random import randint
>>> from statistics import mean
>>> from timeit import repeat
>>>
>>> def standard(seq):
...     """Trivial modification to standard method for removing duplicates."""
...     seen = set()
...     seen_add = seen.add
...     return ["" if x in seen or seen_add(x) else x for x in seq]
...
>>> def dedup(seq):
...     seen = set()
...     for v in seq:
...         yield '' if v in seen else v
...         seen.add(v)
...
>>> def pedro(seq):
...     """Pedro's iterator based approach to removing duplicates."""
...     my_dedup = dedup
...     return [x for x in my_dedup(seq)]
...
>>> def srgerg(seq):
...     """Srgerg's dict based approach to removing duplicates."""
...     seen = dict()
...     seen_setdefault = seen.setdefault
...     return ["" if cell in seen else seen_setdefault(cell, cell) for cell in seq]
...
>>> data = [randint(0, 10000) for x in range(100000)]
>>>
>>> mean(repeat("standard(data)", "from __main__ import data, standard", number=100))
1.2130275770426708
>>> mean(repeat("pedro(data)", "from __main__ import data, pedro", number=100))
3.1519048346103555
>>> mean(repeat("srgerg(data)", "from __main__ import data, srgerg", number=100))
1.2611971098676882

As can be seen from the results, making a relatively simple modification to the standard approach described in this other stack-overflow question is fastest.

Deleting repeats in a list python

Please see the Python documentation for three ways to accomplish this. The following is copied from that site. Replace the example 'mylist' with your variable name ('list').

First Example: If you don’t mind reordering the list, sort it and then scan from the end of the list, deleting duplicates as you go:

if mylist:
    mylist.sort()
    last = mylist[-1]
    for i in range(len(mylist)-2, -1, -1):
        if last == mylist[i]:
            del mylist[i]
        else:
            last = mylist[i]

Second Example: If all elements of the list may be used as dictionary keys (i.e. they are all hashable) this is often faster:

d = {}
for x in mylist:
    d[x] = 1
mylist = list(d.keys())

Third Example: In Python 2.5 and later:

mylist = list(set(mylist))

