Removing Duplicates in Lists

Removing duplicates in lists

The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do:

>>> t = [1, 2, 3, 1, 2, 3, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.

Maintaining order

If order is important to you, then you will have to use a different mechanism. A very common solution is to rely on collections.OrderedDict, which remembers the order in which its keys were inserted:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):

>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Note that this approach has some overhead: it builds a dictionary first and then creates a list from it. If you don’t actually need to preserve the order, you’re often better off using a plain set, especially because sets give you many more operations (union, intersection, difference, and so on) to work with.
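
If you want to keep the order but skip the intermediate dictionary, one common alternative (just a sketch, reusing the t list from above) is to track the items already seen in a set while filtering:

>>> seen = set()
>>> [x for x in t if not (x in seen or seen.add(x))]
[1, 2, 3, 5, 6, 7, 8]

This works because seen.add(x) returns None: items already in seen are filtered out, and each new item is added to seen as a side effect while being kept.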


Finally, note that both the set and the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to fall back to a slow approach in which you basically compare every item with every other item in a nested loop.
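
A minimal sketch of that quadratic fallback (the function name is made up for illustration), assuming the items are unhashable, such as lists:

def unique_unhashable(items):
    """Remove duplicates without hashing; O(n**2), keeps first-seen order."""
    result = []
    for item in items:
        if item not in result:   # `in` compares with ==, so no hashing is required
            result.append(item)
    return result

# unique_unhashable([[1, 2], [3, 4], [1, 2]]) -> [[1, 2], [3, 4]]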

Python: Remove duplicate lists in list of lists

Try this:

list_ = [["A"], ["B"], ["A","B"], ["B","A"], ["A","B","C"], ["B", "A", "C"]]
l = list(map(list, set(map(tuple, map(set, list_)))))

Output:

[['A', 'B'], ['B'], ['A', 'B', 'C'], ['A']]

The process works like this:

  1. First, each sub-list is converted into a set, so ['A', 'B'] and ['B', 'A'] both become {'A', 'B'}.
  2. Each of those sets is then converted into a tuple, because sets are unhashable and cannot themselves be placed inside a set, while tuples can.
  3. Passing the tuples to set() keeps only the unique ones.
  4. Finally, each tuple in that set is converted back into a list.

This is equivalent to:

list_ = [['A'], ['B'], ['A', 'B'], ['B', 'A'], ['A', 'B', 'C'], ['B', 'A', 'C']]
l0 = [set(i) for i in list_]
# l0 = [{'A'}, {'B'}, {'A', 'B'}, {'A', 'B'}, {'A', 'B', 'C'}, {'A', 'B', 'C'}]
l1 = [tuple(i) for i in l0]
# l1 = [('A',), ('B',), ('A', 'B'), ('A', 'B'), ('A', 'B', 'C'), ('A', 'B', 'C')]
l2 = set(l1)
# l2 = {('A', 'B'), ('A',), ('B',), ('A', 'B', 'C')}
l = [list(i) for i in l2]
# l = [['A', 'B'], ['A'], ['B'], ['A', 'B', 'C']]

Removing duplicates from a list of lists

>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]

itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!-)

Edit: as I mention in a comment, normal optimization efforts focus on large inputs (the big-O approach) because that is so much easier that it offers good returns on effort. But sometimes (essentially for "tragically crucial bottlenecks" in deep inner loops of code that is pushing the boundaries of performance limits) one may need to go into much more detail: providing probability distributions, deciding which performance measure to optimize (maybe the upper bound or the 90th percentile matters more than the average or median, depending on the application), performing possibly heuristic checks at the start to pick different algorithms depending on the characteristics of the input data, and so forth.

Careful measurements of "point" performance (code A vs. code B for a specific input) are part of this extremely costly process, and the standard library module timeit helps here; it is easiest to drive from a shell prompt. For example, here is a short module that showcases the general approach to this problem; save it as nodup.py:

import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
    return map(list, set(map(tuple, k)))

def dosort(k, sorted=sorted, xrange=xrange, len=len):
    ks = sorted(k)
    return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby, list=list):
    ks = sorted(k)
    return [i for i, _ in itertools.groupby(ks)]

def donewk(k):
    newk = []
    for i in k:
        if i not in newk:
            newk.append(i)
    return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == '__main__':
    savek = list(k)
    for f in doset, dosort, dogroupby, donewk:
        resk = f(k)
        assert k == savek
        print '%10s %s' % (f.__name__, sorted(resk))

Note the sanity check (performed when you simply run python nodup.py) and the basic hoisting technique (making constant global names local to each function for speed) to put things on an equal footing. The module is written for Python 2 (hence xrange and the print statement), which is also the interpreter the timings below were taken with.

Now we can run checks on the tiny example list:

$ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
100000 loops, best of 3: 4.44 usec per loop

confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:

$ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
10000 loops, best of 3: 25 usec per loop

the quadratic approach isn't bad, but the sort and groupby ones are better. Etc, etc.

If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it's worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).
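
For example, a small dispatcher along these lines (the cutoff value and the function name are purely illustrative; you would tune them against your own measurements) could pick the quadratic scan for tiny inputs and the sort/groupby approach otherwise:

import itertools

def dedupe(k, small_cutoff=20):
    # hypothetical heuristic: below the cutoff the O(n**2) scan tends to win
    # thanks to its tiny constant factors; above it, sort + groupby wins
    if len(k) < small_cutoff:
        newk = []
        for i in k:
            if i not in newk:
                newk.append(i)
        return newk
    return [i for i, _ in itertools.groupby(sorted(k))]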

It's also well worth considering keeping a different representation for k -- why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program's performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed, might be faster overall, for example.
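
A sketch of that alternative representation (all names here are illustrative): keep the data as a set of tuples while the program runs, so duplicates never accumulate, and materialize a list of lists only when a consumer really needs one:

k_set = set()                        # canonical storage: a set of tuples

def add_row(row):
    k_set.add(tuple(row))            # duplicates are rejected automatically

def as_list_of_lists():
    return [list(t) for t in k_set]  # build the list-of-lists form only on demand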

Removing duplicates from a list of lists based on a comparison of an element of the inner lists

You can group the elements using a dict, always keeping the sublist with the smaller second element:

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = {}
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub
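
To get the filtered list back, take the dict's values; on Python 3.7+ (where dicts preserve insertion order) the example above gives:

print(list(d.values()))
# [[1, 2, 3], [2, 1, 3]]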

Also, you can sort on several criteria at once by returning a tuple from the key function, so you don't need to call sorted twice:

In [3]:  l = [[1,4,6,2],[2,2,4,6],[1,2,4,5]]
In [4]: sorted(l,key=lambda x: (-x[1],x[0]))
Out[4]: [[1, 4, 6, 2], [1, 2, 4, 5], [2, 2, 4, 6]]

If you need the insertion order of the keys to be preserved, you can use an OrderedDict:

from collections import OrderedDict

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = OrderedDict()
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub

That said, it is not clear how much this helps here, since you sort the data afterwards and will lose that order anyway.

What you may find very useful is sortedcontainers.SortedDict:

A SortedDict provides the same methods as a dict. Additionally, a SortedDict efficiently maintains its keys in sorted order. Consequently, the keys method will return the keys in sorted order, the popitem method will remove the item with the highest key, etc.

An optional key argument defines a callable that, like the key argument to Python’s sorted function, extracts a comparison key from each dict key. If no function is specified, the default compares the dict keys directly. The key argument must be provided as a positional argument and must come before all other arguments.

from sortedcontainers import SortedDict

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = SortedDict()
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub

print(list(d.values()))

It also has all the methods you might want, such as bisect, bisect_left, etc.
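
As a small illustration of the optional key argument described in the quote above (the keys and values here are placeholders):

from sortedcontainers import SortedDict

d = SortedDict(lambda x: -x)       # keep numeric keys in descending order
d.update({1: 'a', 3: 'c', 2: 'b'})
print(list(d.keys()))              # [3, 2, 1]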

How to remove duplicate lists in a list of list?

You could use a set:

b_set = set(map(tuple, a))   # need to convert the inner lists to tuples so they are hashable
b = list(map(list, b_set))   # now convert the tuples back into lists (maybe unnecessary?); the list() call matters on Python 3, where map() is lazy

Or, if you prefer list comprehensions/generators:

b_set = set(tuple(x) for x in a)
b = [ list(x) for x in b_set ]

Finally, if order is important, you can always sort b:

b.sort(key=lambda x: a.index(x))

Removing duplicates in lists within a list

lists_dedupe(Ls, Ds) :-
    maplist(list_to_set, Ls, Ds).

e.g. your example:

?- lists_dedupe([[x,x],[x,y],[y,y,z]], R).
R = [[x], [x, y], [y, z]]

Sets here are basically lists without duplicates, so this maps SWI-Prolog's list_to_set/2 over the input to get the output.


