How to Use itertools.groupby()

How do I use itertools.groupby()?

IMPORTANT NOTE: You usually have to sort your data first, because groupby() only groups consecutive items that share a key.


The part I didn't get is that in the example construction

groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))    # Store group iterator as a list
    uniquekeys.append(k)

k is the current grouping key, and g is an iterator that you can use to iterate over the group defined by that grouping key. In other words, the groupby iterator itself returns iterators.
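To make the "iterators returning iterators" point concrete, here is a minimal sketch. The words list is made-up sample data, not part of the original question, and it is already ordered by first letter.

```python
from itertools import groupby

# Hypothetical sample data, already ordered by first letter
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

pairs = []
for key, group in groupby(words, key=lambda word: word[0]):
    # group is a one-shot iterator, not a list: iter(group) is the same object
    assert iter(group) is group
    # Materialize it while it is still the current group
    pairs.append((key, list(group)))

print(pairs)
# [('a', ['apple', 'avocado']), ('b', ['banana', 'blueberry']), ('c', ['cherry'])]
```

Calling list() on each group inside the loop is what turns the lazy group into something you can keep.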

Here's an example of that, using clearer variable names:

from itertools import groupby

things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]

for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print("")

This will give you the output:

A bear is a animal.
A duck is a animal.

A cactus is a plant.

A speed boat is a vehicle.
A school bus is a vehicle.

In this example, things is a list of tuples where the first item in each tuple is the group the second item belongs to.

The groupby() function takes two arguments: (1) the data to group and (2) the function to group it with.

Here, lambda x: x[0] tells groupby() to use the first item in each tuple as the grouping key.

In the for statement above, groupby() yields three (key, group iterator) pairs, one for each unique key. You can use each returned iterator to iterate over the individual items in that group.

Here's a slightly different example with the same data, using a list comprehension:

for key, group in groupby(things, lambda x: x[0]):
    listOfThings = " and ".join([thing[1] for thing in group])
    print(key + "s: " + listOfThings + ".")

This will give you the output:

animals: bear and duck.
plants: cactus.
vehicles: speed boat and school bus.

What is itertools.groupby() used for?

To start with, you may read the official documentation for itertools.groupby().

I will place what I consider to be the most important point first. I hope the reason will become clear after the examples.

ALWAYS SORT THE ITEMS BY THE SAME KEY YOU USE FOR GROUPING, SO AS TO AVOID UNEXPECTED RESULTS

itertools.groupby(iterable, key=None or some func)
takes an iterable and groups its consecutive items based on a specified key. The key is a function applied to each individual item, and its result is used as the heading for each group; consecutive items that end up with the same 'key' value end up in the same group. If no key is given, each item is its own key.

The return value is an iterator of (key, group) pairs, which makes it easy to build a dictionary of the form {key : value}.

Example 1

# note here that the tuple counts as one item in this list. I did not
# specify any key, so each item in the list is a key on its own.
c = groupby(['goat', 'dog', 'cow', 1, 1, 2, 3, 11, 10, ('persons', 'man', 'woman')])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

results in

{1: [1, 1],
'goat': ['goat'],
3: [3],
'cow': ['cow'],
('persons', 'man', 'woman'): [('persons', 'man', 'woman')],
10: [10],
11: [11],
2: [2],
'dog': ['dog']}

Example 2

# notice here that mulato, cow and cat don't show up. only the last group with a certain key shows up,
# because it replaces the earlier result in the dict. the last group for 'c' actually wipes out two previous results.

list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
c = groupby(list_things, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

results in

{'c': ['camel'],
'd': ['dog', 'donkey'],
'g': ['goat'],
'm': ['mongoose', 'malloo'],
'persons': [('persons', 'man', 'woman')],
'w': ['wombat']}

Now for the sorted version

# but observe the sorted version where I have the data sorted first on same key I used for grouping
list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
sorted_list = sorted(list_things, key=lambda x: x[0])
print(sorted_list)
print()
c = groupby(sorted_list, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

results in

['cow', 'cat', 'camel', 'dog', 'donkey', 'goat', 'mulato', 'mongoose', 'malloo', ('persons', 'man', 'woman'), 'wombat']
{'c': ['cow', 'cat', 'camel'],
'd': ['dog', 'donkey'],
'g': ['goat'],
'm': ['mulato', 'mongoose', 'malloo'],
'persons': [('persons', 'man', 'woman')],
'w': ['wombat']}

Example 3

things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "harley"),
          ("vehicle", "speed boat"), ("vehicle", "school bus")]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic

results in

{'animal': [('animal', 'bear'), ('animal', 'duck')],
'plant': [('plant', 'cactus')],
'vehicle': [('vehicle', 'harley'),
('vehicle', 'speed boat'),
('vehicle', 'school bus')]}

Now with input that is not already in key order. I also changed the tuples to lists here. Because the data is sorted before grouping, the results are the same either way.

things = [["animal", "bear"], ["animal", "duck"], ["vehicle", "harley"], ["plant", "cactus"],
          ["vehicle", "speed boat"], ["vehicle", "school bus"]]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic

results in

{'animal': [['animal', 'bear'], ['animal', 'duck']],
'plant': [['plant', 'cactus']],
'vehicle': [['vehicle', 'harley'],
['vehicle', 'speed boat'],
['vehicle', 'school bus']]}

How to use python groupby()

Grouping input by common key elements with groupby() only works on input already sorted by that key:

[...] Generally, the iterable needs to already be sorted on the same key function.

Your example should work like this:

from itertools import groupby

a = sorted([1, 2, 1, 3, 2, 1, 2, 3, 4, 5])

for key, value in groupby(a):
    print((len(list(value)), key), end=' ')

If you use groupby() on unordered input, you'll get a new group every time the key function returns a different key while iterating through the iterable.
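A quick side-by-side sketch of that difference, using the same numbers as above. The variable names runs and counts are only for illustration.

```python
from itertools import groupby

numbers = [1, 2, 1, 3, 2, 1, 2, 3, 4, 5]

# Unsorted input: a new group starts every time the value changes
runs = [(key, len(list(group))) for key, group in groupby(numbers)]
print(runs)
# [(1, 1), (2, 1), (1, 1), (3, 1), (2, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]

# Sorted input: equal values are adjacent, so each key appears exactly once
counts = [(key, len(list(group))) for key, group in groupby(sorted(numbers))]
print(counts)
# [(1, 3), (2, 3), (3, 2), (4, 1), (5, 1)]
```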

Python - itertools.groupby

That isn't how itertools.groupby works. From the manual:

It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function)

So to achieve the type of grouping you want, you need to sort my_list first:

import itertools

my_list = [
{'name': 'stock1', 'price': 200, 'shares': 100},
{'name': 'stock2', 'price': 1.2, 'shares': 1000},
{'name': 'stock3', 'price': 0.05, 'shares': 200000},
{'name': 'stock1', 'price': 200.2, 'shares': 50}
]

my_list.sort(key=lambda x:x['name'])

by_name = { name: list(items) for name, items in itertools.groupby(
my_list, key=lambda x: x['name'])}

print(by_name)

Output

{'stock1': [{'name': 'stock1', 'price': 200, 'shares': 100},
{'name': 'stock1', 'price': 200.2, 'shares': 50}],
'stock2': [{'name': 'stock2', 'price': 1.2, 'shares': 1000}],
'stock3': [{'name': 'stock3', 'price': 0.05, 'shares': 200000}]
}
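If you only need an aggregate per group rather than the grouped items, the same sort-then-group pattern applies. The following is a sketch only: the total_shares name and the use of operator.itemgetter are my own choices for illustration, not something from the question.

```python
from itertools import groupby
from operator import itemgetter

my_list = [
    {'name': 'stock1', 'price': 200, 'shares': 100},
    {'name': 'stock2', 'price': 1.2, 'shares': 1000},
    {'name': 'stock3', 'price': 0.05, 'shares': 200000},
    {'name': 'stock1', 'price': 200.2, 'shares': 50},
]

# Use one key function for both sorting and grouping so they cannot drift apart
by_name_key = itemgetter('name')

# total_shares is a hypothetical name; each group is consumed by sum() right away
total_shares = {
    name: sum(item['shares'] for item in items)
    for name, items in groupby(sorted(my_list, key=by_name_key), key=by_name_key)
}

print(total_shares)
# {'stock1': 150, 'stock2': 1000, 'stock3': 200000}
```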

python itertools groupby with filter usage

You can use itertools.groupby to group runs of equal consecutive elements, keeping a run as a group only if its value is greater than 6 and its length is greater than 1. All other elements remain ungrouped.

If we want groups as standalone lists, we can use append. If we want groups flattened, we can use extend.

from itertools import groupby

lst = [1, 2, 3, 3, 6, 8, 8, 10, 2, 5, 7, 7]

result = []
for k, g in groupby(lst):
    group = list(g)

    if k > 6 and len(group) > 1:
        result.append(group)
    else:
        result.extend(group)

print(result)

Output:

[1, 2, 3, 3, 6, [8, 8], 10, 2, 5, [7, 7]]

How to specify types for itertools groupby?

Referring to this post, I noticed that the return type of groupby is:

Iterator[Tuple[<key_type>, Iterable[<item_type>]]]

So, the type in my case would be:

group_iter: Iterator[Tuple[int, Iterable[str]]] = groupby(sorted(names, key=len), key=len)
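For completeness, here is how that annotation might look in a self-contained snippet. The names list is made up for the example, and I have not checked the annotation against every type checker, so treat it as a sketch.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical input; the original question's data is not shown here
names: List[str] = ["bob", "amy", "alice", "carol", "dan"]

# Sort and group with the same key function (len)
group_iter: Iterator[Tuple[int, Iterable[str]]] = groupby(sorted(names, key=len), key=len)

# Materialize each group while it is current
by_length: Dict[int, List[str]] = {length: list(group) for length, group in group_iter}

print(by_length)
# {3: ['bob', 'amy', 'dan'], 5: ['alice', 'carol']}
```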

How to use itertools.groupby with a true/false lambda function

itertools.groupby() will return an alternating sequence of countries and cities. When it returns a country, you save the country. When it returns cities, you add an entry to the dictionary with the saved country.

result = {}
for is_country, values in itertools.groupby(filtered, key=lambda line: line.endswith("[country]")):
    if is_country:
        country = next(values)
    else:
        result[country] = list(values)
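Here is a runnable version of the same idea. The contents of filtered are invented for illustration, since the original input is not shown; I am assuming country lines end with "[country]" and are followed by their cities.

```python
import itertools

# Hypothetical input: a country line followed by its city lines
filtered = [
    "Germany [country]",
    "Berlin",
    "Munich",
    "France [country]",
    "Paris",
]

result = {}
country = None
for is_country, values in itertools.groupby(filtered, key=lambda line: line.endswith("[country]")):
    if is_country:
        # A run of country lines; keep the first one
        country = next(values)
    else:
        # A run of city lines belonging to the saved country
        result[country] = list(values)

print(result)
# {'Germany [country]': ['Berlin', 'Munich'], 'France [country]': ['Paris']}
```

Note that if two country lines appear back to back, they form a single group and only the first is kept, so this sketch assumes every country has at least one city.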

Itertools groupby to group list using another list

For the general case of grouping one iterable based on the matching value in another iterable, you can just make a cheaty key function that iterates the other iterable, e.g. using your original s and g:

>>> from itertools import groupby
>>> print([(k, len(list(grp))) for k, grp in groupby(s, key=lambda _, ig=iter(g): next(ig))])
[(0.0, 3), (2.0, 3), (4.0, 2)]

The key function accepts the value from s and ignores it, instead returning the matching value from iterating g manually (the defaulted second argument caches an iterator created from g, then next is used to manually advance it each time; pass a second argument to next to silently ignore mismatched lengths and simply substitute in a default value).

Obviously, for this specific case there are better approaches, but I'm answering the general question asked, not the specific example.
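A self-contained sketch of the trick follows. The values of s and g here are placeholders I made up so that the output matches the one shown above; the original lists are not reproduced. It relies on the key function being called once per element, in order. The zip-based variant at the end is a plainly different technique that I am adding for comparison.

```python
from itertools import groupby

# Hypothetical data: s holds the items, g holds the group label for each item
s = ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i']
g = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 4.0, 4.0]

# The default argument ig is created once, when the lambda is defined,
# and advanced by one label each time the key function is called
counts = [(k, len(list(grp))) for k, grp in groupby(s, key=lambda _, ig=iter(g): next(ig))]
print(counts)
# [(0.0, 3), (2.0, 3), (4.0, 2)]

# Alternative without hidden state: pair each item with its label, group on the label
zipped_counts = [(k, len(list(grp))) for k, grp in groupby(zip(s, g), key=lambda pair: pair[1])]
print(zipped_counts)
# [(0.0, 3), (2.0, 3), (4.0, 2)]
```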

Python itertools groupby

The functions groupby and takewhile aren't good fits for this sort of problem.

groupby

groupby groups based on a key function. To make it work here, the key function has to remember the last first tuple element it saw that was not just whitespace, which means keeping some global state around. A function that keeps such state is said to be "impure", while most (or even all) itertools are pure functions.

from itertools import groupby, chain

d = [('FRG', 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '),
(' ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'),
(' ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'),
('FRG2', 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '),
(' ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4')]

def keyfunc(item):
    first = item[0]
    if first.strip():
        # Remember the last non-whitespace first element on the function itself
        keyfunc.state = first
    return keyfunc.state

{k: [item for idx, item in enumerate(chain.from_iterable(grp)) if idx%3 != 0] for k, grp in groupby(d, keyfunc)}

takewhile

takewhile needs to look ahead to determine when to stop yielding values. That means it will consume one more value from the iterator than it actually yields for each group. To apply it here you would need to remember the last position and create a new iterator each time. You would also need to keep some sort of state, because you want to take one element whose first item is not whitespace and then the elements whose first item is only whitespace.
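The "one value too many" behaviour is easy to see in isolation. This is a small sketch with made-up numbers, unrelated to the data above:

```python
from itertools import takewhile

# Hypothetical data; share one iterator so we can see what is left afterwards
it = iter([1, 2, 3, 10, 4, 5])

small = list(takewhile(lambda x: x < 5, it))
print(small)
# [1, 2, 3]

# takewhile had to read the 10 to find out it should stop, so the 10 is gone
remaining = list(it)
print(remaining)
# [4, 5]
```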

One approach could look like this (but feels unnecessarily complicated):

from itertools import takewhile, islice

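def takegen(inp):
    idx = 0
    length = len(inp)
    while idx < length:
        first, *rest = inp[idx]
        rest = list(rest)
        # Collect the following rows whose first element is only whitespace
        for _, *lasts in takewhile(lambda x: not x[0].strip(), islice(inp, idx+1, None)):
            rest.extend(lasts)
        # Each row contributes two elements to rest, so this skips the consumed rows
        idx += len(rest) // 2
        yield first, rest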

dict(takegen(d))

Alternative

You could simply write your own generator to make this quite easy. It's a variation of the takewhile approach, but it doesn't need external state, islice, takewhile, groupby, or index tracking:

def gen(inp):
    # Initial values
    last = None
    for first, *rest in inp:
        if last is None:       # first encountered item
            last = first
            l = list(rest)
        elif first.strip():    # when the first tuple item isn't all whitespaces
            # Yield the last "group"
            yield last, l
            # New values for the next "group"
            last = first
            l = list(rest)
        else:                  # when the first tuple item is all whitespaces
            l.extend(rest)
    # Yield the last group
    yield last, l

dict(gen(d))
# {'FRG2': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'],
# 'FRG': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4']}

itertools groupby object not outputting correctly

Per the docs, it is explicitly noted that advancing the groupby object renders the previous group unusable (in practice, empty):

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.

Basically, instead of list-ifying directly with the list constructor, you'd need a listcomp that converts from group iterators to lists before advancing the groupby object, replacing:

group_list = list(itertools.groupby(nums, key=lambda x: x>=0))

with:

group_list = [(k, list(g)) for k, g in itertools.groupby(nums, key=lambda x: x>=0)]
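To see the difference between the two forms, here is a small sketch. The nums list is made up for the example; the exact contents of a stale group may vary between Python versions, but on recent CPython versions the stale groups come back empty.

```python
import itertools

nums = [1, 2, -1, -2, 3]  # hypothetical input

# Advancing groupby fully before reading the groups: the stored groups are stale
lazy = list(itertools.groupby(nums, key=lambda x: x >= 0))
lazy_contents = [(k, list(g)) for k, g in lazy]
print(lazy_contents)

# Converting each group to a list before advancing keeps the data
eager = [(k, list(g)) for k, g in itertools.groupby(nums, key=lambda x: x >= 0)]
print(eager)
# [(True, [1, 2]), (False, [-1, -2]), (True, [3])]
```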

The design of most itertools module types is intended to avoid storing data implicitly, because they're intended to be used with potentially huge inputs. If all the groupers stored copies of all the data from the input (and the groupby object had to be sure to retroactively populate them), it would get ugly, and potentially blow memory by accident. By forcing you to make storing the values explicit, you don't accidentally store unbounded amounts of data unintentionally, per the Zen of Python:

Explicit is better than implicit.


