How do I use itertools.groupby()?
IMPORTANT NOTE: You have to sort your data first.
The part I didn't get is that in the example construction
groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))    # Store group iterator as a list
    uniquekeys.append(k)
k is the current grouping key, and g is an iterator that you can use to iterate over the group defined by that grouping key. In other words, the groupby iterator itself returns iterators.
Here's an example of that, using clearer variable names:
from itertools import groupby
things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print("")
This will give you the output:
A bear is a animal.
A duck is a animal.

A cactus is a plant.

A speed boat is a vehicle.
A school bus is a vehicle.
In this example, things is a list of tuples where the first item in each tuple is the group the second item belongs to.
The groupby() function takes two arguments: (1) the data to group and (2) the function to group it with.
Here, lambda x: x[0] tells groupby() to use the first item in each tuple as the grouping key.
In the above for statement, groupby returns three (key, group iterator) pairs - one for each unique key. You can use the returned iterator to iterate over each individual item in that group.
Here's a slightly different example with the same data, using a list comprehension:
for key, group in groupby(things, lambda x: x[0]):
    listOfThings = " and ".join([thing[1] for thing in group])
    print(key + "s: " + listOfThings + ".")
This will give you the output:
animals: bear and duck.
plants: cactus.
vehicles: speed boat and school bus.
What is itertools.groupby() used for?
To start with, you may read the documentation here.
I will place what I consider to be the most important point first. I hope the reason will become clear after the examples.
ALWAYS SORT ITEMS WITH THE SAME KEY TO BE USED FOR GROUPING SO AS TO AVOID UNEXPECTED RESULTS
itertools.groupby(iterable, key=None or some func) takes an iterable and groups its consecutive items based on a specified key. The key function is applied to each individual item, and its result is used as the heading for each group; consecutive items which end up having the same 'key' value end up in the same group.
The return value is an iterator of (key, group) pairs. It resembles a dictionary's items in that each key is paired with a value, but here the value is an iterator over the items in that group.
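To make the shape of the return value concrete, here is a minimal sketch. The input string is made up for illustration; it only shows that groupby() hands back a one-shot iterator of (key, group) pairs rather than a dictionary.

```python
from itertools import groupby

letters = "aaabbc"  # made-up sample input, already grouped by value

grouped = groupby(letters)

# groupby() returns an iterator (it is its own iterator), not a dict.
is_own_iterator = iter(grouped) is grouped

# Each item is a (key, group) pair; turn each group into a list as it arrives.
pairs = [(key, list(group)) for key, group in grouped]
print(pairs)

# The iterator is exhausted after one pass, so a second pass finds nothing.
leftover = list(grouped)
print(leftover)
```

If the pairs are needed more than once, they have to be stored (as in the list above) during the first pass.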
Example 1
# note here that the tuple counts as one item in this list. I did not
# specify any key, so each item in the list is a key on its own.
c = groupby(['goat', 'dog', 'cow', 1, 1, 2, 3, 11, 10, ('persons', 'man', 'woman')])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic
results in
{1: [1, 1],
'goat': ['goat'],
3: [3],
'cow': ['cow'],
('persons', 'man', 'woman'): [('persons', 'man', 'woman')],
10: [10],
11: [11],
2: [2],
'dog': ['dog']}
Example 2
# notice here that mulato, cow and cat don't show up. Only the last group with a certain key
# survives, because storing it in the dict replaces the earlier result for that key.
# The last group for c (camel) actually wipes out two earlier items (cow and cat).
list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'), \
'wombat', 'mongoose', 'malloo', 'camel']
c = groupby(list_things, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic
results in
{'c': ['camel'],
'd': ['dog', 'donkey'],
'g': ['goat'],
'm': ['mongoose', 'malloo'],
'persons': [('persons', 'man', 'woman')],
'w': ['wombat']}
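To see why entries go missing, it may help to look at the raw (key, group) pairs before they are stored in a dict. The sketch below reuses list_things from Example 2 and keeps the pairs in a list, so the repeated 'm' and 'c' keys stay visible.

```python
from itertools import groupby

list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']

# Collect the pairs in a list instead of a dict so nothing gets overwritten.
pairs = [(k, list(v)) for k, v in groupby(list_things, key=lambda x: x[0])]
for k, v in pairs:
    print(k, v)

# The keys 'm' and 'c' each appear twice because their items are not adjacent.
keys = [k for k, _ in pairs]
print(keys)
```

Assigning these pairs to dic[k] one after another is what makes the later 'm' and 'c' groups replace the earlier ones.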
Now for the sorted version
# but observe the sorted version where I have the data sorted first on same key I used for grouping
list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'), \
'wombat', 'mongoose', 'malloo', 'camel']
sorted_list = sorted(list_things, key = lambda x: x[0])
print(sorted_list)
print()
c = groupby(sorted_list, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic
results in
['cow', 'cat', 'camel', 'dog', 'donkey', 'goat', 'mulato', 'mongoose', 'malloo', ('persons', 'man', 'woman'), 'wombat']
{'c': ['cow', 'cat', 'camel'],
'd': ['dog', 'donkey'],
'g': ['goat'],
'm': ['mulato', 'mongoose', 'malloo'],
'persons': [('persons', 'man', 'woman')],
'w': ['wombat']}
Example 3
things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "harley"), \
("vehicle", "speed boat"), ("vehicle", "school bus")]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic
results in
{'animal': [('animal', 'bear'), ('animal', 'duck')],
'plant': [('plant', 'cactus')],
'vehicle': [('vehicle', 'harley'),
('vehicle', 'speed boat'),
('vehicle', 'school bus')]}
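Now the same thing with input that is not already grouped (harley comes before cactus). I also changed the tuples to lists here. Because the data is sorted before grouping, the results are the same either way.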
things = [["animal", "bear"], ["animal", "duck"], ["vehicle", "harley"], ["plant", "cactus"], \
["vehicle", "speed boat"], ["vehicle", "school bus"]]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic
results in
{'animal': [['animal', 'bear'], ['animal', 'duck']],
'plant': [['plant', 'cactus']],
'vehicle': [['vehicle', 'harley'],
['vehicle', 'speed boat'],
['vehicle', 'school bus']]}
How to use python groupby()
Grouping input by common key elements with groupby() only works on input already sorted by that key:
[...] Generally, the iterable needs to already be sorted on the same key function.
Your example should work like this:
from itertools import groupby
a = sorted([1, 2, 1, 3, 2, 1, 2, 3, 4, 5])
for key, value in groupby(a):
    print((len(list(value)), key), end=' ')
If you use groupby() on unordered input you'll get a new group every time a different key is returned by the key function while iterating through the iterable.
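As a rough illustration of that difference, the sketch below counts runs in the same numbers with and without sorting. The helper name run_lengths is made up for this example.

```python
from itertools import groupby

a = [1, 2, 1, 3, 2, 1, 2, 3, 4, 5]

def run_lengths(values):
    """Return (count, key) for each run of consecutive equal values."""
    return [(len(list(group)), key) for key, group in groupby(values)]

# Unsorted: no two equal neighbours, so every element forms its own group.
unsorted_runs = run_lengths(a)
print(unsorted_runs)

# Sorted: equal values are adjacent, so each value is counted once.
sorted_runs = run_lengths(sorted(a))
print(sorted_runs)
```

The unsorted behaviour is not a bug; it is what makes groupby() useful for run-length style processing, but it is rarely what you want when counting occurrences.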
Python - itertools.groupby
That isn't how itertools.groupby works. From the manual:
It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function)
So to achieve the type of grouping you want, you need to sort my_list first:
import itertools
my_list = [
    {'name': 'stock1', 'price': 200, 'shares': 100},
    {'name': 'stock2', 'price': 1.2, 'shares': 1000},
    {'name': 'stock3', 'price': 0.05, 'shares': 200000},
    {'name': 'stock1', 'price': 200.2, 'shares': 50}
]
my_list.sort(key=lambda x: x['name'])
by_name = {name: list(items) for name, items in itertools.groupby(
    my_list, key=lambda x: x['name'])}
print(by_name)
Output
{'stock1': [{'name': 'stock1', 'price': 200, 'shares': 100},
{'name': 'stock1', 'price': 200.2, 'shares': 50}],
'stock2': [{'name': 'stock2', 'price': 1.2, 'shares': 1000}],
'stock3': [{'name': 'stock3', 'price': 0.05, 'shares': 200000}]
}
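Once the data is sorted, each group can also be reduced to a single value instead of being stored as a list. The following sketch sums the shares per stock name; the aggregation and the use of operator.itemgetter in place of the lambda are additions for illustration, not part of the original answer.

```python
from itertools import groupby
from operator import itemgetter

my_list = [
    {'name': 'stock1', 'price': 200, 'shares': 100},
    {'name': 'stock2', 'price': 1.2, 'shares': 1000},
    {'name': 'stock3', 'price': 0.05, 'shares': 200000},
    {'name': 'stock1', 'price': 200.2, 'shares': 50},
]

# Use the same key function for sorting and for grouping.
by_name_key = itemgetter('name')

total_shares = {
    name: sum(item['shares'] for item in items)
    for name, items in groupby(sorted(my_list, key=by_name_key), key=by_name_key)
}
print(total_shares)
```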
python itertools groupby with filter usage
You can use itertools.groupby to collect runs of equal elements that are greater than 6 and have a length greater than 1. All other elements remain ungrouped.
If we want groups as standalone lists, we can use append. If we want groups flattened, we can use extend.
from itertools import groupby
lst = [1, 2, 3, 3, 6, 8, 8, 10, 2, 5, 7, 7]
result = []
for k, g in groupby(lst):
    group = list(g)
    if k > 6 and len(group) > 1:
        result.append(group)
    else:
        result.extend(group)
print(result)
Output:
[1, 2, 3, 3, 6, [8, 8], 10, 2, 5, [7, 7]]
How to specify types for itertools groupby?
Referring to this post, I noticed that the return type of groupby is:
Iterable[Tuple[<key_type>, Iterable[<item_type>]]]
So, the type in my case would be:
group_iter: Iterator[Tuple[int, Iterable[str]]] = groupby(sorted(names,key=len), lambda x: len(x))
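For reference, a self-contained version of that annotation might look like the sketch below. The names list is made up for illustration, and the annotations are only checked by a static type checker, not at runtime.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Tuple

names: List[str] = ["bob", "alice", "eve", "carol"]  # hypothetical sample data

# Keys are ints (the lengths), groups are iterables of str.
group_iter: Iterator[Tuple[int, Iterable[str]]] = groupby(sorted(names, key=len), key=len)

# Materialise the groups so they can be reused after iteration.
by_length: Dict[int, List[str]] = {length: list(group) for length, group in group_iter}
print(by_length)
```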
How to use itertools.groupby with a true/false lambda function
itertools.groupby() will return an alternating sequence of countries and cities. When it returns a country, you save the country. When it returns cities, you add an entry to the dictionary with the saved country.
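import itertools

# filtered is the list of lines from the question
result = {}
for is_country, values in itertools.groupby(filtered, key=lambda line: line.endswith("[country]")):
    if is_country:
        country = next(values)  # remember the country for the cities that follow
    else:
        result[country] = list(values)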
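Here is a runnable sketch of the same loop with made-up input, since the question's data is not shown. It assumes every run of cities is preceded by exactly one country line; two country lines in a row would land in the same group, and only the first would be kept.

```python
import itertools

# Hypothetical sample data in the "[country]" marker format.
filtered = [
    "France[country]", "Paris", "Lyon",
    "Spain[country]", "Madrid",
]

result = {}
country = None  # set when a country group is seen
for is_country, values in itertools.groupby(
        filtered, key=lambda line: line.endswith("[country]")):
    if is_country:
        country = next(values)           # the single country line of this group
    else:
        result[country] = list(values)   # all cities up to the next country
print(result)
```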
Itertools groupby to group list using another list
For the general case of grouping one iterable based on the matching value in another iterable, you can just make a cheaty key function that iterates the other iterable, e.g. using your original s and g:
>>> from itertools import groupby
>>> print([(k, len(list(grp))) for k, grp in groupby(s, key=lambda _, ig=iter(g): next(ig))])
[(0.0, 3), (2.0, 3), (4.0, 2)]
The key function accepts the value from s and ignores it, instead returning the matching value from iterating g manually (the defaulted second argument caches an iterator created from g, then next is used to manually advance it each time; pass a second argument to next to silently ignore mismatched lengths and simply substitute in a default value).
Obviously, for this specific case there are better approaches, but I'm answering the general question asked, not the specific example.
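The same trick can be wrapped in a small helper. In this sketch the function name group_by_other and the contents of s are made up (the question's s is not shown), while g is chosen to match the counts in the output above. It relies on groupby calling the key function once per element, in order, as CPython does.

```python
from itertools import groupby

s = ["a", "b", "c", "d", "e", "f", "g", "h"]   # hypothetical values to group
g = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 4.0, 4.0]   # labels, one per value in s

def group_by_other(values, labels, default=None):
    """Group values by the label found at the same position in labels."""
    label_iter = iter(labels)
    # The key function ignores the value and pulls the next label instead;
    # default is used if labels runs out before values does.
    return [(k, list(grp))
            for k, grp in groupby(values, key=lambda _: next(label_iter, default))]

grouped = group_by_other(s, g)
print(grouped)
```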
Python itertools groupby
The functions groupby and takewhile aren't good fits for this sort of problem.
groupby
groupby groups based on a key function. That means you need to keep the last encountered non-whitespace first tuple element to make it work, so you have to keep some global state around. By keeping such a state the function is said to be "impure", while most (or even all) itertools are pure functions.
from itertools import groupby, chain
d = [('FRG', 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '),
(' ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'),
(' ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'),
('FRG2', 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '),
(' ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4')]
def keyfunc(item):
    first = item[0]
    if first.strip():
        keyfunc.state = first
    return keyfunc.state
{k: [item for idx, item in enumerate(chain.from_iterable(grp)) if idx%3 != 0] for k, grp in groupby(d, keyfunc)}
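If the state has to be kept anyway, one way to avoid storing it on the function object is a closure that creates a fresh key function per groupby call. This is only a sketch of that variation; make_carry_forward_key and the shortened rows are made up for illustration.

```python
from itertools import groupby

def make_carry_forward_key():
    """Build a key function that repeats the last non-blank first element."""
    last_seen = None

    def key(item):
        nonlocal last_seen
        if item[0].strip():      # a non-whitespace first element starts a new group
            last_seen = item[0]
        return last_seen

    return key

# Shortened, hypothetical rows in the same shape as d.
rows = [('FRG', 'x1'), (' ', 'x2'), ('FRG2', 'y1'), (' ', 'y2'), (' ', 'y3')]

grouped = {k: [row[1] for row in grp]
           for k, grp in groupby(rows, make_carry_forward_key())}
print(grouped)
```

The key function is still impure, but the state is no longer shared between separate groupby calls.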
takewhile
takewhile needs to look ahead to determine when to stop yielding values. That means it will automatically pop one value more from the iterator than actually used for each group. To actually apply it you would need to remember the last position and then create a new iterator each time. It also has the problem that you would need to keep some sort of state, because you want to take one element with a non-space first element and then the ones that have a space-only first element.
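The look-ahead problem is easy to reproduce in isolation. In this small sketch (with made-up numbers) the element that ends the takewhile run is consumed from the shared iterator and never seen again.

```python
from itertools import takewhile

it = iter([1, 2, 3, 10, 4, 5])  # made-up sample data

# takewhile has to read 10 to learn that the run of small numbers is over.
small = list(takewhile(lambda x: x < 5, it))
print(small)

# 10 was consumed by takewhile, so it is missing from what is left.
rest = list(it)
print(rest)
```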
One approach could look like this (but feels unnecessarily complicated):
from itertools import takewhile, islice
def takegen(inp):
    idx = 0
    length = len(inp)
    while idx < length:
        first, *rest = inp[idx]
        rest = list(rest)
        for _, *lasts in takewhile(lambda x: not x[0].strip(), islice(inp, idx+1, None)):
            rest.extend(lasts)
        idx += len(rest) // 2
        yield first, rest
dict(takegen(d))
Alternative
You could simply create your own generator to make this quite easy. It's a variation of the takewhile approach but it doesn't need external state, islice, takewhile, groupby, or manual tracking of the index:
def gen(inp):
    # Initial values
    last = None
    for first, *rest in inp:
        if last is None:       # first encountered item
            last = first
            l = list(rest)
        elif first.strip():    # when the first tuple item isn't all whitespaces
            # Yield the last "group"
            yield last, l
            # New values for the next "group"
            last = first
            l = list(rest)
        else:                  # when the first tuple item is all whitespaces
            l.extend(rest)
    # Yield the last group
    yield last, l
dict(gen(d))
# {'FRG2': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'],
# 'FRG': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4']}
itertools groupby object not outputting correctly
Per the docs, it is explicitly noted that advancing the groupby object renders the previous group unusable (in practice, empty):
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.
Basically, instead of list-ifying directly with the list constructor, you'd need a listcomp that converts from group iterators to lists before advancing the groupby object, replacing:
group_list = list(itertools.groupby(nums, key=lambda x: x>=0))
with:
group_list = [(k, list(g)) for k, g in itertools.groupby(nums, key=lambda x: x>=0)]
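A short sketch comparing the two forms, with made-up numbers standing in for nums. In recent CPython versions the groups from the first form come back empty; the exact leftovers may differ between versions, but in either case data is lost.

```python
import itertools

nums = [1, 2, -1, -2, 3]  # made-up sample data

def is_non_negative(x):
    return x >= 0

# Problematic: list() advances the groupby object past every group
# before any of the group iterators has been read.
stale = list(itertools.groupby(nums, key=is_non_negative))
stale_groups = [(k, list(g)) for k, g in stale]
print(stale_groups)

# Fix: turn each group into a list before moving on to the next one.
group_list = [(k, list(g)) for k, g in itertools.groupby(nums, key=is_non_negative)]
print(group_list)
```

The keys survive in both versions; only the group contents depend on when the group iterators are consumed.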
The design of most itertools module types is intended to avoid storing data implicitly, because they're intended to be used with potentially huge inputs. If all the groupers stored copies of all the data from the input (and the groupby object had to be sure to retroactively populate them), it would get ugly, and potentially blow memory by accident. By forcing you to make storing the values explicit, you don't accidentally store unbounded amounts of data unintentionally, per the Zen of Python:
Explicit is better than implicit.