Grouping/Clustering Numbers in Python

Grouping / clustering numbers in Python

There are many ways to do cluster analysis. One simple approach is to look at the gap size between successive data elements:

def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
    differ by no more than *maxgap*

    >>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
    [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

    >>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
    [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
    data = sorted(data)  # sort a copy so the caller's list is not mutated
    if not data:
        return []  # nothing to group
    groups = [[data[0]]]
    for x in data[1:]:
        # compare against the last element of the current group
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

if __name__ == '__main__':
    import doctest
    print(doctest.testmod())

Grouping / clustering a list of numbers so that the min-max gap of each subset is always less than a cutoff in Python

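Following @j_random_hacker's answer, I simply changed my code to:

def cluster(data, cutoff):
    data = sorted(data)  # sort a copy; leave the caller's list untouched
    res = []
    old_x = None  # smallest element of the current cluster
    for x in data:
        # start a new cluster once x lies more than cutoff above the cluster minimum
        # (each cluster then spans at most cutoff; use >= for a strictly smaller span)
        if old_x is None or x - old_x > cutoff:
            res.append([x])
            old_x = x
        else:
            res[-1].append(x)
    return res

Now it is working as expected: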

>>> print(all([(max(s) - min(s)) < cutoff for s in res]))
True
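To make the difference between the two criteria concrete, here is a minimal sketch that puts a gap-based grouping next to a span-based one. The function names `cluster_by_gap` and `cluster_by_span` are made up for this illustration and do not appear in the answers above.

```python
def cluster_by_gap(data, maxgap):
    """Group sorted values whose successive differences are <= maxgap."""
    groups = []
    for x in sorted(data):
        if groups and x - groups[-1][-1] <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups


def cluster_by_span(data, cutoff):
    """Group sorted values so that max - min of each group is <= cutoff."""
    groups = []
    for x in sorted(data):
        # compare against the first (smallest) element of the current group
        if groups and x - groups[-1][0] <= cutoff:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups


data = [1, 6, 11, 16]
print(cluster_by_gap(data, 5))   # [[1, 6, 11, 16]]
print(cluster_by_span(data, 5))  # [[1, 6], [11, 16]]
```

With evenly spaced values the gap-based version chains everything into one group whose total span (15) exceeds the limit, while the span-based version starts a new group as soon as the distance to the group minimum grows past the cutoff.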

Group data by a given Range

If you know the "range" around each cluster mean

A possible simple solution: round every value down to a multiple of your "range" parameter, then group the values that were rounded to the same multiple.

To group, you can use a combination of sorted and itertools.groupby, or more simply, you can use a dict of lists.

from collections import defaultdict

def clusters(data, r):
    groups = defaultdict(list)
    for x in data:
        groups[x // r].append(x)
    return groups

def means_of_clusters(data, r):
    return [sum(g) / len(g) for g in clusters(data, r).values()]

print(means_of_clusters([1.6, 1.7, 5.6, 5.7, 5.5], 0.4))
# [1.65, 5.55, 5.7]

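Note how 5.7 was separated from 5.5 and 5.6, because 5.5 and 5.6 were rounded down to 13*0.4, whereas 5.7 was rounded down to 14*0.4. (Mathematically 5.6 equals 14*0.4, but the float 5.6 is slightly smaller than the float 14*0.4, so 5.6 // 0.4 evaluates to 13.0.) Values close to a bucket boundary can therefore end up in different groups.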
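The sorted plus itertools.groupby alternative mentioned above could look roughly like this. It is only a sketch: `clusters_groupby` is a hypothetical name, and it assumes `r` is positive so that the bucket key `x // r` grows with `x`.

```python
from itertools import groupby


def clusters_groupby(data, r):
    """Group values that floor to the same multiple of r (assumes r > 0)."""
    def bucket(x):
        return x // r

    # groupby only merges adjacent items, so the data must be sorted first;
    # for r > 0 sorting by value also sorts by bucket
    return [list(group) for _, group in groupby(sorted(data), key=bucket)]


print(clusters_groupby([25, 1, 12, 2, 11], 10))  # [[1, 2], [11, 12], [25]]
print(clusters_groupby([1.6, 1.7, 5.6, 5.7, 5.5], 0.4))
```

Compared with the dict of lists, this returns the groups in sorted order, at the cost of sorting the input.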

If you know the number of clusters

You mentioned in the comments that there will always be 2 clusters. I suggest just looking for the greatest gap between two consecutive numbers in the sorted list, and splitting on that gap:

def split_in_2_clusters(data):
    seq = sorted(data)
    split_index = max(range(1, len(seq)), key=lambda i: seq[i] - seq[i-1])
    return seq[:split_index], seq[split_index:]

def means_of_2_clusters(data):
    return tuple(sum(g) / len(g) for g in split_in_2_clusters(data))

print(means_of_2_clusters([1.6, 1.7, 5.6, 5.7, 5.5]))
# (1.65, 5.6000000000000005)
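If the number of clusters is known but is not 2, the same idea can be extended by cutting at the k-1 largest gaps. The following is a sketch under that assumption, not part of the original answer; `split_in_k_clusters` is a hypothetical helper name.

```python
def split_in_k_clusters(data, k):
    """Split sorted data into k groups by cutting at the k-1 largest gaps."""
    seq = sorted(data)
    if not 1 <= k <= len(seq):
        raise ValueError("k must be between 1 and len(data)")
    # indices i ordered by the size of the gap seq[i] - seq[i-1], largest first
    by_gap = sorted(range(1, len(seq)),
                    key=lambda i: seq[i] - seq[i - 1],
                    reverse=True)
    # keep the k-1 widest gaps and restore left-to-right order
    cut_points = sorted(by_gap[:k - 1])
    bounds = [0] + cut_points + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


print(split_in_k_clusters([1, 2, 10, 11, 30], 3))  # [[1, 2], [10, 11], [30]]
```

With k=2 this reduces to splitting on the single greatest gap, as in the function above.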

For more complex clustering problems

I strongly suggest taking a look at the clustering algorithms implemented in the scikit-learn library. The documentation page lists the algorithms in a table that shows which parameters each algorithm expects, so you can easily choose the one best suited to your situation.

  • scikit-learn: Clustering algorithms

Grouping a set of like numbers in a Python List

Loop through the data and increment peaks only once per peak, that is, once for each run of consecutive non-zero values:

counted_peak = False
peaks = 0
for v in data:
    if v:
        if not counted_peak:
            peaks += 1
            # set to True so that it only increments once for each peak
            counted_peak = True
    else:
        counted_peak = False
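The same count can also be sketched with itertools.groupby, treating a peak as a run of consecutive truthy (non-zero) values. `count_peaks` is a hypothetical name, and the sample data is made up for illustration.

```python
from itertools import groupby


def count_peaks(data):
    """Count runs of consecutive truthy values in data."""
    # groupby(data, key=bool) yields one (key, run) pair per run of
    # equal truthiness; count only the runs whose key is True
    return sum(1 for is_peak, _ in groupby(data, key=bool) if is_peak)


print(count_peaks([0, 0, 3, 5, 0, 2, 0, 0, 7, 7]))  # 3
```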

Finding clusters of numbers in a list

Not strictly necessary if your list is small, but I'd probably approach this in a "stream-processing" fashion: define a generator that takes your input iterable, and yields the elements grouped into runs of numbers differing by <= 15. Then you can use that to generate your dictionary easily.

def grouper(iterable):
    prev = None
    group = []
    for item in iterable:
        if prev is None or item - prev <= 15:
            group.append(item)
        else:
            yield group
            group = [item]
        prev = item
    if group:
        yield group

numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
dict(enumerate(grouper(numbers), 1))

prints:

{1: [123, 124, 128],
2: [160, 167],
3: [213, 215, 230, 245, 255, 257],
4: [400, 401, 402],
5: [430]}

As a bonus, this even lets you group runs from potentially infinite iterables (as long as they're sorted, of course). As a minor enhancement, you could also move the index generation into the generator itself instead of using enumerate.
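A sketch of that enhancement might look like the following. The name `indexed_grouper` and the `maxgap` and `start` parameters are assumptions added for illustration; the original answer hard-codes the gap of 15.

```python
def indexed_grouper(iterable, maxgap=15, start=1):
    """Yield (index, group) pairs for runs of sorted numbers whose
    successive differences are <= maxgap."""
    index = start
    group = []
    for item in iterable:
        # close the current run when the gap to its last element is too large
        if group and item - group[-1] > maxgap:
            yield index, group
            index += 1
            group = []
        group.append(item)
    if group:
        yield index, group


numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
print(dict(indexed_grouper(numbers)))
```

Because each pair is yielded as soon as a run ends, this still works lazily on long or unbounded sorted inputs.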


