Grouping / clustering numbers in Python
There are many ways to do cluster analysis. One simple approach is to look at the gap size between successive data elements:
def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
       differ by no more than *maxgap*

        >>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        >>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
    if not data:
        return []
    # Sort a copy so the caller's list is left untouched
    data = sorted(data)
    groups = [[data[0]]]
    for x in data[1:]:
        # Compare with the last element of the current group
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

if __name__ == '__main__':
    import doctest
    print(doctest.testmod())
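One property of the gap-based approach is worth keeping in mind: each element is compared only with its immediate predecessor, so a group can span far more than maxgap in total. A minimal sketch of that chaining effect, using a compact helper named gap_cluster (a hypothetical name, written here only to mirror the function above):

```python
def gap_cluster(data, maxgap):
    """Compact restatement of the gap-based grouping, for illustration only."""
    groups = []
    for x in sorted(data):
        # Extend the current group while the step from its last element is small.
        if groups and x - groups[-1][-1] <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups


chained = gap_cluster([1, 8, 15, 22, 60], maxgap=10)
print(chained)                             # [[1, 8, 15, 22], [60]]
print(max(chained[0]) - min(chained[0]))   # 21, larger than maxgap
```

If the total spread of each group matters rather than the step between neighbours, the comparison has to be made against the first element of the group instead.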
Grouping / clustering a list of numbers so that the min-max gap of each subset is always less than a cutoff in Python
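Following @j_random_hacker's answer, I simply changed my code to:
def cluster(data, cutoff):
    data.sort()
    res = []
    old_x = None  # first (smallest) element of the current group
    for x in data:
        # Start a new group when x is too far from the group's first element
        if old_x is None or abs(x - old_x) > cutoff:
            res.append([x])
            old_x = x
        else:
            res[-1].append(x)
    return res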
Now it works as expected:
>>> print(all([(max(s) - min(s)) < cutoff for s in res]))
True
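The check above refers to res and cutoff from the asker's session. A self-contained sketch of the same idea, assuming a hypothetical helper named span_cluster that compares every value with the first element of its group: note that with a `<=` comparison a group's span can be exactly equal to the cutoff, so the comparison would need to be strict if the span must stay strictly below it.

```python
def span_cluster(data, cutoff):
    """Group sorted values so each group's max - min stays within cutoff."""
    groups = []
    for x in sorted(data):
        # groups[-1][0] is the smallest value of the current group.
        if groups and x - groups[-1][0] <= cutoff:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups


cutoff = 10
res = span_cluster([1, 8, 15, 22, 60], cutoff)
print(res)                                          # [[1, 8], [15, 22], [60]]
print(all(max(g) - min(g) <= cutoff for g in res))  # True
```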
Group data by a given Range
If you know the "range" around each cluster mean
A possible simple solution: round every value to a multiple of your "range" parameter; group values that are rounded to the same multiple.
To group, you can use a combination of sorted and itertools.groupby, or more simply, you can use a dict of lists.
from collections import defaultdict

def clusters(data, r):
    groups = defaultdict(list)
    for x in data:
        groups[x // r].append(x)
    return groups

def means_of_clusters(data, r):
    return [sum(g) / len(g) for g in clusters(data, r).values()]

print( means_of_clusters([1.6, 1.7, 5.6, 5.7, 5.5], 0.4) )
# [1.65, 5.55, 5.7]
Note how 5.7 was separated from 5.5 and 5.6, because 5.5 and 5.6 were rounded to 13*0.4, whereas 5.7 was rounded to 14*0.4.
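The sorted plus itertools.groupby combination mentioned earlier could look roughly like the sketch below. The name clusters_groupby is hypothetical, and integers are used in the demo so that floating-point rounding does not affect which bucket a value lands in.

```python
from itertools import groupby


def clusters_groupby(data, r):
    """Group values by the multiple of r they floor-divide to."""
    def bucket(x):
        return x // r

    # groupby only merges adjacent items, so sort by the same key first.
    ordered = sorted(data, key=bucket)
    return {k: list(g) for k, g in groupby(ordered, key=bucket)}


print(clusters_groupby([1, 2, 11, 12, 25], 10))
# {0: [1, 2], 1: [11, 12], 2: [25]}
```

The dict-of-lists version avoids the sort, so it is usually the simpler choice unless the data is already ordered.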
If you know the number of clusters
You mentioned in the comments that there will always be 2 clusters. I suggest just looking for the greatest gap between two consecutive numbers in the sorted list, and splitting on that gap:
def split_in_2_clusters(data):
    seq = sorted(data)
    split_index = max(range(1, len(seq)), key=lambda i: seq[i] - seq[i-1])
    return seq[:split_index], seq[split_index:]

def means_of_2_clusters(data):
    return tuple(sum(g) / len(g) for g in split_in_2_clusters(data))

print( means_of_2_clusters([1.6, 1.7, 5.6, 5.7, 5.5]) )
# (1.65, 5.6000000000000005)
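If the number of clusters is known but is not 2, the same idea can be stretched to k clusters by cutting at the k - 1 largest gaps. This is a sketch of one possible generalisation, not part of the original answer; the function name split_in_k_clusters and its error handling are assumptions.

```python
def split_in_k_clusters(data, k):
    """Split sorted data at the k - 1 largest gaps (hypothetical generalisation)."""
    seq = sorted(data)
    if not 1 <= k <= len(seq):
        raise ValueError("k must be between 1 and len(data)")
    # Positions i where the gap seq[i] - seq[i - 1] sits, largest gaps first.
    by_gap = sorted(range(1, len(seq)),
                    key=lambda i: seq[i] - seq[i - 1],
                    reverse=True)
    cuts = sorted(by_gap[:k - 1])
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


print(split_in_k_clusters([1, 2, 3, 10, 11, 30, 31, 32], 3))
# [[1, 2, 3], [10, 11], [30, 31, 32]]
```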
For more complex clustering problems
I strongly suggest taking a look at the clustering algorithms implemented in the scikit-learn library. The documentation page lists the algorithms in a table that explains which parameters each algorithm expects, so you can choose the one best suited to your situation.
- scikit-learn: Clustering algorithms
Grouping a set of like numbers in a Python List
Loop through the data and increment peaks only if that has not already been done for the current peak:
counted_peak = False
peaks = 0
for v in data:
    if v:
        if not counted_peak:
            peaks += 1
            # set to True so that it only increments once for each peak
            counted_peak = True
    else:
        counted_peak = False
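The same count can also be expressed with itertools.groupby, which collapses consecutive items of equal truthiness into a single run. This is an alternative sketch rather than the original answer's method; the function name count_peaks and the sample data are made up for illustration.

```python
from itertools import groupby


def count_peaks(data):
    """Count runs of consecutive truthy values."""
    # Each (is_peak, run) pair covers one stretch of adjacent items
    # that share the same truthiness; count only the truthy stretches.
    return sum(1 for is_peak, _ in groupby(data, key=bool) if is_peak)


print(count_peaks([0, 3, 4, 0, 0, 7, 0, 2, 2, 2]))  # 3
```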
Finding clusters of numbers in a list
Not strictly necessary if your list is small, but I'd probably approach this in a "stream-processing" fashion: define a generator that takes your input iterable, and yields the elements grouped into runs of numbers differing by <= 15. Then you can use that to generate your dictionary easily.
def grouper(iterable):
    prev = None
    group = []
    for item in iterable:
        if prev is None or item - prev <= 15:
            group.append(item)
        else:
            yield group
            group = [item]
        prev = item
    if group:
        yield group

numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
dict(enumerate(grouper(numbers), 1))
prints:
{1: [123, 124, 128],
 2: [160, 167],
 3: [213, 215, 230, 245, 255, 257],
 4: [400, 401, 402],
 5: [430]}
As a bonus, this lets you even group your runs for potentially-infinite lists (as long as they're sorted, of course). You could also stick the index generation part into the generator itself (instead of using enumerate) as a minor enhancement.
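A minimal sketch of that enhancement, with the index generated inside the generator. The name numbered_groups and the max_gap and start parameters are assumptions added for illustration; the answer itself hard-codes the gap of 15.

```python
def numbered_groups(iterable, max_gap=15, start=1):
    """Yield (index, group) pairs for runs whose neighbours differ by <= max_gap."""
    index = start
    group = []
    for item in iterable:
        # group[-1] is the previous item; an empty group accepts anything.
        if not group or item - group[-1] <= max_gap:
            group.append(item)
        else:
            yield index, group
            index += 1
            group = [item]
    if group:
        yield index, group


numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
result = dict(numbered_groups(numbers))
print(result)
```

Because it is still a generator, it can be consumed lazily, for example with itertools.islice, when the input is very long or unbounded.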