Grouping/Clustering Numbers in Python

Grouping / clustering numbers in Python

There are many ways to do cluster analysis. One simple approach is to look at the gap size between successive data elements:

def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
       differ by no more than *maxgap*

        >>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        >>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
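    data = sorted(data)  # sort a copy so the caller's list is not modified
    if not data:
        return []  # nothing to group
    groups = [[data[0]]]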
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

if __name__ == '__main__':
    import doctest
    print(doctest.testmod())

Grouping / clustering a list of numbers so that the min-max gap of each subset is always less than a cutoff in Python

Following @j_random_hacker's answer, I simply changed my code to:

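def cluster(data, cutoff):
    res = []
    old_x = float('-inf')  # sentinel: the first element always starts a group
    for x in sorted(data):  # sort a copy; leave the caller's list untouched
        # old_x is the smallest element of the current group
        if x - old_x >= cutoff:
            res.append([x])
            old_x = x
        else:
            res[-1].append(x)
    return res

Now it works as expected: the span (max - min) of every group is strictly less than the cutoff.

>>> cutoff = 10
>>> res = cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], cutoff)
>>> print(all((max(s) - min(s)) < cutoff for s in res))
True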
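
To make the difference from the successive-gap approach concrete, here is a small sketch that runs both rules on evenly spaced data. The names cluster_by_gap and cluster_by_span are illustrative only and do not come from the answers above; cluster_by_span assumes the strict "less than the cutoff" reading of the title.

```python
def cluster_by_gap(data, maxgap):
    """Start a new group when the gap to the previous element exceeds maxgap."""
    groups = []
    for x in sorted(data):
        if groups and x - groups[-1][-1] <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups


def cluster_by_span(data, cutoff):
    """Start a new group when x is cutoff or more above the group's minimum."""
    groups = []
    for x in sorted(data):
        if groups and x - groups[-1][0] < cutoff:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups


data = [0, 4, 8, 12, 16, 20]
by_gap = cluster_by_gap(data, 5)
by_span = cluster_by_span(data, 5)
print(by_gap)   # one long chain: every successive gap is 4
print(by_span)  # several short groups, each spanning less than 5
```

The gap rule lets a group grow without bound as long as neighbours stay close, while the span rule caps the total width of each group.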

Group data by a given Range

If you know the "range" around each cluster mean

A possible simple solution: round every value down to a multiple of your "range" parameter, then group the values that land on the same multiple.

To group, you can use a combination of sorted and itertools.groupby, or more simply, you can use a dict of lists.

from collections import defaultdict

def clusters(data, r):
    groups = defaultdict(list)
    for x in data:
        groups[x // r].append(x)
    return groups

def means_of_clusters(data, r):
    return [sum(g) / len(g) for g in clusters(data, r).values()]

print( means_of_clusters([1.6, 1.7, 5.6, 5.7, 5.5], 0.4) )
# [1.65, 5.55, 5.7]

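Note how 5.7 was separated from 5.5 and 5.6: the bucket boundaries are fixed at multiples of 0.4, so 5.5 and 5.6 were rounded down to 13*0.4, whereas 5.7 was rounded down to 14*0.4. (Mathematically 5.6 equals 14*0.4, but in floating point 5.6 // 0.4 evaluates to 13.0, so values sitting on a boundary can land on either side.)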
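
For comparison, the sorted plus itertools.groupby variant mentioned above might look like the sketch below. The name clusters_groupby is made up for this example, and integer data is used so that floating-point bucket boundaries do not get in the way.

```python
from itertools import groupby


def clusters_groupby(data, r):
    """Group values by the bucket index x // r using sorted + groupby."""
    def bucket(x):
        return x // r

    # groupby only merges adjacent items, so sort by the same key first
    ordered = sorted(data, key=bucket)
    return {k: list(g) for k, g in groupby(ordered, key=bucket)}


result = clusters_groupby([1, 2, 11, 12, 25], 10)
print(result)
```

The dict-of-lists version does a single pass without sorting, which is why it is the simpler choice when the order of the buckets does not matter.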

If you know the number of clusters

You mentioned in the comments that there will always be 2 clusters. I suggest just looking for the greatest gap between two consecutive numbers in the sorted list, and splitting on that gap:

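def split_in_2_clusters(data):
    seq = sorted(data)
    if len(seq) < 2:
        raise ValueError("need at least two values to form two clusters")
    # index of the element that comes right after the largest gap
    split_index = max(range(1, len(seq)), key=lambda i: seq[i] - seq[i-1])
    return seq[:split_index], seq[split_index:]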

def means_of_2_clusters(data):
    return tuple(sum(g) / len(g) for g in split_in_2_clusters(data))

print( means_of_2_clusters([1.6, 1.7, 5.6, 5.7, 5.5]) )
# (1.65, 5.6000000000000005)
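
If the known number of clusters is some k other than 2, the same idea can be stretched by cutting at the k - 1 largest gaps. This is only a sketch under that assumption; split_in_k_clusters is a hypothetical helper, not part of the original answer.

```python
def split_in_k_clusters(data, k):
    """Split sorted data at its k - 1 largest gaps."""
    seq = sorted(data)
    if not 1 <= k <= len(seq):
        raise ValueError("k must be between 1 and len(data)")
    # candidate cut positions i (between seq[i-1] and seq[i]), largest gap first
    by_gap = sorted(range(1, len(seq)),
                    key=lambda i: seq[i] - seq[i - 1],
                    reverse=True)
    cuts = sorted(by_gap[:k - 1])
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


parts = split_in_k_clusters([1, 2, 10, 11, 30, 31, 32], 3)
print(parts)
```

With k=2 this reduces to a single cut at the largest gap, as in split_in_2_clusters.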

For more complex clustering problems

I strongly suggest taking a look at the clustering algorithms implemented in the scikit-learn library. The documentation page lists the algorithms in a table that shows which parameters each algorithm expects, so you can easily choose the one best suited to your situation.

scikit-learn: Clustering algorithms

Grouping a set of like numbers in a Python List

Loop through the data and increment peaks only once per peak, that is, once for each run of consecutive nonzero (truthy) values:

counted_peak = False
peaks = 0
for v in data:
    if v:
        if not counted_peak:
            peaks += 1
            # set to True so that it only increments once for each peak
            counted_peak = True
    else:
        counted_peak = False
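
The same count can also be written with itertools.groupby. This is a sketch that assumes, like the loop above, that a "peak" is any run of consecutive truthy values; count_peaks is an illustrative name.

```python
from itertools import groupby


def count_peaks(data):
    """Count runs of consecutive truthy (nonzero) values."""
    # groupby collapses adjacent items with the same truthiness into one run
    return sum(1 for is_peak, _ in groupby(data, key=bool) if is_peak)


peaks = count_peaks([0, 0, 3, 5, 0, 2, 2, 0, 0, 7])
print(peaks)
```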

Finding clusters of numbers in a list

Not strictly necessary if your list is small, but I'd probably approach this in a "stream-processing" fashion: define a generator that takes your input iterable, and yields the elements grouped into runs of numbers differing by <= 15. Then you can use that to generate your dictionary easily.

def grouper(iterable):
    prev = None
    group = []
    for item in iterable:
        if prev is None or item - prev <= 15:
            group.append(item)
        else:
            yield group
            group = [item]
        prev = item
    if group:
        yield group

numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
dict(enumerate(grouper(numbers), 1))

This gives:

{1: [123, 124, 128],
 2: [160, 167],
 3: [213, 215, 230, 245, 255, 257],
 4: [400, 401, 402],
 5: [430]}

As a bonus, this even lets you group runs from potentially infinite iterables (as long as they're sorted, of course), because the generator only ever holds the current group in memory. You could also move the index generation into the generator itself (instead of using enumerate) as a minor enhancement.
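
A sketch of that enhancement follows, with the gap threshold turned into a parameter as well. The name indexed_grouper and the maxgap and start parameters are assumptions made for this example; the input is still expected to be sorted.

```python
from itertools import count, islice


def indexed_grouper(iterable, maxgap=15, start=1):
    """Yield (index, group) pairs for runs whose successive gaps are <= maxgap."""
    index = start
    group = []
    for item in iterable:
        # group[-1] is the previous item, so no separate prev variable is needed
        if group and item - group[-1] > maxgap:
            yield index, group
            index += 1
            group = []
        group.append(item)
    if group:
        yield index, group


numbers = [123, 124, 128, 160, 167, 213, 215, 230]
result = dict(indexed_grouper(numbers))
print(result)

# Works lazily on an unbounded sorted stream: 0, 1, 2, 100, 101, 102, 200, ...
stream = (100 * (n // 3) + n % 3 for n in count())
first_two = list(islice(indexed_grouper(stream), 2))
print(first_two)
```

Note that a group is only emitted once the first element of the next group has been seen, so on an endless stream the final, still-open group is never yielded.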
