Cluster One-Dimensional Data Optimally

Which clustering algorithm is best for clustering one-dimensional features?

All of these methods are better suited to multivariate data. Except for k-means, which historically was used on one-dimensional data, they were all designed with the multivariate problem in mind, and none of them is well optimized for the particular case of one-dimensional data.

For one-dimensional data, use kernel density estimation (KDE). KDE works well in 1D and has strong statistical support, whereas it becomes hard to use for clustering in multiple dimensions.

1D Number Array Clustering

Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you might naively think, because you can actually sort the data, which makes things a lot easier.

In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.
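As a rough sketch of what natural breaks optimization means in practice: assuming the usual objective of minimizing the within-class sum of squared deviations, the best split of sorted values into k contiguous classes can be found with a simple dynamic program. The function name `natural_breaks` and its O(k·n²) structure are illustrative choices here, not an optimized or canonical implementation.

```python
def natural_breaks(data, k):
    """Split data into k contiguous classes (1 <= k <= len(data))
    minimizing the total within-class sum of squared deviations."""
    xs = sorted(data)
    n = len(xs)

    # Prefix sums of x and x^2 give the cost of any slice xs[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i, j):
        """Sum of squared deviations from the mean for xs[i:j], with j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: minimal total cost of splitting xs[:j] into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + sse(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    back[c][j] = i

    # Walk the back-pointers from the end to recover the classes.
    classes = []
    j = n
    for c in range(k, 0, -1):
        i = back[c][j]
        classes.append(xs[i:j])
        j = i
    classes.reverse()
    return classes
```

Because the data is sorted, only contiguous runs need to be considered, which is exactly what makes the one-dimensional case tractable. Unlike the KDE approach, this sketch still requires choosing k up front.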

You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation is also a good method to look at, with a strong statistical background. Local minima in the density are good places to split the data into clusters, and there are statistical reasons for doing so. KDE is perhaps the soundest method for clustering one-dimensional data.

With KDE, it again becomes obvious that one-dimensional data is much better behaved. In 1D, you have local minima; in 2D, you may have saddle points and similar "maybe" splitting points. See this Wikipedia illustration of a saddle point for how such a point may or may not be appropriate for splitting clusters.

See this answer for an example of how to do this in Python (green markers are the cluster modes; red markers are points where the data is cut; the y axis is the log-likelihood of the density):

KDE with Python
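For readers without access to the linked answer, here is a minimal standard-library sketch of the same idea. It assumes a Gaussian kernel, a bandwidth supplied by the caller, and a regular evaluation grid; the function names and the `grid_size` default are hypothetical choices for illustration, and this is not the code from the linked answer.

```python
import bisect
import math


def kde_density(data, bandwidth):
    """Return a function that evaluates a Gaussian KDE of data at a point x."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data
        )

    return density


def kde_split_points(data, bandwidth, grid_size=512):
    """Evaluate the KDE on a regular grid spanning the data range and
    return the grid positions that are local minima of the density."""
    lo, hi = min(data), max(data)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    density = kde_density(data, bandwidth)
    dens = [density(x) for x in grid]
    # Interior grid points lower than the left neighbour and not higher
    # than the right neighbour count as minima (one per flat valley).
    return [
        grid[i]
        for i in range(1, grid_size - 1)
        if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]
    ]


def kde_clusters(data, bandwidth, grid_size=512):
    """Cut the sorted data at the KDE minima and return the groups."""
    splits = kde_split_points(data, bandwidth, grid_size)
    clusters = [[] for _ in range(len(splits) + 1)]
    for v in sorted(data):
        clusters[bisect.bisect_right(splits, v)].append(v)
    return clusters
```

For example, `kde_clusters([1, 2, 101, 102, 201, 202], bandwidth=10.0)` separates the values into three groups. The bandwidth controls how aggressively nearby values are merged, so it has to be chosen to suit the data; a rule-of-thumb estimator could be substituted for the fixed value assumed here.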

Grouping or Clustering Algorithm

Why do you work on the pairwise differences? Consider the values 1, 2, 101, 102, 201, 202. The pairwise differences are 1, 100, 101, 200, 201, 99, 100, 199, 200, 1, 100, 101, 99, 100, 1.

The differences of ~200 carry no information: there is a different "cluster" in between. You shouldn't use them for your analysis.
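To make the point concrete, the short sketch below reproduces the pairwise differences for these six values and compares them with the gaps between neighbours in sorted order. The helper names are illustrative, and looking at adjacent gaps is only one simple way to see what sorting buys you; it is not the KDE method itself.

```python
from itertools import combinations


def pairwise_differences(values):
    """Differences between every pair of sorted values: n*(n-1)/2 numbers."""
    return [b - a for a, b in combinations(sorted(values), 2)]


def adjacent_gaps(values):
    """Differences between neighbours in sorted order: n-1 numbers."""
    xs = sorted(values)
    return [b - a for a, b in zip(xs, xs[1:])]


values = [1, 2, 101, 102, 201, 202]
print(pairwise_differences(values))  # 15 numbers, many of them redundant
print(adjacent_gaps(values))         # 5 numbers: small within groups, large between
```

The large differences that span an intermediate group simply never appear among the adjacent gaps, which is why working on the sorted values directly is both cheaper and cleaner.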

Instead, grab a statistics textbook and look up Kernel Density Estimation. Don't bother looking for clustering methods; these are usually designed for the multivariate case. Your data is one-dimensional. It can be sorted (it probably already is), and this can be exploited for better results.

There are well-established heuristics for density estimation on such data, and you can split your data at local minima of the density (or simply at a low density threshold). This is much simpler, yet robust and reliable. You don't need to set a parameter such as k for k-means. There are cases where k-means is a good choice: it has origins in signal detection, where it was known that there are k=10 different signal frequencies. Today, it is mostly used for multidimensional data.

See also:

  • Cluster one-dimensional data optimally?
  • 1D Number Array Clustering
  • partitioning an float array into similar segments (clustering)
  • What clustering algorithm to use on 1-d data?

cluster one-dimensional data using pvclust

First of all, let me state that none of these methods is meant for one-dimensional data.

For one-dimensional data, please use a method that exploits that the data can be sorted. For example, use a method based on kernel density estimation.

The term "cluster analysis" is usually used with multidimensional data only. In one dimension, there are much better methods. See also "natural breaks optimization", but IMHO you should be using kernel density estimation: split the data at local minima in the KDE.

Now to your actual question. Most likely the problem is that you are ... passing one-dimensional data, which is interpreted as one record with d dimensions; the method thus complains about having only a single sample. You may have success by first transposing your record.

With your hack of adding zero records, the result most likely becomes bogus. You are probably clustering a data set that has 1 vector that contains your data, and 3 vectors that are all zero...

But in the end, you should not be using these methods here anyway! Use a method that exploits that your data can be sorted.


