R: Clustering Results Are Different Everytime I Run

k-mode clustering in R returns different cluster sizes with each run

see https://stats.stackexchange.com/questions/58238/how-random-are-the-results-of-the-kmeans-algorithm

There is more than one k-means algorithm.

You probably refer to Lloyds algorithm, which only depends on the initial cluster centers. But there also is MacQueen's, which depends on the sequence i.e. ordering of points. Then there is Hartigan, Wong, Forgy,

various implementations may have implementation and optimization differences. They may treat ties differently, too! For example, many naive implementations will always assign elements to the first or last cluster when tied.

Furthermore, the clusters may end up being reordered by memory address after finishing k-means, so you cannot safely assume that cluster 1 remains cluster 1 even if k-means converged after the first iteration. Others will reorder clusters by cluster size (which actually makes sense for k-means, as this will more likely return the same result on different random initialization)

it really depends on what kind of data you have. If it is nicely split into spherical-shaped clusters then you will typically get very very similar clusters. If not, then you might get pretty random clusters each time.

set.seed(1)

Everytime K-Means initializes the centroid, it is generated randomly, which is needing seed for generating random values.

I'm getting different results every time I run my code

I would say, since you are using RandomlyGeneratedInitialMeans:

Initialize k-means by generating random vectors (within the data sets value range).

RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);

Yes, it is normal.

Finding clustering results in R

I am guessing you more or less know there's two clusters, and you want to see whether clustering gives you a good separation on the Median variable.

First we look at your data frame:

summary(productQuality1.1)
weld.type.ID weld.type alpha beta
Min. : 1 1,40,Material A : 1 Min. : 3.00 Min. : 140
1st Qu.: 9 1,40S,Material C : 1 1st Qu.: 9.00 1st Qu.: 403
Median :17 1,80,Material A : 1 Median : 24.00 Median : 621
Mean :17 1,STD,Material A : 1 Mean : 54.24 Mean :1383
3rd Qu.:25 1,XS,Material A : 1 3rd Qu.: 44.00 3rd Qu.:1744
Max. :33 10,10S,Material C: 1 Max. :442.00 Max. :7194
(Other) :27
Median
Min. :0.009892
1st Qu.:0.016689
Median :0.029403
Mean :0.036686
3rd Qu.:0.042695
Max. :0.139336

You can only use alpha and beta, since ID, weld.type are unique entries (like identifiers). We do:

clus = kmeans(productQuality1.1[,c("alpha","beta")],2)
productQuality1.1$cluster = factor(clus$cluster)

Note that I use your alpha and beta values are on very different scales to start with. And we can visualize the clustering:

ggplot(productQuality1.1,aes(x=alpha,y=beta,col=cluster)) + geom_point()

Sample Image

It's not going to be easy to cut these observations into 2 clusters just using kmeans because some of them have really high alpha / beta values. We can also look at how your median values are spread:

ggplot(productQuality1.1,aes(x=alpha,y=beta,col=Median)) +
geom_point() + scale_color_viridis_c()

Sample Image

Lastly we look at median values:

ggplot(productQuality1.1,aes(x=Median,col=cluster)) + geom_density()

Sample Image

I would say there are some in cluster 2 with a higher median, but some which you don't separate that easily. Given what we see in the scatter plots, might have to think more about how to use the alpha and beta values you have.



Related Topics



Leave a reply



Submit