Cluster Analysis in R: Determine the Optimal Number of Clusters

Calculating the optimal number of clusters with NbClust()

A simple way to determine the number of clusters is to look for the elbow in a plot of the within-groups sum of squares and/or the peak in a plot of the average silhouette width. The code below produces simple plots for examining both...

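Before you can perform clustering, you need to solve the problem of NaNs after scaling (scale() returns NaN for any column with zero variance, because it divides by a standard deviation of 0)...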
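
One possible way to deal with those NaNs, sketched on a made-up data frame (the names df, const, keep and scaled_ok are only illustrative and not from the original question): drop the constant columns before calling scale().

```r
# Toy data: the column `const` has zero variance, so scale() turns it into NaN.
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(2, 4, 1, 3),
                 const = c(5, 5, 5, 5))

scaled_raw <- scale(df)
anyNA(scaled_raw)  # the `const` column is all NaN after scaling

# Keep only the columns whose standard deviation is defined and non-zero.
col_sd <- vapply(df, sd, numeric(1), na.rm = TRUE)
keep <- !is.na(col_sd) & col_sd > 0

scaled_ok <- as.matrix(scale(df[, keep, drop = FALSE]))
anyNA(scaled_ok)   # no NaN left, so kmeans() and pam() can use this matrix
```

Whether dropping such columns is appropriate depends on the data. Rows with missing values would need separate handling (for example na.omit()) before clustering.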

WKA_ohneJB_scaled <- as.matrix(scale(data[, c(-1, -2, -18)]))

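# Scree (elbow) plot: total within-cluster sum of squares for k = 1..max_i
plot_scree_clusters <- function(x) {
  max_i <- 10                # max clusters
  wss <- numeric(max_i)      # one entry per candidate k
  for (i in 1:max_i) {
    # nstart = 20 reruns k-means from 20 random starts and keeps the best fit
    km.model <- kmeans(x, centers = i, nstart = 20)
    wss[i] <- km.model$tot.withinss
  }
  plot(1:max_i, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}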

plot_scree_clusters(WKA_ohneJB_scaled)

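# Average silhouette width for k = 2..max_i (not defined for a single cluster)
plot_sil_width <- function(x) {
  max_i <- 10                # max clusters
  sw <- numeric(max_i)       # sw[1] stays unused and is dropped below
  for (i in 2:max_i) {
    # Cluster the data passed in as x, not a global object such as pc_comp$x
    pam.model <- cluster::pam(x = x, k = i)
    sw[i] <- pam.model$silinfo$avg.width
  }
  sw <- sw[-1]
  plot(2:max_i, sw, type = "b",
       xlab = "Number of Clusters",
       ylab = "Average silhouette width")
}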

plot_sil_width(WKA_ohneJB_scaled)

Hierarchical Clustering: Determine the optimal number of clusters and statistically describe the clusters

This is a very late answer and probably not useful for the asker anymore - but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices and then you can basically go with the number of clusters recommended by most indices. And yes, I think the basic statistics are the best way to describe clusters.
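
A minimal sketch of that majority-vote approach, assuming the NbClust package is installed and using the built-in iris data purely for illustration (the names x, res, votes and best_k are mine, not from the question). As far as I know, index = "all" computes the 26 indices mentioned above and skips the slower ones (gap, gamma, gplus, tau).

```r
library(NbClust)  # assumes install.packages("NbClust") has been run

x <- scale(iris[, 1:4])  # illustrative data, not the asker's

# Hierarchical clustering (Ward) evaluated for 2 to 8 clusters on all indices
res <- NbClust(data = x, distance = "euclidean",
               min.nc = 2, max.nc = 8,
               method = "ward.D2", index = "all")

# Best.nc holds one recommended number of clusters per index.
# Keep only real votes inside the tested range (graphical indices report 0).
votes <- res$Best.nc["Number_clusters", ]
votes <- votes[!is.na(votes) & votes >= 2 & votes <= 8]

vote_table <- table(votes)
best_k <- as.integer(names(vote_table)[which.max(vote_table)])
best_k

# Describe the clusters with basic statistics, here the mean per cluster
aggregate(iris[, 1:4], by = list(cluster = res$Best.partition), FUN = mean)
```

NbClust() also prints its own summary of how many indices proposed each number of clusters, so the manual vote count is mainly useful when you need the number programmatically.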

How to get the optimal number of clusters from the clusGap function as an output?

Typically such information is stored directly inside the object, as something like gap_stat$nc, and running str(gap_stat) would usually be enough to find it.

In this case, however, that strategy isn't enough. But the fact that you can see the number you are interested in when the object is printed means that print.clusGap (the print method used, because the class of gap_stat is clusGap) shows how to obtain it. Inspecting cluster:::print.clusGap leads to

maxSE(f = gap_stat$Tab[, "gap"], SE.f = gap_stat$Tab[, "SE.sim"])
# [1] 1
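
For completeness, a self-contained sketch of the whole round trip, assuming k-means on the built-in iris data (the data, K.max, B and the name k_opt are illustrative choices, not taken from the question). The method and SE.factor values are, as far as I can tell, the defaults that print.clusGap uses.

```r
library(cluster)

set.seed(42)             # clusGap() simulates reference data, so fix the seed
x <- scale(iris[, 1:4])  # illustrative data

# Gap statistic for k = 1..8; nstart = 20 is passed on to kmeans()
gap_stat <- clusGap(x, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 20)

# Same rule that the print method applies to the stored table
k_opt <- maxSE(f = gap_stat$Tab[, "gap"],
               SE.f = gap_stat$Tab[, "SE.sim"],
               method = "firstSEmax", SE.factor = 1)
k_opt
```

Other selection rules, such as method = "Tibs2001SEmax" or "globalmax", can give a different number, so it is worth stating which rule was used when reporting the result.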

