Cluster Analysis in R: Determine the Optimal Number of Clusters

Calculating optimal number of clusters with NbClust()

A simple way to determine the number of clusters is to look for the elbow in a plot of the within-groups sum of squares and/or the peak in the average silhouette width. The `plot_scree_clusters()` and `plot_sil_width()` functions below produce simple plots for examining these...

Before you can cluster, you need to resolve the `NaN`s that appear after scaling...
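One common cause of `NaN`s after `scale()` is a constant column: its standard deviation is 0, so centering and dividing gives 0/0. Whether that is what happened in the original data is an assumption. The sketch below uses a hypothetical toy data frame (`df`, with a constant column `c`) to show the effect and one way around it, namely dropping zero-variance columns before scaling.

```r
# Hypothetical toy data: column "c" is constant, so its standard deviation is 0
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(2, 4, 1, 3),
                 c = c(5, 5, 5, 5))

# Scaling the constant column computes (5 - 5) / 0, which is NaN
scaled_raw <- scale(df)
anyNA(scaled_raw)  # TRUE

# Keep only the columns whose standard deviation is positive, then scale
keep <- vapply(df, function(col) sd(col, na.rm = TRUE) > 0, logical(1))
scaled_ok <- scale(df[, keep, drop = FALSE])
anyNA(scaled_ok)   # FALSE
```

If the `NaN`s come from missing values in the raw data rather than from constant columns, this filter will not help, and the rows or columns with `NA`s need to be handled separately.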

```r
# Scale the clustering variables (columns 1, 2 and 18 are excluded)
WKA_ohneJB_scaled <- as.matrix(scale(data[, c(-1, -2, -18)]))

# Elbow plot: total within-cluster sum of squares for k = 1..max_i (k-means)
plot_scree_clusters <- function(x) {
  max_i <- 10 # max clusters
  wss <- numeric(max_i)
  for (i in 1:max_i) {
    km.model <- kmeans(x, centers = i, nstart = 20)
    wss[i] <- km.model$tot.withinss
  }
  plot(1:max_i, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
plot_scree_clusters(WKA_ohneJB_scaled)

# Average silhouette width for k = 2..max_i (PAM)
plot_sil_width <- function(x) {
  max_i <- 10 # max clusters
  sw <- numeric(max_i)
  for (i in 2:max_i) {
    # Cluster the function argument x (the original referenced an undefined pc_comp$x)
    pam.model <- cluster::pam(x = x, k = i)
    sw[i] <- pam.model$silinfo$avg.width
  }
  sw <- sw[-1] # drop the unused k = 1 slot
  plot(2:max_i, sw, type = "b",
       xlab = "Number of Clusters",
       ylab = "Average silhouette width")
}
plot_sil_width(WKA_ohneJB_scaled)
```
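If you would rather read the silhouette result off as a number than from a plot, the same loop can return the `k` with the largest average silhouette width. This is only a minimal sketch: the scaled `iris` measurements stand in for the real data (an assumption), and `k_values`, `avg_sil` and `best_k` are illustrative names.

```r
library(cluster)

# Stand-in data (assumption): the four numeric iris columns, scaled
x <- scale(iris[, 1:4])

# Average silhouette width of a PAM clustering for each candidate k
k_values <- 2:10
avg_sil <- vapply(k_values,
                  function(k) pam(x, k = k)$silinfo$avg.width,
                  numeric(1))

# The k with the widest average silhouette
best_k <- k_values[which.max(avg_sil)]
best_k
```

The elbow in the sum-of-squares plot has no equally simple numeric rule, so it is usually still judged by eye.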

Hierarchical Clustering: Determine optimal number of cluster and statistically describe Clusters

This is a very late answer and probably no longer useful to the asker, but it may help others. Check out the NbClust package. With `index = "all"` it computes 26 indices (30 with `index = "alllong"`), each of which recommends a number of clusters, and you can also choose the clustering method. You can run it so that you get the results for all the indices and then go with the number of clusters that most of them recommend. And yes, I think basic descriptive statistics are the best way to describe clusters.
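A minimal sketch of that workflow, assuming the NbClust package is installed and using the scaled `iris` measurements as stand-in data. The choice of `ward.D2` linkage and the range 2 to 8 are illustrative, not taken from the original question.

```r
# Assumes NbClust is installed: install.packages("NbClust")
library(NbClust)

# Stand-in data (assumption): the four numeric iris columns, scaled
x <- scale(iris[, 1:4])

# Hierarchical clustering (Ward linkage), evaluated for 2 to 8 clusters
res <- NbClust(x, distance = "euclidean", min.nc = 2, max.nc = 8,
               method = "ward.D2", index = "all")

# The first row of Best.nc holds the number of clusters each index proposes;
# graphical indices report 0, so those entries are dropped before counting
nc <- res$Best.nc[1, ]
votes <- table(nc[!is.na(nc) & nc > 0])
best_k <- as.integer(names(votes)[which.max(votes)])
best_k

# Describe the clusters with basic statistics: per-cluster means
aggregate(iris[, 1:4], by = list(cluster = res$Best.partition), FUN = mean)
```

`res$Best.partition` is the partition NbClust returns for its own majority choice, so it can be passed straight to `aggregate()` or `table()` for a summary.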

How to get the optimal number of clusters from the clusGap function as an output?

Typically such information is stored directly in the object, as something like `gap_stat$nc`, and `str(gap_stat)` is usually enough to find it.

In this case, however, that strategy isn't enough, because the number is not stored anywhere in the object. But since you can see the number you want in the printed output, `print.clusGap` (the class of `gap_stat` is `clusGap`) must compute it, and its source shows how. Inspecting `cluster:::print.clusGap` leads to

```r
library(cluster)
maxSE(f = gap_stat$Tab[, "gap"], SE.f = gap_stat$Tab[, "SE.sim"])
# [1] 1
```
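For completeness, here is a self-contained sketch of the whole path from `clusGap()` to a single number. The data (scaled `iris`), the `K.max` and `B` values, and the name `k_opt` are assumptions made for illustration; `maxSE()` is called with its default `method = "firstSEmax"`, which is also the default used when printing.

```r
library(cluster)

set.seed(42) # the reference data sets are simulated, so fix the seed

# Stand-in data (assumption): the four numeric iris columns, scaled
x <- scale(iris[, 1:4])

# Gap statistic for k = 1..8 using k-means; nstart is passed on to kmeans()
gap_stat <- clusGap(x, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 20)

# The object stores the table of gap values, but not the chosen k
names(gap_stat)
colnames(gap_stat$Tab)

# Compute the chosen number of clusters the same way print.clusGap does
k_opt <- maxSE(f = gap_stat$Tab[, "gap"],
               SE.f = gap_stat$Tab[, "SE.sim"],
               method = "firstSEmax")
k_opt
```

Other rules, such as `method = "Tibs2001SEmax"` or `"globalmax"`, can give a different answer on the same table, so it is worth stating which one was used.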