
# Cluster Analysis in R: Determine the Optimal Number of Clusters

## Calculating the optimal number of clusters with NbClust()

A simple way to determine the number of clusters is to look for the elbow in a plot of the within-groups sum of squares, and/or to look at the average silhouette width. The code produces simple plots to examine these...

Before you can perform clustering, you need to solve the problem of `NaN`s after scaling...
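The original does not say where the `NaN`s came from. One common cause, assumed here, is a column with zero variance: `scale()` divides each centered column by its standard deviation, so a constant column becomes `0 / 0`. A minimal sketch on a hypothetical toy data frame (`df` and its columns are made up for illustration):

```r
# Hypothetical toy data: column `b` is constant, so its standard deviation is 0
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(5, 5, 5, 5),
                 c = c(2, 4, 6, 9))

scaled_raw <- scale(df)
has_nan <- anyNA(scaled_raw)  # TRUE: the constant column turns into NaN (0 / 0)

# Keep only columns whose standard deviation is strictly positive
keep <- vapply(df, function(col) isTRUE(sd(col, na.rm = TRUE) > 0), logical(1))
scaled_ok <- as.matrix(scale(df[, keep, drop = FALSE]))
```

If the `NaN`s have a different origin (for example missing values in the raw data), they need to be handled before scaling instead.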

```r
# Scale the data, dropping columns 1, 2 and 18
WKA_ohneJB_scaled <- as.matrix(scale(data[, c(-1, -2, -18)]))

# Elbow (scree) plot: total within-cluster sum of squares for k = 1..max_i
plot_scree_clusters <- function(x) {
  max_i <- 10 # max clusters
  wss <- numeric(max_i)
  for (i in 1:max_i) {
    km.model <- kmeans(x, centers = i, nstart = 20)
    wss[i] <- km.model$tot.withinss
  }
  plot(1:max_i, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
plot_scree_clusters(WKA_ohneJB_scaled)

# Average silhouette width of PAM clusterings for k = 2..max_i
plot_sil_width <- function(x) {
  max_i <- 10 # max clusters
  sw <- numeric(max_i)
  for (i in 2:max_i) {
    pam.model <- cluster::pam(x = x, k = i) # cluster the function argument
    sw[i] <- pam.model$silinfo$avg.width
  }
  sw <- sw[-1] # the silhouette is undefined for k = 1
  plot(2:max_i, sw, type = "b",
       xlab = "Number of Clusters",
       ylab = "Average silhouette width")
}
plot_sil_width(WKA_ohneJB_scaled)
```

## Hierarchical Clustering: Determine the optimal number of clusters and statistically describe the clusters

This is a very late answer and probably no longer useful for the asker, but maybe for others. Check out the package NbClust. It provides 30 indices (according to its current documentation), each of which recommends a number of clusters, and you can also choose the clustering method. You can run it so that you get the results for all indices and then go with the number of clusters recommended by most of them. And yes, I think basic statistics are the best way to describe the clusters.
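As a rough sketch of that workflow, the following runs `NbClust()` with all indices on the built-in `iris` measurements (example data, not the asker's), takes a majority vote over the recommended cluster counts, and then describes the clusters with per-cluster means. The choice of `ward.D2` linkage and the range 2 to 8 are assumptions made for illustration.

```r
library(NbClust)  # assumed to be installed from CRAN

x <- scale(iris[, 1:4])  # example data, standardized

set.seed(42)
res <- NbClust(data = x, distance = "euclidean",
               min.nc = 2, max.nc = 8,
               method = "ward.D2", index = "all")

# Majority vote over the number of clusters proposed by each index;
# entries below 2 (e.g. from the purely graphical indices) are ignored
nc <- res$Best.nc["Number_clusters", ]
nc <- nc[!is.na(nc) & nc >= 2]
votes <- table(nc)
best_k <- as.integer(names(votes)[which.max(votes)])

# Describe the clusters with basic statistics (here: per-cluster means)
cluster_means <- aggregate(iris[, 1:4],
                           by = list(cluster = res$Best.partition),
                           FUN = mean)
```

With `index = "all"`, `NbClust()` also prints its own majority-rule summary and draws diagnostic plots, so the manual vote mainly shows where the numbers live in the result object.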

## How to get the optimal number of clusters from the clusGap function as an output?

Typically such information is stored directly inside the object, as something like `gap_stat$nc`, and running `str(gap_stat)` is usually enough to find it.

In this case, however, that strategy is not enough. But since the number you are interested in appears in the printed output, the print method `print.clusGap` (the class of `gap_stat` is `clusGap`) must compute it, so its source shows how to obtain the number. Inspecting `cluster:::print.clusGap` leads to

```r
maxSE(f = gap_stat$Tab[, "gap"], SE.f = gap_stat$Tab[, "SE.sim"])
# [1] 1
```
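For a self-contained illustration, the sketch below builds a `clusGap` object from the built-in `iris` measurements (example data, with `K.max`, `B` and `nstart` chosen arbitrarily) and stores the result of `maxSE()` in a variable. `method = "firstSEmax"` and `SE.factor = 1` are spelled out because they are the defaults that `print.clusGap` uses.

```r
library(cluster)

set.seed(123)
x <- scale(iris[, 1:4])  # example data, standardized

# Gap statistic for k = 1..8 using k-means; nstart is passed on to kmeans()
gap_stat <- clusGap(x, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 20)

# Same rule that print(gap_stat) applies, but returned as a plain number
k_opt <- maxSE(f = gap_stat$Tab[, "gap"],
               SE.f = gap_stat$Tab[, "SE.sim"],
               method = "firstSEmax", SE.factor = 1)
```

Passing a different `method` (for example `"Tibs2001SEmax"` or `"globalmax"`) can change the selected number, so it is worth using the same rule here as in the printed output you are comparing against.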