Simple Approach to Assigning Clusters for New Data After K-Means Clustering

Simple approach to assigning clusters for new data after k-means clustering

You could use the flexclust package, which provides a predict method for its k-means implementation (kcca):

library("flexclust")
data("Nclus")

set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)

dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE

cl1 <- kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91

pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])

image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")

[Figure: flexclust image plot of the cluster regions, with training points drawn as small coloured dots and test points as orange squares]

There are also conversion methods for turning the results of clustering functions such as stats::kmeans or cluster::pam into objects of class kcca, and vice versa:

as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
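
Here cl is an existing cluster fit (for instance from stats::kmeans) and x is the data it was fitted on; neither is shown above. A minimal sketch, assuming a two-column toy matrix, could look like this:

library("flexclust")

set.seed(1)
x  <- matrix(rnorm(200), ncol = 2)   # toy data standing in for the x above
cl <- kmeans(x, 2)                   # plain stats::kmeans fit

cl2  <- as.kcca(cl, data = x)        # convert to a kcca object
newx <- matrix(rnorm(20), ncol = 2)  # toy new observations
predict(cl2, newdata = newx)         # cluster assignments for the new rows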

Simple approach to assigning clusters for new data after k-modes clustering

We can use the distance measure that is used in the kmodes algorithm to assign each new row to its nearest cluster.

## From klaR::kmodes

distance <- function(mode, obj, weights) {
  ## Simple matching distance: without weights, count the attributes in
  ## which the observation differs from the cluster mode.
  if (is.null(weights))
    return(sum(mode != obj))
  obj <- as.character(obj)
  mode <- as.character(mode)
  different <- which(mode != obj)
  n_mode <- n_obj <- numeric(length(different))
  for (i in seq(along = different)) {
    weight <- weights[[different[i]]]
    names <- names(weight)
    n_mode[i] <- weight[which(names == mode[different[i]])]
    n_obj[i] <- weight[which(names == obj[different[i]])]
  }
  dist <- sum((n_mode + n_obj)/(n_mode * n_obj))
  return(dist)
}

AssignCluster <- function(df, kmeansObj) {
  ## kmeansObj is the object returned by klaR::kmodes; its $modes component
  ## holds one row per cluster mode. For every row of df, compute its
  ## distance to each mode and return the index of the nearest one.
  apply(
    apply(df, 1, function(obj) {
      apply(kmeansObj$modes, 1, distance, obj, NULL)
    }),
    2, which.min)
}

AssignCluster(mydf2, mymodel)
#  [1] 4 3 4 1 1 1 2 2 1 1 5 1 1 3 2 2 1 3 3 1 1 1 1 1 3 1 1 1 3 1 1 1 1 2 1 5 1 3 5 1 1 4 1 1 2 1 1 1 1 1

Please note that this will likely produce many rows that are equally far from several clusters; in those cases which.min simply picks the cluster with the lowest index.
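
For context, here is a minimal, self-contained sketch of how objects like mymodel and mydf2 could be produced with klaR::kmodes; the toy categorical data below is hypothetical, not from the original question:

library(klaR)

set.seed(1)
train <- data.frame(a = sample(letters[1:4], 200, replace = TRUE),
                    b = sample(letters[1:3], 200, replace = TRUE),
                    c = sample(letters[1:5], 200, replace = TRUE),
                    stringsAsFactors = FALSE)

mymodel <- kmodes(train, modes = 5)   # k-modes fit with 5 clusters

mydf2 <- data.frame(a = sample(letters[1:4], 50, replace = TRUE),
                    b = sample(letters[1:3], 50, replace = TRUE),
                    c = sample(letters[1:5], 50, replace = TRUE),
                    stringsAsFactors = FALSE)

AssignCluster(mydf2, mymodel)         # nearest mode for each new row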

R - Clustering (K-means) within groups

We can use the first grouping to split the data and apply kmeans to each subset separately. Make sure to use a sensible k for the inner call, because it depends on how the first grouping was created.

library(dplyr)
library(purrr)

df1 %>%
  group_split(group = kmeans(.[, c('start.x', 'start.y', 'end.x', 'end.y')],
                             4)$cluster) %>%
  map_df(~ .x %>%
           mutate(new_group = kmeans(.x[, c('start.x', 'start.y', 'end.x', 'end.y')],
                                     2)$cluster))

In base R, you could use by, which performs the split-apply-combine steps (note that unlist concatenates the per-group results, so this assumes the rows of df1 are ordered by group):

df1$new_group <- unlist(by(df1, df1$group, function(x)
  kmeans(x[, c('start.x', 'start.y', 'end.x', 'end.y')], 2)$cluster))
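
df1 itself is not shown in the original question. A minimal sketch with hypothetical coordinate columns (already ordered by group, as the base R version requires) on which either snippet can be run:

set.seed(1)
df1 <- data.frame(start.x = runif(40), start.y = runif(40),
                  end.x   = runif(40), end.y   = runif(40),
                  group   = rep(1:4, each = 10))   # pre-existing grouping

df1$new_group <- unlist(by(df1, df1$group, function(x)
  kmeans(x[, c('start.x', 'start.y', 'end.x', 'end.y')], 2)$cluster))

head(df1)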

