Simple Approach to Assigning Clusters for New Data After K-Means Clustering

Simple approach to assigning clusters for new data after k-means clustering

You could use the flexclust package, which provides a predict method for its k-means implementation (kcca):

library("flexclust")
data("Nclus")

set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)

dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE

cl1 <- kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91

pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])

image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")

[Figure: flexclust image plot of the cluster regions, with training points drawn as small coloured dots and test points as orange squares]

There are also conversion methods for turning the results of clustering functions such as stats::kmeans or cluster::pam into objects of class kcca, and vice versa:

as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
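
Here cl is an existing cluster fit (for instance from stats::kmeans) and x is the data it was fitted on; neither is shown above. A minimal sketch, assuming a two-column toy matrix, could look like this:

library("flexclust")

set.seed(1)
x  <- matrix(rnorm(200), ncol = 2)   # toy data standing in for the x above
cl <- kmeans(x, 2)                   # plain stats::kmeans fit

cl2  <- as.kcca(cl, data = x)        # convert to a kcca object
newx <- matrix(rnorm(20), ncol = 2)  # toy new observations
predict(cl2, newdata = newx)         # cluster assignments for the new rows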

Simple approach to assigning clusters for new data after k-modes clustering

We can use the distance measure that is used in the kmodes algorithm to assign each new row to its nearest cluster.

## From klaR::kmodes

distance <- function(mode, obj, weights) {
  ## Simple matching distance: without weights, count the attributes in
  ## which the observation differs from the cluster mode.
  if (is.null(weights))
    return(sum(mode != obj))
  obj <- as.character(obj)
  mode <- as.character(mode)
  different <- which(mode != obj)
  n_mode <- n_obj <- numeric(length(different))
  for (i in seq(along = different)) {
    weight <- weights[[different[i]]]
    names <- names(weight)
    n_mode[i] <- weight[which(names == mode[different[i]])]
    n_obj[i] <- weight[which(names == obj[different[i]])]
  }
  dist <- sum((n_mode + n_obj)/(n_mode * n_obj))
  return(dist)
}

AssignCluster <- function(df, kmeansObj) {
  ## kmeansObj is the object returned by klaR::kmodes; its $modes component
  ## holds one row per cluster mode. For every row of df, compute its
  ## distance to each mode and return the index of the nearest one.
  apply(
    apply(df, 1, function(obj) {
      apply(kmeansObj$modes, 1, distance, obj, NULL)
    }),
    2, which.min)
}

AssignCluster(mydf2, mymodel)
#  [1] 4 3 4 1 1 1 2 2 1 1 5 1 1 3 2 2 1 3 3 1 1 1 1 1 3 1 1 1 3 1 1 1 1 2 1 5 1 3 5 1 1 4 1 1 2 1 1 1 1 1

Please note that this will likely produce many rows that are equally far from several clusters; in those cases which.min simply picks the cluster with the lowest index.
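
For context, here is a minimal, self-contained sketch of how objects like mymodel and mydf2 could be produced with klaR::kmodes; the toy categorical data below is hypothetical, not from the original question:

library(klaR)

set.seed(1)
train <- data.frame(a = sample(letters[1:4], 200, replace = TRUE),
                    b = sample(letters[1:3], 200, replace = TRUE),
                    c = sample(letters[1:5], 200, replace = TRUE),
                    stringsAsFactors = FALSE)

mymodel <- kmodes(train, modes = 5)   # k-modes fit with 5 clusters

mydf2 <- data.frame(a = sample(letters[1:4], 50, replace = TRUE),
                    b = sample(letters[1:3], 50, replace = TRUE),
                    c = sample(letters[1:5], 50, replace = TRUE),
                    stringsAsFactors = FALSE)

AssignCluster(mydf2, mymodel)         # nearest mode for each new row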

R - Clustering (K-means) within groups

We can use the first grouping to split the data and apply kmeans to each subset separately. Make sure to use a sensible k for the inner call, because it depends on how the first grouping was created.

library(dplyr)
library(purrr)

df1 %>%
  group_split(group = kmeans(.[, c('start.x', 'start.y', 'end.x', 'end.y')],
                             4)$cluster) %>%
  map_df(~ .x %>%
           mutate(new_group = kmeans(.x[, c('start.x', 'start.y', 'end.x', 'end.y')],
                                     2)$cluster))

In base R, you could use by, which performs the split-apply-combine steps (note that unlist concatenates the per-group results, so this assumes the rows of df1 are ordered by group):

df1$new_group <- unlist(by(df1, df1$group, function(x)
  kmeans(x[, c('start.x', 'start.y', 'end.x', 'end.y')], 2)$cluster))
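
df1 itself is not shown in the original question. A minimal sketch with hypothetical coordinate columns (already ordered by group, as the base R version requires) on which either snippet can be run:

set.seed(1)
df1 <- data.frame(start.x = runif(40), start.y = runif(40),
                  end.x   = runif(40), end.y   = runif(40),
                  group   = rep(1:4, each = 10))   # pre-existing grouping

df1$new_group <- unlist(by(df1, df1$group, function(x)
  kmeans(x[, c('start.x', 'start.y', 'end.x', 'end.y')], 2)$cluster))

head(df1)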

