Simple approach to assigning clusters for new data after k-means clustering
You could use the flexclust package, which implements a predict
method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:
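# cl is assumed to be a clustering of the data x (e.g. from stats::kmeans); neither is defined in this snippet
as.kcca(cl, data=x)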
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
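For k-means, predicting on new data amounts to assigning each new point to its nearest centroid. If you would rather not depend on flexclust, the following is a minimal base R sketch of that idea for a stats::kmeans fit. The helper predict_kmeans and the toy data are hypothetical and only for illustration; they are not part of stats or flexclust.

```r
# Sketch: nearest-centroid assignment for a stats::kmeans fit (base R only)
set.seed(1)
# toy data: two well-separated groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))
fit <- kmeans(x, centers = 2)

# hypothetical helper, not an existing library function
predict_kmeans <- function(fit, newdata) {
  newdata <- as.matrix(newdata)
  # squared Euclidean distance from every new row to every centroid
  d <- sapply(seq_len(nrow(fit$centers)), function(k) {
    rowSums(sweep(newdata, 2, fit$centers[k, ])^2)
  })
  d <- matrix(d, nrow = nrow(newdata))  # keep matrix shape for a single row
  # column index of the smallest distance in each row
  max.col(-d, ties.method = "first")
}

new_points <- rbind(c(0.1, -0.2), c(2.9, 3.1))
predict_kmeans(fit, new_points)
```

The returned integers index the rows of fit$centers, so they are directly comparable with fit$cluster.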
Simple approach to assigning clusters for new data after k-modes clustering
We can use the distance measure that is used in the kmodes algorithm to assign each new row to its nearest cluster.
## From klaR::kmodes
distance <- function(mode, obj, weights) {
  if (is.null(weights))
    return(sum(mode != obj))
  obj <- as.character(obj)
  mode <- as.character(mode)
  different <- which(mode != obj)
  n_mode <- n_obj <- numeric(length(different))
  for (i in seq(along = different)) {
    weight <- weights[[different[i]]]
    names <- names(weight)
    n_mode[i] <- weight[which(names == mode[different[i]])]
    n_obj[i] <- weight[which(names == obj[different[i]])]
  }
  dist <- sum((n_mode + n_obj)/(n_mode * n_obj))
  return(dist)
}
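# Assign each row of df to the cluster whose mode is nearest
# (kmodesObj is a fitted kmodes object; weights = NULL gives the simple matching distance)
AssignCluster <- function(df, kmodesObj) {
  # one column per row of df, one row per cluster mode
  dists <- apply(df, 1, function(obj) {
    apply(kmodesObj$modes, 1, distance, obj, NULL)
  })
  # index of the nearest mode for each row of df
  apply(dists, 2, which.min)
}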
AssignCluster(mydf2,mymodel)
# [1] 4 3 4 1 1 1 2 2 1 1 5 1 1 3 2 2 1 3 3 1 1 1 1 1 3 1 1 1 3 1 1 1 1 2 1 5 1 3 5 1 1 4 1 1 2 1 1 1 1 1
Please note that this will likely produce many rows that are equally far from several clusters. Because which.min returns the first minimum, such ties are always assigned to the cluster with the lowest number.
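To see how often such ties occur, you can keep the full distance matrix and flag the rows whose minimum is shared by more than one cluster. The following is a small self-contained sketch using the unweighted (simple matching) distance; the toy modes, the new rows, and the helper mismatch are made up for illustration and do not come from a fitted model.

```r
# Toy modes (one row per cluster) and new rows; all values are illustrative
modes <- data.frame(a = c("x", "y"), b = c("p", "q"), stringsAsFactors = FALSE)
newdf <- data.frame(a = c("x", "x", "y"), b = c("p", "q", "q"),
                    stringsAsFactors = FALSE)

# simple matching distance: number of attributes that differ
mismatch <- function(mode, obj) sum(mode != obj)

# rows = new observations, columns = clusters
d <- t(apply(newdf, 1, function(obj) apply(modes, 1, mismatch, obj)))

# first minimum wins, exactly as in AssignCluster
nearest <- apply(d, 1, which.min)
# TRUE where several clusters are equally close
tied <- apply(d, 1, function(r) sum(r == min(r)) > 1)

data.frame(nearest, tied)
```

Rows flagged as tied could then be inspected by hand or resolved with a different rule, such as random assignment among the tied clusters.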
R - Clustering (K-means) within groups
We can use the first group to split the data and apply kmeans
to each subset separately. Make sure to use a suitable number of clusters k,
though, because it depends on how the first group is created.
library(dplyr)
library(purrr)
df1 %>%
  group_split(group = kmeans(.[,c('start.x', 'start.y', 'end.x', 'end.y')],
                             4)$cluster) %>%
  map_df(~.x %>% mutate(new_group =
           kmeans(.x[,c('start.x', 'start.y', 'end.x', 'end.y')], 2)$cluster))
In base R, you could use by, which performs the split, apply and combine steps.
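# by() returns results in group order, so sort df1 by its existing group column first
df1 <- df1[order(df1$group), ]
df1$new_group <- unlist(by(df1, df1$group, function(x)
  kmeans(x[,c('start.x', 'start.y', 'end.x', 'end.y')], 2)$cluster))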
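As an alternative that leaves the rows of df1 in their original order, the per-group clustering can be sketched with ave() applied to row indices. The toy df1 below, including its group column, is an assumption modelled on the column names used above, not the original data.

```r
set.seed(42)
# toy data; the values and the group column are made up for illustration
df1 <- data.frame(start.x = rnorm(40), start.y = rnorm(40),
                  end.x = rnorm(40), end.y = rnorm(40),
                  group = sample(rep(1:4, each = 10)))
cols <- c("start.x", "start.y", "end.x", "end.y")

# ave() splits the row indices by group, runs kmeans on each subset,
# and writes the cluster labels back to the matching rows
df1$new_group <- ave(seq_len(nrow(df1)), df1$group, FUN = function(i) {
  kmeans(df1[i, cols], centers = 2)$cluster
})

table(df1$group, df1$new_group)
```

The labels 1 and 2 are only meaningful within a group: cluster 1 in one group is unrelated to cluster 1 in another.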