Simple approach to assigning clusters for new data after k-means clustering
You could use the flexclust package, which implements a predict method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:
# cl is a kmeans (or pam) result fitted on data x, e.g. cl <- kmeans(x, 2)
as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
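If you would rather avoid the extra dependency, new observations can be assigned to the nearest kmeans centroid in base R as well. This is a minimal sketch with made-up data; train, test, and km are hypothetical names, not part of the original answer:

```r
set.seed(1)
train <- matrix(rnorm(200), ncol = 2)  # 100 training points
test  <- matrix(rnorm(20),  ncol = 2)  # 10 new points
km <- kmeans(train, centers = 4)

# For each new point, pick the centroid with the smallest squared
# Euclidean distance; t(km$centers) - p recycles p down each column,
# so column j holds km$centers[j, ] - p
pred <- apply(test, 1, function(p) which.min(colSums((t(km$centers) - p)^2)))
pred
```

This is exactly what predict() on a kcca object does for the kmeans family, so it is a reasonable fallback when flexclust is not available.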
Problems with cluster assignment after clustering
From my answers at Cross Validated:
It's because df - colMeans(df) doesn't do what you think. Let's try the code:
a = matrix(1:9, nrow=3)
a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
colMeans(a)
[1] 2 5 8
a-colMeans(a)
[,1] [,2] [,3]
[1,] -1 2 5
[2,] -3 0 3
[3,] -5 -2 1
apply(a,2,function(x) x-mean(x))
[,1] [,2] [,3]
[1,] -1 -1 -1
[2,] 0 0 0
[3,] 1 1 1
You'll find that a - colMeans(a) does a different thing than apply(a, 2, function(x) x - mean(x)), which is what you want for centering: R recycles colMeans(a) down the columns rather than across the rows, so each column has the wrong values subtracted.
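Base R's sweep() does the correct column-wise centering without writing an anonymous function:

```r
a <- matrix(1:9, nrow = 3)
# MARGIN = 2 sweeps the vector of column means out of each column
sweep(a, 2, colMeans(a))
#      [,1] [,2] [,3]
# [1,]   -1   -1   -1
# [2,]    0    0    0
# [3,]    1    1    1
```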
You could write an apply
to do the full autoscaling for you:
apply(a,2,function(x) (x-mean(x))/sd(x))
[,1] [,2] [,3]
[1,] -1 -1 -1
[2,] 0 0 0
[3,] 1 1 1
scale(a)
[,1] [,2] [,3]
[1,] -1 -1 -1
[2,] 0 0 0
[3,] 1 1 1
attr(,"scaled:center")
[1] 2 5 8
attr(,"scaled:scale")
[1] 1 1 1
But there's no point in doing that apply, since scale
will do it for you. :)
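This matters for the new-data case too: new observations must be standardized with the training center and scale, which scale() stores as attributes on its result. A sketch with made-up train/test matrices:

```r
train <- matrix(1:12, ncol = 3)
test  <- matrix(c(2, 5, 9), nrow = 1)

train_sc <- scale(train)
# Reuse the training parameters rather than re-scaling the test set on its own
test_sc <- scale(test,
                 center = attr(train_sc, "scaled:center"),
                 scale  = attr(train_sc, "scaled:scale"))
```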
Moreover, to try out the clustering:
set.seed(16)
nc=10
nr=10000
# Make sure you draw enough samples: There was extreme periodicity in your sampling
df1 = matrix(sample(1:50, size=nr*nc,replace = T), ncol=nc, nrow=nr)
head(df1, n=4)
for_clst_km = scale(df1) #standardization with z-scores
nclust = 4 #number of clusters
Clusters <- kmeans(for_clst_km, nclust)
# For extracting scaled values: They are already available in for_clst_km
cluster3_sc=for_clst_km[Clusters$cluster==3,]
# Simplify code by putting distance in function
distFun=function(mat,centre) apply(mat, 1, function(x) sqrt(sum((x-centre)^2)))
centroids=Clusters$centers
dists=matrix(nrow=nrow(cluster3_sc),ncol=nclust) # Allocate matrix
for(d in 1:nclust) dists[,d]=distFun(cluster3_sc,centroids[d,]) # Calculate observation distances to centroid d=1..nclust
whichMins=apply(dists,1,which.min) # Calculate the closest centroid per observation
table(whichMins) # Tabulate the closest centroids
whichMins
   3
2532
As expected, every observation taken from cluster 3 is closest to centroid 3.
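For what it's worth, the for loop over centroids can be collapsed into a single sapply call that builds the whole distance matrix at once. This sketch uses a smaller synthetic matrix and hypothetical names (mat, km), but the same distFun as above:

```r
set.seed(16)
mat <- scale(matrix(sample(1:50, 200, replace = TRUE), ncol = 4))  # 50 x 4
km  <- kmeans(mat, 3)

distFun <- function(mat, centre) apply(mat, 1, function(x) sqrt(sum((x - centre)^2)))

# One column of distances per centroid, one row per observation
dists <- sapply(1:3, function(d) distFun(mat, km$centers[d, ]))
head(apply(dists, 1, which.min))
```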
HTH HAND,
Carl
Error in if (sum(abs(dc)) < 1e-15) break : missing value where TRUE/FALSE needed: Kernel K-Means kernlab
This appears to be an issue with something randomly generated internally during your kkmeans() call. I don't have an answer for "why" this is happening, and you'll likely have to check with the authors to determine whether it's a bug or intended behavior.
While I reproduced your error with your data and code (running a fresh instance of R every time), the exact same function call sometimes produces other errors and sometimes doesn't produce an error at all. However, whether it does so is entirely reproducible when you set.seed(), suggesting it has something to do with starting values that determine other parameters of the model.
Below I show (a) that this can produce an alternative error (actually, I saw a third but didn't save the seed to reproduce it), (b) that even when it does "converge," it is producing pretty different clusters just on the basis of the random seed, and (c) the hyperparameter tuning is heavily influenced by the random number seed. I forgot to save the seed for the run where I was able to get some clustering results with 10 clusters.
I don't have an answer for why this happens: my hunch is that the automatically-generated settings are nonsensical/out of bounds in some cases and this is producing an error. This may be because your data are in some way strange or may be because the algorithm for setting the hyperparameter(s) doesn't make much sense. It could also be a bug, so perhaps worth posting as an issue.
In any case, a question to ask yourself is whether you want to use something where the behavior is this inconsistent at producing results, produces pretty different results across random seeds, and you don't know if the algorithm is actually doing what it says when it does, etc.
Example 1: clusters=5, no error, set.seed(123)
set.seed(123)
#> Hyperparameter : sigma = 0.463522505156128
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 16.53045 -21.18700 187.8918
#> [2,] 17.16138 -24.59687 184.7860
#> [3,] 15.73436 -17.87491 191.2586
#> [4,] 15.63425 -16.63862 192.0088
#> [5,] 16.19467 -20.16442 189.1617
#>
#> Cluster size:
#> [1] 11 8 11 8 12
#>
#> Within-cluster sum of squares:
#> [1] 537972.8 386310.2 544994.1 391965.9 604386.9
Example 2: clusters=5, no error, set.seed(3)
Works, but pretty different numbers of observations per cluster! Note the different hyperparameter.
#> Hyperparameter : sigma = 0.290281708176631
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 15.97636 -18.38464 190.5449
#> [2,] 16.24809 -20.10409 188.9572
#> [3,] 15.63660 -17.85633 191.5151
#> [4,] 17.06100 -22.70840 185.8834
#> [5,] 17.16138 -24.59687 184.7860
#>
#> Cluster size:
#> [1] 11 11 15 5 8
#>
#> Within-cluster sum of squares:
#> [1] 545547.7 538434.5 757947.0 236986.8 386310.2
Example 3: clusters=5, no error, set.seed(999)
Works, but pretty different numbers of observations per cluster! Note the different hyperparameter again!
#> Gaussian Radial Basis kernel function.
#> Hyperparameter : sigma = 0.128189488632645
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 16.93157 -22.25171 186.4579
#> [2,] 15.45090 -15.99500 192.8452
#> [3,] 15.73677 -18.32277 191.0152
#> [4,] 17.16244 -24.44533 184.8376
#> [5,] 16.32218 -20.69291 188.5965
#>
#> Cluster size:
#> [1] 7 10 13 9 11
#>
#> Within-cluster sum of squares:
#> [1] 294630.1 457490.3 604486.8 441669.5 539478.6
Example 4: clusters = 10, new error, set.seed(99)
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'affinMult' for signature '"rbfkernel", "numeric"'
Example 5: clusters = 10, original error, set.seed(3)
#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed
Not included: additional error with clusters = 10 (not finding all of the columns in the matrix) and successfully getting some clusters with clusters = 10.
k-means clustered data: how to label newly incoming data
You don't need the SVM approach; the first way is more convenient. If you are using sklearn, there is an example at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html: the predict function will do the job.
assign cluster labels to data using a cluster assignment matrix
The vector is already binary. We can add 1L to the second column:
clust <- amat[,2] + 1L
[1] 1 1 2 2 1 1 1 2 2 2
(The suffix L coerces the value to integer.)
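If the assignment matrix has more than two columns (one indicator column per cluster), base R's max.col() generalizes this by returning the index of the 1 in each row. A sketch with a hypothetical one-hot matrix:

```r
# 5 observations, 3 clusters, one indicator column per cluster
amat <- diag(3)[c(1, 2, 2, 3, 1), ]
clust <- max.col(amat)
clust
# [1] 1 2 2 3 1
```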
How to define Kmeans cluster of the new data
Use the FilteredClusterer, and then choose KMeans in the Configuration Dialog of the FilteredClusterer.
Here is some text from the "More" button that shows some documentation about this clusterer:
NAME weka.clusterers.FilteredClusterer
SYNOPSIS Class for running an arbitrary clusterer on data that has
been passed through an arbitrary filter. Like the clusterer, the
structure of the filter is based exclusively on the training data
and test instances will be processed by the filter without changing
their structure.