Assign New Data Point to Cluster in Kernel K-Means (Kernlab Package in R)

Simple approach to assigning clusters for new data after k-means clustering

You could use the flexclust package, which provides a predict method for its k-means (kcca) objects:

library("flexclust")
data("Nclus")

set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)

dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE

cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
# kcca object of family ‘kmeans’
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
#   1   2   3   4
# 130 181  98  91

pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])

image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")

[Plot: image(cl1) cluster regions, with training points coloured by predicted cluster and test points drawn as orange squares]

There are also methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca, and vice versa:
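For instance (a minimal sketch: cl and x aren't defined in the snippet, so these stand-ins are assumed; the cluster sizes printed below come from the original run, not from this toy data):

x <- matrix(rnorm(200), ncol=2)  # stand-in data: 100 rows, 2 columns
cl <- kmeans(x, centers=2)       # a stats::kmeans result to convert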

as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50

Problems with cluster assignment after clustering

From my answers on Cross Validated:


It's because df - colMeans(df) doesn't do what you think.

Let's try the code:

a=matrix(1:9,nrow=3)
a

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

colMeans(a)

[1] 2 5 8

a-colMeans(a)

     [,1] [,2] [,3]
[1,]   -1    2    5
[2,]   -3    0    3
[3,]   -5   -2    1

apply(a,2,function(x) x-mean(x))

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1

You'll find that a-colMeans(a) does something different from apply(a,2,function(x) x-mean(x)): R recycles the vector colMeans(a) down the columns rather than subtracting it from each row, whereas the apply call is what you want for centering.

You could write an apply to do the full autoscaling for you:

apply(a,2,function(x) (x-mean(x))/sd(x))

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1

scale(a)

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1
attr(,"scaled:center")
[1] 2 5 8
attr(,"scaled:scale")
[1] 1 1 1

But there's no point in doing that apply, since scale will do it for you. :)
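One practical footnote for the cluster-assignment theme (my addition, not part of the original answer): scale() stores the training center and scale as attributes, so you can apply the exact same transformation to new observations before computing their distances to the centroids:

sc <- scale(a)
new_obs <- matrix(c(2, 4, 9), nrow=1)  # hypothetical new observation
scale(new_obs,
      center = attr(sc, "scaled:center"),
      scale  = attr(sc, "scaled:scale"))  # gives 0 -1 1 (plus scaling attributes)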


Moreover, to try out the clustering:

set.seed(16)
nc=10
nr=10000
# Make sure you draw enough samples: There was extreme periodicity in your sampling
df1 = matrix(sample(1:50, size=nr*nc, replace=TRUE), ncol=nc, nrow=nr)
head(df1, n=4)

for_clst_km = scale(df1) #standardization with z-scores
nclust = 4 #number of clusters
Clusters <- kmeans(for_clst_km, nclust)

# For extracting scaled values: They are already available in for_clst_km
cluster3_sc=for_clst_km[Clusters$cluster==3,]

# Simplify code by putting distance in function
distFun=function(mat,centre) apply(mat, 1, function(x) sqrt(sum((x-centre)^2)))

centroids=Clusters$centers
dists=matrix(nrow=nrow(cluster3_sc),ncol=nclust) # Allocate matrix
for(d in 1:nclust) dists[,d]=distFun(cluster3_sc,centroids[d,]) # Calculate observation distances to centroid d=1..nclust

whichMins=apply(dists,1,which.min) # Calculate the closest centroid per observation
table(whichMins) # Tabularize

whichMins
   3
2532

HTH HAND,

Carl

Error in if (sum(abs(dc)) < 1e-15) break : missing value where TRUE/FALSE needed: Kernel K-Means kernlab

This appears to be an issue with something randomly-generated internally by the function during your kkmeans() call. I don't have an answer for "why" this is happening and you'll likely have to check with the authors to determine if it's a bug or intended behavior.

While I reproduced your error with your data and code (running a fresh instance of R every time), the exact same function call also sometimes produces other errors, and sometimes produces no error at all. Whichever happens is entirely reproducible once you set.seed(), suggesting it has something to do with starting values that determine other parameters of the model.

Below I show (a) that this can produce an alternative error (I actually saw a third, but didn't save the seed to reproduce it), (b) that even when it does "converge," it produces pretty different clusters purely on the basis of the random seed, and (c) that the hyperparameter tuning is heavily influenced by the random seed. I forgot to save the seed for the run where I was able to get some clustering results with 10 clusters.

I don't have an answer for why this happens: my hunch is that the automatically-generated settings are nonsensical/out of bounds in some cases and this is producing an error. This may be because your data are in some way strange or may be because the algorithm for setting the hyperparameter(s) doesn't make much sense. It could also be a bug, so perhaps worth posting as an issue.

In any case, a question to ask yourself is whether you want to rely on a method that is this inconsistent at producing results at all, produces pretty different results across random seeds, and leaves you unsure whether the algorithm is actually doing what it claims when it does run.
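If the culprit really is the automatic sigma estimation, one workaround to try (a sketch under that assumption, with stand-in data; not from the original question and not a confirmed fix) is supplying the kernel hyperparameter yourself via kpar instead of leaving it on "automatic":

library(kernlab)
set.seed(123)
x <- as.matrix(iris[, 1:4])  # stand-in data
# An explicit kpar bypasses kkmeans()'s automatic sigma estimation for
# the RBF kernel, which is the seed-dependent step suspected above.
kc <- kkmeans(x, centers=3, kernel="rbfdot", kpar=list(sigma=0.2))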

Example 1: clusters=5, no error, set.seed(123)

set.seed(123)
# (the same kkmeans() call on the asker's data follows here; not shown)
#> Hyperparameter : sigma = 0.463522505156128
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 16.53045 -21.18700 187.8918
#> [2,] 17.16138 -24.59687 184.7860
#> [3,] 15.73436 -17.87491 191.2586
#> [4,] 15.63425 -16.63862 192.0088
#> [5,] 16.19467 -20.16442 189.1617
#>
#> Cluster size:
#> [1] 11 8 11 8 12
#>
#> Within-cluster sum of squares:
#> [1] 537972.8 386310.2 544994.1 391965.9 604386.9

Example 2: clusters=5, no error, set.seed(3)

Works, but pretty different numbers of observations per cluster! Note the different hyperparameter.

#>  Hyperparameter : sigma =  0.290281708176631 
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 15.97636 -18.38464 190.5449
#> [2,] 16.24809 -20.10409 188.9572
#> [3,] 15.63660 -17.85633 191.5151
#> [4,] 17.06100 -22.70840 185.8834
#> [5,] 17.16138 -24.59687 184.7860
#>
#> Cluster size:
#> [1] 11 11 15 5 8
#>
#> Within-cluster sum of squares:
#> [1] 545547.7 538434.5 757947.0 236986.8 386310.2

Example 3: clusters=5, no error, set.seed(999)

Works, but pretty different numbers of observations per cluster! Note the different hyperparameter again!


#> Gaussian Radial Basis kernel function.
#> Hyperparameter : sigma = 0.128189488632645
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 16.93157 -22.25171 186.4579
#> [2,] 15.45090 -15.99500 192.8452
#> [3,] 15.73677 -18.32277 191.0152
#> [4,] 17.16244 -24.44533 184.8376
#> [5,] 16.32218 -20.69291 188.5965
#>
#> Cluster size:
#> [1] 7 10 13 9 11
#>
#> Within-cluster sum of squares:
#> [1] 294630.1 457490.3 604486.8 441669.5 539478.6

Example 4: clusters = 10, new error, set.seed(99)

New error.

#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'affinMult' for signature '"rbfkernel", "numeric"'

Example 5: clusters = 10, new error, set.seed(3)

Original error.

#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed

Not included: an additional error with clusters = 10 (not finding all of the columns in the matrix), and a run that did successfully produce some clusters with clusters = 10.

k-means clustered data: how to label newly incoming data

You don't need the SVM approach; the first way is more convenient. If you are using scikit-learn, see the KMeans documentation at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html for an example: its predict method will do the job.
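The same idea in base R, since the rest of this page is R (a sketch I'm adding, not part of the original answer; predict_kmeans is a made-up helper name): label each new observation with the nearest center of an existing stats::kmeans fit.

set.seed(1)
train <- matrix(rnorm(200), ncol=2)
fit <- kmeans(train, centers=3)

# Made-up helper: Euclidean distance from every new row to every center,
# then return the index of the closest center per row.
predict_kmeans <- function(fit, newdata) {
  d <- apply(fit$centers, 1, function(centre)
    sqrt(rowSums(sweep(newdata, 2, centre)^2)))
  max.col(-d, ties.method="first")
}

newpts <- matrix(rnorm(20), ncol=2)
predict_kmeans(fit, newpts)  # cluster label for each of the 10 new points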

assign cluster labels to data using a cluster assignment matrix

The second column of the assignment matrix is already binary, so we can simply add 1L to it:
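For instance, with a hypothetical two-column 0/1 assignment matrix (the name amat and its values are inferred from the output below, not given in the question):

amat <- cbind(c(1,1,0,0,1,1,1,0,0,0),
              c(0,0,1,1,0,0,0,1,1,1))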

clust <- amat[,2] + 1L
clust
[1] 1 1 2 2 1 1 1 2 2 2

(The suffix L marks the literal as an integer, so clust stays integer-valued.)

How to define Kmeans cluster of the new data

Use the FilteredClusterer, and choose KMeans in its configuration dialog.

Here is the documentation shown by the "More" button for this clusterer:

NAME weka.clusterers.FilteredClusterer

SYNOPSIS Class for running an arbitrary clusterer on data that has been passed through an arbitrary filter. Like the clusterer, the structure of the filter is based exclusively on the training data, and test instances will be processed by the filter without changing their structure.


