How to Know Which Cluster New Data Belongs to After Finishing Cluster Analysis

Can there be overlap in k-means clusters?

K-means computes k clusters by iterative averaging: each cluster center is the mean of the samples currently assigned to it. Each cluster is defined by its computed center and is thus unique by definition.

Each sample is assigned to the cluster whose center is closest to it, which is also unique by definition. In this sense there is NO OVERLAP.

However, for a given distance d > 0, a sample may lie within distance d of more than one cluster center. This is what you are seeing when you describe overlap. The sample is still assigned only to the closest cluster, not to all of them, so there is no overlap.

NOTE: If a sample is exactly equidistant from two or more closest cluster centers, it can be assigned to any one of them at random. This changes nothing important in the algorithm or its results, since the cluster centers are recomputed after assignment.
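To make the point concrete, here is a minimal sketch in R. The two centers, the sample and the radius d are made-up illustrative values, not taken from any real clustering. It shows a sample that sits within d of both centers yet still gets exactly one label.

```r
# Two hypothetical cluster centers in 2-D (illustrative values only)
centers <- rbind(c(0, 0), c(4, 0))

# A sample lying between them
x <- c(1.5, 0)

# Euclidean distance from the sample to each center
dists <- sqrt(rowSums(sweep(centers, 2, x)^2))

d <- 3
within_d <- dists <= d        # TRUE for both centers: the apparent "overlap"
assigned <- which.min(dists)  # yet exactly one cluster is chosen

# A sample exactly halfway between the centers: which.min() returns the
# first minimum, so this sketch breaks the tie deterministically rather
# than at random
tie <- c(2, 0)
tie_dists <- sqrt(rowSums(sweep(centers, 2, tie)^2))
tie_assigned <- which.min(tie_dists)
```

Picking the first minimum instead of a random one is just a convenience of which.min(). As noted above, the choice makes no practical difference.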

Problems with cluster assignment after clustering

From my answers at Cross Validated:


It's because df-colMeans(df) doesn't do what you think.

Let's try the code:

a=matrix(1:9,nrow=3)
a

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

colMeans(a)

[1] 2 5 8

a-colMeans(a)

     [,1] [,2] [,3]
[1,]   -1    2    5
[2,]   -3    0    3
[3,]   -5   -2    1

apply(a,2,function(x) x-mean(x))

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1

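You'll find that a-colMeans(a) does a different thing than apply(a,2,function(x) x-mean(x)), which is what you'll want for centering. R stores matrices column by column and recycles the vector of column means in that order, so the means are subtracted going down each column (one mean per row) instead of one mean per column.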

You could write an apply to do the full autoscaling for you:

apply(a,2,function(x) (x-mean(x))/sd(x))

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1

scale(a)

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1
attr(,"scaled:center")
[1] 2 5 8
attr(,"scaled:scale")
[1] 1 1 1

But there's no point in doing that apply, since scale will do it for you. :)
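If you only want the centering step and would rather spell it out, a minimal sketch with base R's sweep() (or a double transpose) gives the same result as the apply() call. The variable names here are only for illustration.

```r
a <- matrix(1:9, nrow = 3)

# Subtract each column's mean from that column (MARGIN = 2 means "by column";
# the default FUN of sweep() is "-")
centered_sweep <- sweep(a, 2, colMeans(a))

# Same result via a double transpose: after transposing, R's column-wise
# recycling lines each mean up with its own original column
centered_t <- t(t(a) - colMeans(a))

# Both agree with the apply() version
centered_apply <- apply(a, 2, function(x) x - mean(x))
all.equal(centered_sweep, centered_apply)
all.equal(centered_t, centered_apply)
```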


Moreover, to try out the clustering:

set.seed(16)
nc=10
nr=10000
# Make sure you draw enough samples: There was extreme periodicity in your sampling
df1 = matrix(sample(1:50, size=nr*nc, replace=TRUE), ncol=nc, nrow=nr)
head(df1, n=4)

for_clst_km = scale(df1) #standardization with z-scores
nclust = 4 #number of clusters
Clusters <- kmeans(for_clst_km, nclust)

# For extracting scaled values: They are already available in for_clst_km
cluster3_sc=for_clst_km[Clusters$cluster==3,]

# Simplify code by putting distance in function
distFun=function(mat,centre) apply(mat, 1, function(x) sqrt(sum((x-centre)^2)))

centroids=Clusters$centers
dists=matrix(nrow=nrow(cluster3_sc),ncol=nclust) # Allocate matrix
for(d in 1:nclust) dists[,d]=distFun(cluster3_sc,centroids[d,]) # Calculate observation distances to centroid d=1..nclust

whichMins=apply(dists,1,which.min) # Calculate the closest centroid per observation
table(whichMins) # Tabularize

> table(whichMins)
whichMins
   3
2532
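The same nearest-centroid idea can be applied to observations that were not part of the clustering. Below is a rough sketch, not a definitive implementation: assign_cluster() is a hypothetical helper written for this example (it is not part of any package), and the data are random draws like the ones above, with fewer rows. The important assumption is that new data must be scaled with the training means and standard deviations, which scale() stores in the "scaled:center" and "scaled:scale" attributes.

```r
set.seed(16)
train <- matrix(sample(1:50, size = 1000 * 10, replace = TRUE), ncol = 10)
train_sc <- scale(train)
fit <- kmeans(train_sc, centers = 4, iter.max = 50)

# Hypothetical helper: assign each row of the matrix `newdata` to the
# nearest centroid of a fitted kmeans object
assign_cluster <- function(newdata, scaled_train, fit) {
  # Reuse the training means and standard deviations, not those of newdata
  new_sc <- scale(newdata,
                  center = attr(scaled_train, "scaled:center"),
                  scale  = attr(scaled_train, "scaled:scale"))
  # Squared Euclidean distance from every new row to every centroid
  d2 <- sapply(seq_len(nrow(fit$centers)), function(k)
    rowSums(sweep(new_sc, 2, fit$centers[k, ])^2))
  d2 <- matrix(d2, nrow = nrow(new_sc))  # keep matrix shape for a single row
  apply(d2, 1, which.min)
}

# Five new observations on the original (unscaled) measurement scale
new_obs <- matrix(sample(1:50, size = 5 * 10, replace = TRUE), ncol = 10)
assign_cluster(new_obs, train_sc, fit)

# Fraction of training rows whose nearest centroid matches the label kmeans
# gave them; expected to be 1 once kmeans has converged
mean(assign_cluster(train, train_sc, fit) == fit$cluster)
```

Squared distances are enough here, since the square root does not change which centroid is closest.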

HTH HAND,

Carl


