Can there be overlap in k-means clusters?
K-means computes k clusters by iterative averaging: each cluster center is the mean of the samples currently assigned to it. Each cluster is defined by its computed center and is therefore unique by definition.
Each sample is assigned to the cluster whose center is closest to it, which is also unique by definition. In this sense there is NO OVERLAP.
However, for a given distance d > 0, a sample may be within distance d of more than one cluster center. This is what you are seeing when you say the clusters overlap. Even so, the sample is assigned only to the closest cluster, not to all of them, so there is still no overlap.
NOTE: If a sample is at exactly the same closest distance from more than one cluster center, it can be assigned at random to any of those closest clusters. This changes nothing important in the algorithm or its results, since the cluster centers are recomputed after assignment.
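To make the distinction concrete, here is a minimal R sketch of the assignment step on made-up data. The centers, the samples and the radius d are hypothetical values chosen only for illustration; they do not come from a real clustering run.

```r
# Hypothetical toy data: two cluster centers and three samples in 2-D
centers <- rbind(c(0, 0), c(4, 0))
samples <- rbind(c(1, 0), c(2.5, 0), c(5, 1))

# Euclidean distance from every sample (rows) to every center (columns)
dists <- apply(centers, 1, function(ctr) sqrt(rowSums(sweep(samples, 2, ctr)^2)))

d <- 3.5                                   # arbitrary radius, an assumption
within_d <- dists <= d                     # TRUE where a sample lies within d of a center
n_close <- rowSums(within_d)               # how many centers are within d of each sample

assignment <- apply(dists, 1, which.min)   # each sample gets exactly one cluster label

print(n_close)
print(assignment)
```

In this sketch the first two samples lie within d of both centers, yet every sample still receives exactly one label. Note that which.min() returns the first minimum, so an exact tie here would go to the lower-numbered center rather than to a random one.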
Problems with cluster assignment after clustering
From my answers at Cross Validated:
It's because df-colMeans(df) doesn't do what you think.
Let's try the code:
> a=matrix(1:9,nrow=3)
> a
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> colMeans(a)
[1] 2 5 8
> a-colMeans(a)
     [,1] [,2] [,3]
[1,]   -1    2    5
[2,]   -3    0    3
[3,]   -5   -2    1
> apply(a,2,function(x) x-mean(x))
     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1
You'll find that a-colMeans(a) does a different thing than apply(a,2,function(x) x-mean(x)), which is what you'll want for centering.
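The difference comes from R's recycling rule: the vector colMeans(a) is recycled down the columns of a, element by element, instead of being matched to the columns. A small sketch, on the same matrix a, of what the subtraction actually computes and of two ways to subtract per column (sweep() and the double transpose are my suggestions here, not something from the original question):

```r
a <- matrix(1:9, nrow = 3)

# What a - colMeans(a) really computes: the means 2, 5, 8 are repeated
# in column-major order, so they run down each column
recycled <- a - rep(colMeans(a), length.out = length(a))

# Two ways to subtract each column's mean from that column
centered_sweep <- sweep(a, 2, colMeans(a))   # subtract along the column margin
centered_t     <- t(t(a) - colMeans(a))      # transpose so recycling lines up with columns

print(recycled)
print(centered_sweep)
```

Both centered_sweep and centered_t give the same result as the apply() call, while recycled reproduces the unexpected output of a-colMeans(a).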
You could write an apply to do the full autoscaling for you:
> apply(a,2,function(x) (x-mean(x))/sd(x))
     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1
> scale(a)
     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1
attr(,"scaled:center")
[1] 2 5 8
attr(,"scaled:scale")
[1] 1 1 1
But there's no point in writing that apply, since scale will do it for you. :)
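With a=matrix(1:9,nrow=3) every column happens to have standard deviation 1, so the autoscaled output looks identical to the merely centered one. A small sketch with a hypothetical matrix b, whose columns have different spreads, shows that the apply version and scale() still agree once the division by sd actually matters:

```r
# Hypothetical matrix whose columns have standard deviations 1, 10 and 5
b <- cbind(c(1, 2, 3), c(10, 20, 30), c(0, 5, 10))

by_apply <- apply(b, 2, function(x) (x - mean(x)) / sd(x))  # manual autoscaling
by_scale <- scale(b)                                        # built-in autoscaling

# scale() attaches the centers and scales as attributes, so ignore those when comparing
same <- isTRUE(all.equal(by_apply, by_scale, check.attributes = FALSE))
print(same)
print(attr(by_scale, "scaled:scale"))
```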
Moreover, to try out the clustering:
set.seed(16)
nc=10
nr=10000
# Make sure you draw enough samples: There was extreme periodicity in your sampling
df1 = matrix(sample(1:50, size=nr*nc, replace=TRUE), ncol=nc, nrow=nr)
head(df1, n=4)
for_clst_km = scale(df1) #standardization with z-scores
nclust = 4 #number of clusters
Clusters <- kmeans(for_clst_km, nclust)
# For extracting scaled values: They are already available in for_clst_km
cluster3_sc=for_clst_km[Clusters$cluster==3,]
# Simplify code by putting distance in function
distFun=function(mat,centre) apply(mat, 1, function(x) sqrt(sum((x-centre)^2)))
centroids=Clusters$centers
dists=matrix(nrow=nrow(cluster3_sc),ncol=nclust) # Allocate matrix
for(d in 1:nclust) dists[,d]=distFun(cluster3_sc,centroids[d,]) # Calculate observation distances to centroid d=1..nclust
whichMins=apply(dists,1,which.min) # Calculate the closest centroid per observation
table(whichMins) # Tabularize
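> table(whichMins)
whichMins
   3 
2532 
So all 2532 observations in cluster 3 are closest to centroid 3, i.e. the assignment is consistent.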
HTH HAND,
Carl
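The same check can be run over all clusters at once instead of only cluster 3. Below is a minimal sketch on hypothetical, well-separated toy data (the data, the seed and the variable names are assumptions for illustration). For a converged k-means run, every observation should sit in the cluster of its nearest centroid:

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical toy data: two well-separated groups of 50 points in 2-D
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
z <- scale(x)                        # cluster and measure distances on the same scaled data
km <- kmeans(z, centers = 2, nstart = 5)

# Squared Euclidean distance from every observation (rows) to every centroid (columns)
d2 <- sapply(seq_len(nrow(km$centers)), function(k)
  rowSums(sweep(z, 2, km$centers[k, ])^2))

nearest <- apply(d2, 1, which.min)   # index of the closest centroid per observation

# Cross-tabulate: all counts should fall on the diagonal
print(table(assigned = km$cluster, nearest = nearest))
```

If the table shows off-diagonal counts, the distances were probably computed on different data than the clustering used, for example unscaled values against centroids of scaled data.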