Text clustering with Levenshtein distances
This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)
rstr <- function(n,k){ # vector of n random char(k) strings
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=3)
df <- data.frame(str,cutree(hc,k=3))
In this example, we create a set of 30 random char(5) strings artificially in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...)
, and we run heirarchal clustering using hclust(...)
. Then we cut the dendrogram into three clusters with cutree(...)
and append the cluster id's to the original strings.
Cluster algorithm with Levenshtein distance and additional features/variables
If you want a method for combining the metrics of Levenshtein and something like the Euclidean distance, you can do it by combining the distance matrices, as they are of the same shape, and send it to hclust.
stats <- cbind(df$nchars, df$n_nums)
euc <- as.matrix(dist(stats))
rownames(euc) <- x
lev <- adist(x)
rownames(lev) <- x
scale01 <- function(x) {
z <- (x - min(x))
z / max(z)
}
combi <- scale01(euc) + scale01(lev)
hc.combi <- hclust(as.dist(combi))
plot(hc.combi)
Of course you can weight the two matrices however you like.
If you want to combine k-means and hierarchical clustering I know of one way to do that. Essentially you perform hierarchical clustering on a matrix, divide it up into k groups, calculate the mean of each group and pass those means as the starting centroids for the k-means.
hc2 <- hclust(dist(stats))
clusters <- cutree(hc2, k=3)
centers <- aggregate(stats, list(clusters), mean)[, -1]
hkclust <- kmeans(stats, centers)
pairs(df, col=c(2:4)[hkclust$cluster])
If you want to combine k-means with Levenshtein, I'm afraid I don't know how to do that, as it doesn't make much sense to pass a distance matrix to k-means. Maybe k-medoids could work?
Text data clustering with python
sklearn actually does show this example using DBSCAN, just like Luke once answered here.
This is based on that example, using !pip install python-Levenshtein
.
But if you have pre-calculated all distances, you could change the custom metric, as shown below.
from Levenshtein import distance
import numpy as np
from sklearn.cluster import dbscan
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def z:
i, j = int(x[0]), int(y[0]) # extract indices
return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
And if you pre-calculated you could define pre_lev_metric(x, y)
along the lines of
def pre_lev_metric(x, y):
i, j = int(x[0]), int(y[0]) # extract indices
return DISTANCES[i,j]
Alternative answer based on K-Medoids using sklearn_extra.cluster.KMedoids. K-Medoids is not yet that well known, but only needs distance as well.
I had to install like this
!pip uninstall -y enum34
!pip install scikit-learn-extra
Than I was able to create clusters with;
from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def lev_metric(x, y):
i, j = int(x[0]), int(y[0]) # extract indices
return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)
The labels/centers are in
kmedoids.labels_
kmedoids.cluster_centers_
Clustering words with Kmeans
from nltk.metrics import distance
import scipy.spatial as spatial
import numpy as np
from scipy.cluster.vq import kmeans
# sample vocabulary list
words = ['test', 'text', 'best', 'fast', 'context', 'boost', 'faster', 'border']
# similarity matrix
word_vectors = np.array([
[
distance.edit_distance(w, _w)
for _w in words
]
for w in words
], dtype=np.float)
centroids, _ = kmeans(word_vectors, k_or_guess=3)
word_clusters = np.argmin([
[spatial.distance.euclidean(wv, cv) for cv in centroids]
for wv in word_vectors
], 1)
for k in range(centroids.shape[0]):
print('k =', k)
print([word for i, word in enumerate(words) if word_clusters[i] == k])
It results as:
k = 0
['faster', 'border']
k = 1
['test', 'text', 'best', 'fast', 'boost']
k = 2
['context']
Remarks:
- Original vocabulary works as a feature list. The list of distance measures to other words works as a feature vector to any phrase or word.
- Each cluster is made in such feature space. Consequently, the distance between two words is not their Levenshtein Distance anymore but it is their distance in such space. This is why we use other measures such as
spatial.distance.euclidean
. - Kmean produces centroids in this feature space, each word is considered as a member to a cluster if the cluster centroid is the closest to the word (out of all other centroids).
np.argmin([...], 1)
is finding such assignment for each word. - Other clustering algorithms can be also tested on word-feature space. (List of some clustering algorithms in scikit-learn: https://scikit-learn.org/stable/modules/clustering.html)
Related Topics
Ggplot2 Shade Area Under Density Curve by Group
Specifying Column Names in a Data.Frame Changes Spaces to "."
How to Get Ranks with No Gaps When There Are Ties Among Values
Cumulative Sum Until Maximum Reached, Then Repeat from Zero in the Next Row
Select First Element of Nested List
How to Reorder Data.Table Columns (Without Copying)
Create Empty Data Frame with Column Names by Assigning a String Vector
R: += (Plus Equals) and ++ (Plus Plus) Equivalent from C++/C#/Java, etc.
How to Use Functions in One R Package Masked by Another Package
Using Row-Wise Column Indices in a Vector to Extract Values from Data Frame
Extract Elements Common in All Column Groups
Forward and Backward Fill Data Frame in R
Showing String in Formula and Not as Variable in Lm Fit