Text Clustering with Levenshtein Distances

Text clustering with Levenshtein distances

This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.

set.seed(1)
rstr <- function(n, k) {  # vector of n random char(k) strings
  sapply(1:n, function(i) {
    do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
  })
}

str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

# Levenshtein distance matrix
d <- adist(str)
rownames(d) <- str

hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
df <- data.frame(str, cutree(hc, k = 3))

[Dendrogram of the 30 strings, with the three clusters boxed by rect.hclust]

In this example, we create a set of 30 random char(5) strings that fall artificially into 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...) and run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster IDs to the original strings.

Clustering algorithm with Levenshtein distance and additional features/variables

If you want to combine the Levenshtein metric with something like the Euclidean distance, you can do so by combining the two distance matrices (they have the same shape) and passing the result to hclust.

# x is the vector of strings; df is assumed to hold extra numeric features per
# string (e.g. character counts in nchars, digit counts in n_nums)
stats <- cbind(df$nchars, df$n_nums)

# Euclidean distance on the numeric features
euc <- as.matrix(dist(stats))
rownames(euc) <- x

# Levenshtein distance on the strings themselves
lev <- adist(x)
rownames(lev) <- x

# rescale both matrices to [0, 1] so neither dominates the sum
scale01 <- function(x) {
  z <- (x - min(x))
  z / max(z)
}

combi <- scale01(euc) + scale01(lev)

hc.combi <- hclust(as.dist(combi))
plot(hc.combi)

Of course you can weight the two matrices however you like.
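For example, a purely illustrative weighting that counts the Levenshtein part twice as much as the Euclidean part (the weights are arbitrary and would be tuned to the data):

# arbitrary example weights, not prescribed by anything above
w.euc <- 1
w.lev <- 2
combi.w <- w.euc * scale01(euc) + w.lev * scale01(lev)

hc.w <- hclust(as.dist(combi.w))
plot(hc.w)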

If you want to combine k-means and hierarchical clustering, I know of one way to do that: perform hierarchical clustering on the matrix, cut it into k groups, calculate the mean of each group, and pass those means as the starting centroids for k-means.

hc2 <- hclust(dist(stats))
clusters <- cutree(hc2, k = 3)

# the group means become the starting centroids for k-means
centers <- aggregate(stats, list(clusters), mean)[, -1]

hkclust <- kmeans(stats, centers)
pairs(df, col = c(2:4)[hkclust$cluster])

If you want to combine k-means with Levenshtein, I'm afraid I don't know how to do that, as it doesn't make much sense to pass a distance matrix to k-means. Maybe k-medoids could work?
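As a minimal sketch of that idea, assuming the Levenshtein matrix d and the strings str from the first example: pam() from the cluster package is the classic k-medoids implementation, and it accepts a precomputed dissimilarity, so no coordinates are needed.

library(cluster)                    # provides pam(), the classic k-medoids

pam.fit <- pam(as.dist(d), k = 3)   # k-medoids directly on the Levenshtein distances
data.frame(str, cluster = pam.fit$clustering)
pam.fit$medoids                     # labels of the medoid (representative) strings

Unlike k-means, the cluster prototypes here are actual strings from the data, which is why a pure distance matrix is sufficient.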

Text data clustering with Python

sklearn actually shows an example of this using DBSCAN, just like Luke once answered here.

This is based on that example, using !pip install python-Levenshtein for the distance function.
But if you have pre-calculated all distances, you could swap in a custom metric, as shown further below.

from Levenshtein import distance

import numpy as np
from sklearn.cluster import dbscan

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)

dbscan(X, metric=lev_metric, eps=5, min_samples=2)

And if you have pre-calculated all distances, you could define pre_lev_metric(x, y) along the lines of

def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return DISTANCES[i, j]

An alternative answer is based on K-Medoids, using sklearn_extra.cluster.KMedoids. K-Medoids is not yet that well known, but it also only needs distances.

I had to install it like this:

!pip uninstall -y enum34
!pip install scikit-learn-extra

Then I was able to create clusters with:

from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)

The labels/centers are in

kmedoids.labels_
kmedoids.cluster_centers_

Clustering words with K-means


from nltk.metrics import distance
import scipy.spatial as spatial
import numpy as np
from scipy.cluster.vq import kmeans

# sample vocabulary list
words = ['test', 'text', 'best', 'fast', 'context', 'boost', 'faster', 'border']

# similarity matrix
# pairwise edit-distance matrix: row i holds the distances from words[i] to every word
word_vectors = np.array([
    [
        distance.edit_distance(w, _w)
        for _w in words
    ]
    for w in words
], dtype=float)

centroids, _ = kmeans(word_vectors, k_or_guess=3)

word_clusters = np.argmin([
    [spatial.distance.euclidean(wv, cv) for cv in centroids]
    for wv in word_vectors
], 1)

for k in range(centroids.shape[0]):
    print('k =', k)
    print([word for i, word in enumerate(words) if word_clusters[i] == k])

It results in:

k = 0
['faster', 'border']
k = 1
['test', 'text', 'best', 'fast', 'boost']
k = 2
['context']

Remarks:

  1. The original vocabulary works as a feature list: the vector of edit distances to all other words serves as the feature vector for any phrase or word.
  2. Each cluster is formed in this feature space. Consequently, the distance between two words is no longer their Levenshtein distance but their distance in this space, which is why we use other measures such as spatial.distance.euclidean.
  3. K-means produces centroids in this feature space; each word is assigned to the cluster whose centroid is closest to it (out of all centroids). np.argmin([...], 1) finds this assignment for each word.
  4. Other clustering algorithms can also be tested on the word-feature space. (List of some clustering algorithms in scikit-learn: https://scikit-learn.org/stable/modules/clustering.html)

