LDA with topicmodels: How to See Which Topics Different Documents Belong To

LDA with topicmodels, how can I see which topics different documents belong to?

Here is one approach using the built-in AssociatedPress dataset. It shows, for each document, the topic with the highest posterior probability.

library(topicmodels)
data("AssociatedPress", package = "topicmodels")

k <- 5 # set number of topics
# generate model
lda <- LDA(AssociatedPress[1:20,], k = k, control = list(alpha = 0.1))
# now we have a topic model with 20 docs and five topics

# make a data frame with topics as cols, docs as rows and
# cell values as posterior topic distribution for each document
gammaDF <- as.data.frame(lda@gamma)
names(gammaDF) <- c(1:k)
# inspect...
gammaDF
1 2 3 4 5
1 8.979807e-05 8.979807e-05 9.996408e-01 8.979807e-05 8.979807e-05
2 8.714836e-05 8.714836e-05 8.714836e-05 8.714836e-05 9.996514e-01
3 9.261396e-05 9.996295e-01 9.261396e-05 9.261396e-05 9.261396e-05
4 9.995437e-01 1.140774e-04 1.140774e-04 1.140774e-04 1.140774e-04
5 3.573528e-04 3.573528e-04 9.985706e-01 3.573528e-04 3.573528e-04
6 5.610659e-05 5.610659e-05 5.610659e-05 5.610659e-05 9.997756e-01
7 9.994345e-01 1.413820e-04 1.413820e-04 1.413820e-04 1.413820e-04
8 4.286702e-04 4.286702e-04 4.286702e-04 9.982853e-01 4.286702e-04
9 3.319338e-03 3.319338e-03 9.867226e-01 3.319338e-03 3.319338e-03
10 2.034781e-04 2.034781e-04 9.991861e-01 2.034781e-04 2.034781e-04
11 4.810342e-04 9.980759e-01 4.810342e-04 4.810342e-04 4.810342e-04
12 2.651256e-04 9.989395e-01 2.651256e-04 2.651256e-04 2.651256e-04
13 1.430945e-04 1.430945e-04 1.430945e-04 9.994276e-01 1.430945e-04
14 8.402940e-04 8.402940e-04 8.402940e-04 9.966388e-01 8.402940e-04
15 8.404830e-05 9.996638e-01 8.404830e-05 8.404830e-05 8.404830e-05
16 1.903630e-04 9.992385e-01 1.903630e-04 1.903630e-04 1.903630e-04
17 1.297372e-04 1.297372e-04 9.994811e-01 1.297372e-04 1.297372e-04
18 6.906241e-05 6.906241e-05 6.906241e-05 9.997238e-01 6.906241e-05
19 1.242780e-04 1.242780e-04 1.242780e-04 1.242780e-04 9.995029e-01
20 9.997361e-01 6.597684e-05 6.597684e-05 6.597684e-05 6.597684e-05


# Now for each doc, find just the top-ranked topic
# which.max() returns a single index even when two topics tie
toptopics <- as.data.frame(cbind(document = row.names(gammaDF),
                                 topic = apply(gammaDF, 1, function(x) names(gammaDF)[which.max(x)])))
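# inspect... (the listing below does not match gammaDF above, apparently
# because it came from a separate run; LDA fits vary unless a seed is fixed)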
toptopics
document topic
1 1 2
2 2 5
3 3 1
4 4 4
5 5 4
6 6 5
7 7 2
8 8 4
9 9 1
10 10 2
11 11 3
12 12 1
13 13 1
14 14 2
15 15 1
16 16 4
17 17 4
18 18 3
19 19 4
20 20 3

Is that what you want to do?

Hat-tip to this answer: https://stat.ethz.ch/pipermail/r-help/2010-August/247706.html
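
For readers who think in Python, the "top-ranked topic per document" step is just an arg-max over each row of the document-topic matrix. The sketch below is plain Python rather than topicmodels code; the three rows are copied from the first three documents of the gammaDF listing above, and the names gamma and top_topic are hypothetical.

```python
# Rows copied from documents 1-3 of the gammaDF listing above;
# each row holds one document's posterior probabilities for topics 1..5.
gamma = {
    1: [8.979807e-05, 8.979807e-05, 9.996408e-01, 8.979807e-05, 8.979807e-05],
    2: [8.714836e-05, 8.714836e-05, 8.714836e-05, 8.714836e-05, 9.996514e-01],
    3: [9.261396e-05, 9.996295e-01, 9.261396e-05, 9.261396e-05, 9.261396e-05],
}

def top_topic(probs):
    """Return the 1-based index of the largest probability (the first one on ties)."""
    return max(range(len(probs)), key=probs.__getitem__) + 1

top_topics = {doc: top_topic(probs) for doc, probs in gamma.items()}
print(top_topics)  # {1: 3, 2: 5, 3: 2}
```

Taking only the arg-max discards how confident the assignment is, so it can be worth keeping the winning probability alongside the topic label.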

Topic distribution: How do we see which documents belong to which topic after doing LDA in Python

Using the topic probabilities, you can set a threshold and use it as a clustering baseline, but I am sure there are better ways to cluster than this 'hacky' method.

from gensim import corpora, models, similarities
from itertools import chain

""" DEMO """
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

# Create the dictionary.
id2word = corpora.Dictionary(texts)
# Create the bag-of-words corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3,
                               update_every=1, chunksize=10000, passes=1)

# Print the topics.
for top in lda.print_topics():
    print(top)
print("")

# Assign topic probabilities to the documents in the corpus
lda_corpus = lda[mm]

# Set the threshold to 1/#clusters. As a sanity check, the mean of all
# document-topic probabilities should come out at about the same value:
scores = [score for doc in lda_corpus for topic_id, score in doc]
threshold = sum(scores)/len(scores)
print(threshold)
print("")

# dict(i).get(n, 0) avoids an IndexError if a low-probability topic is omitted for a document
cluster1 = [j for i,j in zip(lda_corpus,documents) if dict(i).get(0, 0) > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if dict(i).get(1, 0) > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if dict(i).get(2, 0) > threshold]

print(cluster1)
print(cluster2)
print(cluster3)

[out]:

0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

0.333333333333

['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']

Just to make it clearer:

# Set the threshold to 1/#clusters. As a sanity check, the mean of all
# document-topic probabilities should come out at about the same value:
scores = []
for doc in lda_corpus:
    # each doc is a list of (topic_id, score) pairs
    for topic_id, score in doc:
        scores.append(score)
threshold = sum(scores)/len(scores)

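The code above sums the topic probabilities over all documents, then divides that sum by the number of scores.
Because each document's topic probabilities add up to roughly 1, the mean comes out close to 1/(number of topics), which is 0.333 for three topics.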
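
To make the thresholding step concrete without training a model, here is a minimal sketch in plain Python. The document-topic lists are invented for illustration and only mimic the shape of gensim's output (a list of (topic_id, probability) pairs per document); the names doc_topics, docs and cluster are hypothetical and not part of gensim.

```python
# Hypothetical document-topic output, shaped like what lda[corpus] yields:
# one list of (topic_id, probability) pairs per document.
# These numbers are made up, not produced by a real model.
doc_topics = [
    [(0, 0.80), (1, 0.10), (2, 0.10)],
    [(0, 0.05), (1, 0.90), (2, 0.05)],
    [(0, 0.45), (2, 0.55)],  # topic 1 is absent here, as if it had been dropped
]
docs = ["doc A", "doc B", "doc C"]
num_topics = 3

# Threshold of 1/#topics, as in the answer above
threshold = 1.0 / num_topics

def cluster(topic_id):
    """Return the documents whose probability for topic_id exceeds the threshold."""
    return [name for name, pairs in zip(docs, doc_topics)
            if dict(pairs).get(topic_id, 0.0) > threshold]

clusters = {t: cluster(t) for t in range(num_topics)}
print(clusters)  # {0: ['doc A', 'doc C'], 1: ['doc B'], 2: ['doc C']}
```

Note that a document can land in more than one cluster ("doc C" here), so this is a soft assignment rather than a partition.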

How to map topic to a document after topic modeling is done with LDA?

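Using quanteda, you can achieve this as follows (dfmat_news is a quanteda document-feature matrix, as in the tutorial linked below):

library(quanteda)
library(topicmodels)
dtm <- convert(dfmat_news, to = "topicmodels")
lda <- LDA(dtm, k = 10)  # 10 topics in this case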

Then you can obtain the most likely topics using the command topics() and save them as a document-level variable.

docvars(dfmat_news, 'topic') <- topics(lda)
head(topics(lda), 20)

Here is the tutorial: https://tutorials.quanteda.io/machine-learning/topicmodel/

Hope it is clear and useful :)

LDA with topicmodels package for R, how do I get the topic probability for each term?

You can use posterior(lda)$terms to get the posterior probability of each term within each topic (a topics-by-terms matrix). posterior(lda)$topics gives the topic probabilities for each document.

Example adapted from help(LDA):

library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], k = 2)
terms <- posterior(lda)$terms

## posterior probability for the first 5 terms (alphabetically)
terms[,1:5]
aaron abandon abandoned abandoning abbott
1 3.720076e-44 3.720076e-44 3.720076e-44 3.720076e-44 3.720076e-44
2 3.720076e-44 3.720076e-44 3.720076e-44 3.720076e-44 3.720076e-44
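
A topics-by-terms matrix like this is usually read row by row: each row is one topic's probability distribution over the vocabulary, and the highest values are that topic's most characteristic terms. The sketch below shows the ranking step in plain Python; the vocabulary and probabilities are invented for illustration, and vocab, topic_terms and top_terms are hypothetical names, not part of topicmodels.

```python
# Hypothetical topics-by-terms matrix (rows: topics, columns: terms).
# The numbers are made up so that each row sums to 1, like a distribution.
vocab = ["court", "police", "market", "stock", "rain"]
topic_terms = [
    [0.40, 0.35, 0.10, 0.10, 0.05],
    [0.05, 0.05, 0.45, 0.40, 0.05],
]

def top_terms(row, n=2):
    """Return the n terms with the highest probability in one topic's row."""
    ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]

for topic_id, row in enumerate(topic_terms, start=1):
    print(topic_id, top_terms(row))
# 1 ['court', 'police']
# 2 ['market', 'stock']
```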

