Predicting Lda Topics for New Data

Predicting LDA topics for new data

With the help of Ben's superior document reading skills, I believe this is possible using the posterior() function.

library(topicmodels)
data(AssociatedPress)

train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]

train.lda <- LDA(train,5)
(train.topics <- topics(train.lda))
#  [1] 4 5 5 1 2 3 1 2 1 2 1 3 2 3 3 2 2 5 3 4 5 3 1 2 3 1 4 4 2 5 3 2 4 5 1 5 4 3 1 3 4 3 2 1 4 2 4 3 1 2 4 3 1 1 4 4 5
# [58] 3 5 3 3 5 3 2 3 4 4 3 4 5 1 2 3 4 3 5 5 3 1 2 5 5 3 1 4 2 3 1 3 2 5 4 5 5 1 1 1 4 4 3

test.topics <- posterior(train.lda,test)
(test.topics <- apply(test.topics$topics, 1, which.max))
#  [1] 3 5 5 5 2 4 5 4 2 2 3 1 3 3 2 4 3 1 5 3 5 3 1 2 2 3 4 1 2 2 4 4 3 3 5 5 5 2 2 5 2 3 2 3 3 5 5 1 2 2

How to use gensim topic modeling to predict new document?

import pandas as pd 
train=pd.DataFrame({'text':['find the most representative document for each topic',
                        'topic distribution across documents',
                        'to help with understanding the topic',
                        'one of the practical application of topic modeling is to determine']})
text=pd.DataFrame({'text':['how to find the optimal number of topics for topic modeling']})

def sent_to_words(sentences):
for sentence in sentences:
    yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

#using your train data to train the model with 4 topics

data_words = list(sent_to_words(train['text']))
id2word = corpora.Dictionary(data_words)
corpus = [id2word.doc2bow(text) for text in data_words]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                        id2word=id2word,
                                        num_topics=4)

#  predicting new text which is in text dataframe  
new_text_corpus =  id2word.doc2bow(text['text'][0].split())
lda[new_text_corpus]

#op

Out[75]:
[(0, 0.5517368), (1, 0.38150477), (2, 0.032756805), (3, 0.03400166)]

How to predict the topic of a new query using a trained LDA model using gensim?

I have written a function in python that gives the possible topic for a new query:

def getTopicForQuery (question):
    temp = question.lower()
    for i in range(len(punctuation_string)):
        temp = temp.replace(punctuation_string[i], '')

    words = re.findall(r'\w+', temp, flags = re.UNICODE | re.LOCALE)

    important_words = []
    important_words = filter(lambda x: x not in stoplist, words)

    dictionary = corpora.Dictionary.load('questions.dict')

    ques_vec = []
    ques_vec = dictionary.doc2bow(important_words)

    topic_vec = []
    topic_vec = lda[ques_vec]

    word_count_array = numpy.empty((len(topic_vec), 2), dtype = numpy.object)
    for i in range(len(topic_vec)):
        word_count_array[i, 0] = topic_vec[i][0]
        word_count_array[i, 1] = topic_vec[i][1]

    idx = numpy.argsort(word_count_array[:, 1])
    idx = idx[::-1]
    word_count_array = word_count_array[idx]

    final = []
    final = lda.print_topic(word_count_array[0, 0], 1)

    question_topic = final.split('*') ## as format is like "probability * topic"

    return question_topic[1]

Before going through this do refer this link!

In the initial part of the code, the query is being pre-processed so that it can be stripped off stop words and unnecessary punctuations.

Then, the dictionary that was made by using our own database is loaded.

We, then, we convert the tokens of the new query to bag of words and then the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec] where lda is the trained model as explained in the link referred above.

The distribution is then sorted w.r.t the probabilities of the topics. The topic with the highest probability is then displayed by question_topic[1].

Carrying out an LDA and predict data

First split the data in train and test 70:30 like this:

library(MASS) 
library(gclus)
set.seed(123)
ind <- sample(2, nrow(wine),replace = TRUE, prob = c(0.7, 0.3))
training <- wine[ind==1,]
testing <- wine[ind==2,]

Next, you can use the function lda to perform a Linear discriminant analysis like this:

model1 <- lda(Class ~ Malic + Hue + Magnesium, training)
model2 <- lda(Class ~ Hue + Alcalinity + Phenols + Malic + Magnesium + Intensity + Nonflavanoid + Flavanoids, training)

At last you can predict on testset and check the results with a confusion matrix like this:

p1 <- predict(model1, testing)$class
tab <- table(Predicted = p1, Actual = testing$Class)
tab

Output:

         Actual
Predicted  1  2  3
        1 13  3  0
        2  5 14  0
        3  0  2 11

The accuracy is:

cat("Accuracy is:", sum(diag(tab))/sum(tab))

Accuracy is: 0.7916667

Predicting Lda Topics for New Data