Predicting LDA topics for new data
With the help of Ben's superior document reading skills, I believe this is possible using the posterior() function.
library(topicmodels)
data(AssociatedPress)
train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]
train.lda <- LDA(train,5)
(train.topics <- topics(train.lda))
# [1] 4 5 5 1 2 3 1 2 1 2 1 3 2 3 3 2 2 5 3 4 5 3 1 2 3 1 4 4 2 5 3 2 4 5 1 5 4 3 1 3 4 3 2 1 4 2 4 3 1 2 4 3 1 1 4 4 5
# [58] 3 5 3 3 5 3 2 3 4 4 3 4 5 1 2 3 4 3 5 5 3 1 2 5 5 3 1 4 2 3 1 3 2 5 4 5 5 1 1 1 4 4 3
test.topics <- posterior(train.lda,test)
(test.topics <- apply(test.topics$topics, 1, which.max))
# [1] 3 5 5 5 2 4 5 4 2 2 3 1 3 3 2 4 3 1 5 3 5 3 1 2 2 3 4 1 2 2 4 4 3 3 5 5 5 2 2 5 2 3 2 3 3 5 5 1 2 2
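The apply(..., 1, which.max) step above picks, for each document, the topic with the largest posterior probability. As a rough illustration of that row-wise argmax, here is a minimal Python sketch; the probability matrix and the helper name most_likely_topics are invented for this example and are not output from posterior().

```python
# Hypothetical posterior topic probabilities: one row per document,
# one column per topic (values invented for illustration).
doc_topic_probs = [
    [0.10, 0.70, 0.20],
    [0.55, 0.25, 0.20],
    [0.05, 0.15, 0.80],
]

def most_likely_topics(rows):
    """Return the 1-based index of the largest value in each row,
    mirroring R's which.max (the first maximum wins on ties)."""
    return [row.index(max(row)) + 1 for row in rows]

predicted = most_likely_topics(doc_topic_probs)
print(predicted)
```

The indices are 1-based only to match the R output; idiomatic Python code would usually keep them 0-based.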
How to use gensim topic modeling to predict new document?
import pandas as pd
import gensim
from gensim import corpora

train = pd.DataFrame({'text': ['find the most representative document for each topic',
                               'topic distribution across documents',
                               'to help with understanding the topic',
                               'one of the practical application of topic modeling is to determine']})
text = pd.DataFrame({'text': ['how to find the optimal number of topics for topic modeling']})

def sent_to_words(sentences):
    # tokenize each sentence into lowercase words, removing accents
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# use the training data to train the model with 4 topics
data_words = list(sent_to_words(train['text']))
id2word = corpora.Dictionary(data_words)
corpus = [id2word.doc2bow(doc) for doc in data_words]
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4)

# predict the topic distribution of the new text held in the text dataframe
new_text_corpus = id2word.doc2bow(text['text'][0].split())
lda_model[new_text_corpus]
Output:
[(0, 0.5517368), (1, 0.38150477), (2, 0.032756805), (3, 0.03400166)]
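The model returns a list of (topic_id, probability) pairs, so a single predicted topic still has to be chosen from it. A minimal sketch of that last step, reusing the numbers printed above; the helper name dominant_topic is an assumption made for this example and is not part of gensim.

```python
# Output copied from the answer above: (topic_id, probability) pairs.
topic_dist = [(0, 0.5517368), (1, 0.38150477), (2, 0.032756805), (3, 0.03400166)]

def dominant_topic(dist):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(dist, key=lambda pair: pair[1])

topic_id, prob = dominant_topic(topic_dist)
print(topic_id, prob)  # topic 0 is the most probable for this output
```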
How to predict the topic of a new query using a trained LDA model using gensim?
I have written a function in Python that gives the most likely topic for a new query:
import re
import numpy
from gensim import corpora

# punctuation_string, stoplist and the trained model lda are assumed to be defined elsewhere
def getTopicForQuery(question):
    # lowercase the query and strip punctuation
    temp = question.lower()
    for ch in punctuation_string:
        temp = temp.replace(ch, '')
    # tokenize and drop stop words
    words = re.findall(r'\w+', temp, flags=re.UNICODE)
    important_words = [w for w in words if w not in stoplist]
    # load the training dictionary and convert the query to a bag of words
    dictionary = corpora.Dictionary.load('questions.dict')
    ques_vec = dictionary.doc2bow(important_words)
    # topic distribution of the query: a list of (topic_id, probability) pairs
    topic_vec = lda[ques_vec]
    word_count_array = numpy.empty((len(topic_vec), 2), dtype=object)
    for i in range(len(topic_vec)):
        word_count_array[i, 0] = topic_vec[i][0]
        word_count_array[i, 1] = topic_vec[i][1]
    # sort the topics by probability, highest first
    idx = numpy.argsort(word_count_array[:, 1])
    idx = idx[::-1]
    word_count_array = word_count_array[idx]
    # top word of the most probable topic
    final = lda.print_topic(word_count_array[0, 0], 1)
    question_topic = final.split('*')  # format is like "probability*word"
    return question_topic[1]
Before going through this, do refer to this link!
The first part of the code pre-processes the query, stripping stop words and unnecessary punctuation.
Then the dictionary that was built from our own database is loaded.
We then convert the tokens of the new query to a bag of words, and the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec], where lda is the trained model as explained in the link referred to above.
The distribution is then sorted by topic probability. The function returns question_topic[1], the top word of the most probable topic.
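The pre-processing step described above can be sketched on its own, without gensim. This is only an illustration: the stop-word set below is made up (the original stoplist and punctuation_string are not shown), and preprocess_query is a hypothetical helper name.

```python
import re
import string

# Hypothetical stop-word list; the original answer's stoplist is not shown.
STOPLIST = {"the", "a", "of", "to", "is", "how", "in"}

def preprocess_query(question, stoplist=STOPLIST):
    """Lowercase, strip ASCII punctuation, tokenize, and drop stop words."""
    text = question.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = re.findall(r"\w+", text)
    return [w for w in words if w not in stoplist]

tokens = preprocess_query("How to predict the topic of a new query?")
print(tokens)
```

The resulting token list is what would be passed to dictionary.doc2bow before querying the model.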
Carrying out an LDA and predict data
First, split the data into training and test sets (roughly 70:30) like this:
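library(MASS)   # provides lda()
library(gclus)  # provides the wine dataset
data(wine)
set.seed(123)
# assign each row to group 1 (training) or 2 (testing) with probability 0.7 / 0.3
ind <- sample(2, nrow(wine), replace = TRUE, prob = c(0.7, 0.3))
training <- wine[ind==1,]
testing <- wine[ind==2,]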
Next, you can use the function lda from MASS to perform a linear discriminant analysis (not latent Dirichlet allocation, despite the shared abbreviation) like this:
model1 <- lda(Class ~ Malic + Hue + Magnesium, training)
model2 <- lda(Class ~ Hue + Alcalinity + Phenols + Malic + Magnesium + Intensity + Nonflavanoid + Flavanoids, training)
Finally, you can predict on the test set and check the results with a confusion matrix like this:
p1 <- predict(model1, testing)$class
tab <- table(Predicted = p1, Actual = testing$Class)
tab
Output:
         Actual
Predicted  1  2  3
        1 13  3  0
        2  5 14  0
        3  0  2 11
The accuracy is:
cat("Accuracy is:", sum(diag(tab))/sum(tab))
Accuracy is: 0.7916667
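To make the arithmetic behind sum(diag(tab))/sum(tab) explicit, here is a small Python sketch that recomputes the accuracy from the confusion matrix printed above. The function name accuracy is chosen for this example only.

```python
# Confusion matrix from the output above: rows are predicted classes,
# columns are actual classes.
confusion = [
    [13, 3, 0],
    [5, 14, 0],
    [0, 2, 11],
]

def accuracy(matrix):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

acc = accuracy(confusion)
print(acc)  # 38 correct out of 48 test rows
```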