Text Mining in R | Memory Management


@Vineet, here is the math that shows why R tried to allocate 603 GB to convert the document-term matrix to a non-sparse (dense) matrix. Each numeric cell in an R matrix consumes 8 bytes. Based on the size of the document-term matrix in the question, the math works out as follows:

> #
> # calculate memory consumed by the dense matrix
> #
> rows <- 472029
> cols <- 171548
> # memory in gigabytes
> rows * cols * 8 / (1024 * 1024 * 1024)
[1] 603.3155
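You can sanity-check the 8-bytes-per-cell rule yourself with base R's object.size() on a small dense matrix (a sketch; the few hundred extra bytes are the object's fixed header):

```r
# a 1000 x 1000 numeric (double) matrix: expect ~8 bytes per cell
m <- matrix(0, nrow = 1000, ncol = 1000)
bytesPerCell <- as.numeric(object.size(m)) / (1000 * 1000)
bytesPerCell  # just over 8, due to the small fixed header
```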

If you want to calculate the word frequencies, you're better off generating 1-grams and then summarizing them into a frequency distribution.
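As a package-free illustration of that aggregation (a sketch using made-up text, not the poster's data): split each document into words, then tabulate with table().

```r
# minimal base-R version: tokenize on whitespace, then count
text <- c("the cat sat on the mat", "the cat ran")
words <- unlist(strsplit(tolower(text), "\\s+"))
wordFreq <- sort(table(words), decreasing = TRUE)
head(wordFreq)  # "the" appears 3 times, "cat" twice
```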

With the quanteda package, the code would look like this (tokens() is the current tokenizer; older releases called it tokenize()):

words <- tokens(...)
ngram1 <- unlist(tokens_ngrams(words, n = 1))
ngram1freq <- data.frame(table(ngram1))

regards,

Len

2017-11-24 UPDATE: Here is a complete example using the quanteda package that generates the frequency distribution from a document-feature matrix with the textstat_frequency() function, along with a barplot() of the top 20 features.

This approach does not require the generation & aggregation of n-grams into a frequency distribution.

library(quanteda)
myCorpus <- corpus(data_char_ukimmig2010)
system.time(theDFM <- dfm(myCorpus, tolower = TRUE,
                          remove = c(stopwords(), ",", ".", "-", "\"", "'",
                                     "(", ")", ";", ":")))
system.time(textFreq <- textstat_frequency(theDFM))

hist(textFreq$frequency,
     main = "Frequency Distribution of Words: UK 2010 Election Manifestos")

top20 <- textFreq[1:20, ]
barplot(height = top20$frequency,
        names.arg = top20$feature,
        horiz = FALSE,
        las = 2,
        main = "Top 20 Words: UK 2010 Election Manifestos")

...and the resulting barplot of the top 20 features:

[barplot image]

Text mining help in R

As I note in my answer to Text Mining in R | Memory Management, each cell in an R matrix consumes 8 bytes, so the size of a dense matrix in bytes is (number of documents) * (number of terms) * 8. When you convert a sparse document-term matrix to a full matrix, all of the empty cells in the sparse DTM suddenly consume RAM.

Based on the data you've provided in the question, there are approximately 165,459 terms in the DTM you're trying to convert to a matrix.

> sizeInGb <- 466.8
> docs <- 378661
> # calculate number of terms in DTM
> sizeInGb * (1024 * 1024 * 1024) / (docs * 8)
[1] 165458.9

Depending on the type of analysis you want to do, you have two options: use tools from the text-mining package that operate directly on (sparse) document-term matrices, or aggregate the DTM into an object that is smaller than the amount of RAM on your machine (minus the RAM consumed by the objects used to create it).
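To illustrate the first option, here is a sketch with a toy sparse matrix built with the Matrix package (which ships with standard R installations); a tm DocumentTermMatrix can be summed the same way with slam::col_sums():

```r
library(Matrix)

# toy sparse term-count matrix: 3 documents x 3 terms
dtm <- sparseMatrix(i = c(1, 1, 2, 3), j = c(1, 2, 2, 3),
                    x = c(2, 1, 3, 5), dims = c(3, 3),
                    dimnames = list(NULL, c("apple", "banana", "cherry")))

# term totals computed without ever densifying the matrix
termTotals <- Matrix::colSums(dtm)
termTotals  # apple 2, banana 4, cherry 5
```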

Text Mining PDFs - Convert List of Character Vectors (Strings) to Dataframe

This should do the trick:

# dummy data generation: file names and a list of strings (your corpus)
files <- paste("file", 1:6)

strings <- list("a", "b", "c", "d", "e", "f")
names(strings) <- files
t(as.data.frame(unlist(strings)))

#                 file 1 file 2 file 3 file 4 file 5 file 6
# unlist(strings) "a"    "b"    "c"    "d"    "e"    "f"

Edit, based on the updated data structure in the question:

files <- paste("file", 1:6)

strings <- list(c("a", "b"), c("c", "d"), c("e", "f"),
                c("g", "h"), c("i", "j"), c("k", "l"))

names(strings) <- files
t(data.frame(Doc = sapply(strings, paste0, collapse = " ")))

#     file 1 file 2 file 3 file 4 file 5 file 6
# Doc "a b"  "c d"  "e f"  "g h"  "i j"  "k l"

word2vec for text mining categories

Not word2vec, but as an alternative, have a look at this approach based on string distances with the RecordLinkage package:

library(XML)
library(dplyr)
library(RecordLinkage)

df <- data.frame(words = capture.output(
  htmlParse("https://stackoverflow.com/questions/35904182/word2vec-for-text-mining-categories")[["//div/pre/code/text()"]]))

df %>%
  compare.dedup(strcmp = TRUE) %>%
  epiWeights() %>%
  epiClassify(0.8) %>%
  getPairs(show = "links", single.rows = TRUE) -> matches

left_join(mutate(df, ID = 1:nrow(df)),
          select(matches, id1, id2) %>% arrange(id1) %>% filter(!duplicated(id2)),
          by = c("ID" = "id2")) %>%
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1) -> dfnew

head(dfnew, 30)
# words ID
# 1 .NET 1
# 2 ABAP 2
# 3 Access 3
# 4 Account Management 4 # <--
# 5 Accounting 4 # <--
# 6 Active Directory 6
# 7 Agile Methodologies 7 # <--
# 8 Agile Project Management 7 # <--
# 9 AJAX 9
# 10 Algorithms 10
# 11 Analysis 11
# 12 Android 12 # <--
# 13 Android Development 12 # <--
# 14 AngularJS 14
# 15 Ant 15
# 16 Apache 16
# 17 ASP 17 # <--
# 18 ASP.NET 17 # <--
# 19 B2B 19
# 20 Banking 20
# 21 BPMN 21
# 22 Budgets 22
# 23 Business Analysis 23 # <--
# 24 Business Development 23 # <--
# 25 Business Intelligence 23 # <--
# 26 Business Planning 23 # <--
# 27 Business Process 23 # <--
# 28 Business Process Design 23 # <--
# 29 Business Process... 23 # <--
# 30 Business Strategy 23 # <--

dfnew$ID can serve as your abstract category, based on Jaro-Winkler string distances. It may need some fine-tuning for your real data, though.
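To see why similar labels collapse onto one ID, here is a package-free sketch of the underlying idea using base R's adist() (generalized edit distance, normalized here by the longer string's length; RecordLinkage itself uses Jaro-Winkler weights, so the numbers differ but the ordering is similar):

```r
skills <- c("Android", "Android Development", "Banking")

# pairwise edit distances, normalized to [0, 1] by the longer string's length
d <- adist(skills) / outer(nchar(skills), nchar(skills), pmax)
dimnames(d) <- list(skills, skills)
round(d, 2)
# "Android" is far closer to "Android Development" than to "Banking"
```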


