How to Show Corpus Text in the R tm Package

How to show corpus text in R tm package?

You can try converting your corpus text into a data frame and then accessing the required text from the data frame itself. I have used the built-in sample data set "crude" (from the tm package) as an example.

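library(tm)

data("crude")
# pull the "content" element out of every document and flatten it into one text column
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)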

dataframe[1,]
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"

R tm package select huge amount of words to keep in text corpus

I looked at your requirements, and a combination of tm and quanteda may help. See below.

Once you have a list of frequent words you can use quanteda in parallel to get the bigrams.
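The quanteda code in this answer uses `frequent_words` without showing how it was built. A minimal sketch of one way to get such a list with tm is given here. The threshold of 10 and the `wordLengths` setting are assumptions for illustration, not values taken from the original answer.

```r
library(tm)

data("crude")

# Term-document matrix; wordLengths is lowered from tm's default of c(3, Inf)
# so that short words such as "to" and "of" are kept (assumed setting).
tdm <- TermDocumentMatrix(crude, control = list(wordLengths = c(1, Inf)))

# All terms that occur at least 10 times across the corpus (assumed threshold).
frequent_words <- findFreqTerms(tdm, lowfreq = 10)

head(frequent_words)
```

The result is a plain character vector, which is the form `tokens_keep()` accepts as a pattern.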

library(quanteda)
library(quanteda.textstats) # needed for textstat_frequency() in quanteda >= 3.0

# set number of threads
quanteda_options(threads = 4)

my_corp <- corpus(crude) # corpus from tm can be used here (txt_corpus)
my_toks <- tokens(my_corp, remove_punct = TRUE) # add extra removal if needed

# Use list of frequent words from tm.
# speed gain should occur here
my_toks <- tokens_keep(my_toks, frequent_words)

# ngrams, concatenator is _ by default
bitoks <- tokens_ngrams(my_toks)

textstat_frequency(dfm(bitoks)) # ordered from high to low

   feature frequency rank docfreq group
1    to_to        41    1      12   all
2    to_of        35    2      15   all
3   oil_to        33    3      17   all
4    to_in        32    4      12   all
5    of_to        29    5      14   all
6    in_to        28    6      11   all
7    in_of        21    7       8   all
8   to_oil        21    7      13   all
9    of_in        21    7      10   all
10  of_oil        20   10      14   all
11   of_of        20   10       8   all
12  in_oil        19   12      10   all
13  oil_in        18   13      11   all
14  oil_of        18   13      11   all
15   in_in        14   15       9   all
16 oil_oil        13   16      10   all

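quanteda does have a topfeatures function, but it doesn't work like tm's findFreqTerms: topfeatures returns the n most frequent features, while findFreqTerms returns every term within a frequency range. Otherwise you could do it completely in quanteda.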
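For readers who would still like to stay inside quanteda, a threshold-style selection can be approximated with `dfm_trim()`. This is only a sketch under assumptions (the threshold of 10 is arbitrary, and tokenization differs from tm, so the resulting list will not match `findFreqTerms` exactly), not the method used in this answer.

```r
library(quanteda)
library(tm) # loaded only for the "crude" sample corpus

data("crude")

# Document-feature matrix built from the tm corpus
my_dfm <- dfm(tokens(corpus(crude), remove_punct = TRUE))

# Keep only features that occur at least 10 times overall (assumed threshold)
frequent_dfm <- dfm_trim(my_dfm, min_termfreq = 10)

# Character vector of the surviving features
frequent_words <- featnames(frequent_dfm)
```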

If the dfm generation is taking too much memory, you can use as.character to transform the token object and use this either in dplyr or data.table. See code below.

library(dplyr)
out_dp <- tibble(features = as.character(bitoks)) %>%
  group_by(features) %>%
  tally()

library(data.table)
out_dt <- data.table(features = as.character(bitoks))
out_dt <- out_dt[, .N, by = features]

Corpus object missing text

I'll be answering my own question with this:

writeLines(as.character(abstract[[1]]))
content(abstract[[1]])

But I still don't know how to get the full column as output.
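One possible way to get the text of every document at once is to apply `as.character()` to each element of the corpus. The sketch below uses tm's "crude" sample data as a stand-in, since the `abstract` corpus from the question is not available; the names `all_text` and `text_df` are made up for illustration.

```r
library(tm)

data("crude") # stand-in for the `abstract` corpus from the question

# One string per document; paste() collapses documents stored as several lines
all_text <- vapply(
  crude,
  function(doc) paste(as.character(doc), collapse = "\n"),
  character(1)
)

# The same text as a single data frame column
text_df <- data.frame(text = all_text, stringsAsFactors = FALSE)

writeLines(all_text[1])
```

Replacing `crude` with `abstract` should give the full column for the original corpus, assuming its documents are plain text documents.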


