How to Show Corpus Text in R Tm Package

How to show corpus text in R tm package?

One option is to convert the corpus text into a data frame and read the text you need from there. The example below uses the sample corpus "crude" that ships with the tm package.

library(tm)
data("crude")
# Pull the "content" element of every document into a one-column data frame
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)

dataframe[1,]
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n    The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"

R tm package select huge amount of words to keep in text corpus

I looked at your requirements, and a combination of tm and quanteda may help. See below.
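The quanteda code further down expects a character vector called frequent_words. As a minimal sketch of how such a vector might be built with tm, assuming the default term-document matrix settings and an arbitrary threshold of 10 occurrences (both are illustrative choices, not values from the original answer):

```r
library(tm)

data("crude")

# Term-document matrix with tm's default settings (assumption)
tdm <- TermDocumentMatrix(crude)

# All terms that occur at least 10 times across the corpus;
# the threshold of 10 is an arbitrary value for illustration
frequent_words <- findFreqTerms(tdm, lowfreq = 10)

head(frequent_words)
```

With the defaults, tm drops terms shorter than three characters, so short words such as "to" or "of" would need a different wordLengths setting in the control list.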

Once you have a list of frequent words, you can use quanteda, which can run multi-threaded, to get the bigrams.

library(quanteda)
library(quanteda.textstats) # textstat_frequency() lives here as of quanteda 3.0

# set number of threads
quanteda_options(threads = 4)

my_corp <- corpus(crude) # a corpus from tm can be used here (txt_corpus)
my_toks <- tokens(my_corp, remove_punct = TRUE) # add extra removal if needed

# Use the list of frequent words from tm (frequent_words is a character vector).
# The speed gain should occur here.
my_toks <- tokens_keep(my_toks, frequent_words)

# ngrams; n = 2 and concatenator "_" are the defaults
bitoks <- tokens_ngrams(my_toks)

textstat_frequency(dfm(bitoks)) # ordered from high to low

   feature frequency rank docfreq group
1    to_to        41    1      12   all
2    to_of        35    2      15   all
3   oil_to        33    3      17   all
4    to_in        32    4      12   all
5    of_to        29    5      14   all
6    in_to        28    6      11   all
7    in_of        21    7       8   all
8   to_oil        21    7      13   all
9    of_in        21    7      10   all
10  of_oil        20   10      14   all
11   of_of        20   10       8   all
12  in_oil        19   12      10   all
13  oil_in        18   13      11   all
14  oil_of        18   13      11   all
15   in_in        14   15       9   all
16 oil_oil        13   16      10   all

quanteda does have a topfeatures function, but it doesn't work like tm's findFreqTerms: it returns the n most frequent features rather than all terms within a frequency range. Otherwise you could do it completely in quanteda.
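To make the difference concrete, here is a small sketch on a made-up three-document corpus (the texts and the threshold of 2 are invented for illustration). It also shows dfm_trim with min_termfreq as one possible approximation of a frequency threshold inside quanteda; this is a suggestion, not something the original answer used:

```r
library(quanteda)

# Toy texts, invented for this example
txt <- c(d1 = "oil prices fall as oil supply grows",
         d2 = "oil demand is weak",
         d3 = "prices rise")

dfmat <- dfm(tokens(txt))

# topfeatures() returns the n most frequent features with their counts
top2 <- topfeatures(dfmat, n = 2)

# dfm_trim() keeps features with a total count of at least min_termfreq;
# featnames() then gives a character vector of the surviving terms
frequent <- featnames(dfm_trim(dfmat, min_termfreq = 2))
```

Here frequent holds the terms that appear at least twice in total, which is close in spirit to findFreqTerms with lowfreq = 2.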

If generating the dfm takes too much memory, you can use as.character to convert the tokens object to a character vector and count the bigrams with either dplyr or data.table. See the code below.

library(dplyr)
out_dp <- tibble(features = as.character(bitoks)) %>% 
  group_by(features) %>% 
  tally()

library(data.table)
out_dt <- data.table(features = as.character(bitoks))
out_dt <- out_dt[, .N, by = features]

Corpus object missing text

I'll be answering my own question with this:

writeLines(as.character(abstract[[1]]))
content(abstract[[1]])

But I still don't know how to get the full column as output.
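One possible way to get the text of every document at once is to loop over the corpus and collect each document's content into a single column. The sketch below uses tm's sample corpus "crude" as a stand-in, since the abstract corpus from the question is not available here; the names all_text and docs are made up for the example:

```r
library(tm)

data("crude") # stand-in corpus; swap in your own corpus object

# One string per document: collapse multi-line content with newlines
all_text <- vapply(
  seq_along(crude),
  function(i) paste(as.character(crude[[i]]), collapse = "\n"),
  character(1)
)

# One row per document, with the text in a single column
docs <- data.frame(text = all_text, stringsAsFactors = FALSE)
nrow(docs)
```

Indexing with [[i]] and calling as.character mirrors the single-document calls above, so the same pattern should carry over to other tm corpora whose documents are plain text.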
