Creating "Word" Cloud of Phrases, Not Individual Words in R

Your difficulty is that the tm functions treat each element of df$names as a "document" and split it into words. For example, the document "John A" contains the words "John" and "A". It sounds like you want to keep each name intact and simply count how often it occurs, which table does directly.

library(wordcloud)
df<-data.frame(theNames=c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul H C", "Paul H C"))
tb<-table(df$theNames)
wordcloud(names(tb), as.numeric(tb), scale = c(8, .3), min.freq = 1, max.words = 100,
          random.order = TRUE, rot.per = .15, colors = "black",
          vfont = c("sans serif", "plain"))
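
To see why table is enough here, it can help to look at the counts before plotting. The following is a minimal sketch in base R using the same sample names; phrase_counts is a name introduced here for illustration only.

```r
# Same sample data as above: each element is a whole phrase, not a document to split.
df <- data.frame(theNames = c("John", "John", "Joseph A", "Mary A",
                              "Mary A", "Paul H C", "Paul H C"))

# table() counts identical strings, so "Mary A" stays a single unit.
tb <- table(df$theNames)

# Convert to a plain named integer vector (illustrative name, not from the answer).
phrase_counts <- setNames(as.integer(tb), names(tb))
print(phrase_counts)
```

The names and values of this vector are exactly what the wordcloud call receives as its words and frequencies.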

Creating a word cloud in R

I'm not sure exactly what you are doing, but you can create a word cloud directly from a vector of words like this:

library(wordcloud)
library(tm)

data <- structure(c("", "newest", "managers", "are", "doing", "really", "well", 
                    "responses", "to", "client", "questions", "have", "been", "much", 
                    "better", "than", "expected", "for", "the", "short", "time", 
                    "they", "have", "been", "in", "their", "position", "", "trainee", 
                    "mentioned", "they", "didnt", "feel", "like", "they", "were", 
                    "getting", "enough", "supporthelp", "with", "the", "specific", 
                    "things", "their", "team", "does", "the", "team", "puts", "properties"
), .Dim = c(50L, 
            1L), .Dimnames = list(NULL, "billing"))
wordcloud(data)

Making a wordcloud, but with combined words?

Here's a solution using a different text package that lets you form multi-word expressions, either from statistically detected collocations or simply by forming all bigrams. The package is called quanteda.

library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.14’

First, the method for detecting the top 1,500 bigram collocations and replacing them in the texts with their single-token versions (joined by the "_" character). Here I am using the package's built-in corpus of US presidential inaugural addresses. Note that the calls below match the quanteda version printed above; later releases renamed much of this API, so the code may need adapting on a current installation.

### for just the top 1500 collocations
# detect the collocations
colls <- collocations(inaugCorpus, n = 1500, size = 2)

# remove collocations containing stopwords
colls <- removeFeatures(colls, stopwords("SMART"))
## Removed 1,224 (81.6%) of 1,500 collocations containing one of 570 stopwords.

# replace the phrases with single-token versions
inaugCorpusColl2 <- phrasetotoken(inaugCorpus, colls)

# create the document-feature matrix
inaugColl2dfm <- dfm(inaugCorpusColl2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... indexing features: 9,741 feature types
## ... removed 430 features, from 570 supplied (glob) feature types
## ... complete. 
## ... created a 57 x 9311 sparse dfm
## Elapsed time: 0.163 seconds.

# plot the wordcloud
set.seed(1000)
png("~/Desktop/wcloud1.png", width = 800, height = 800)
plot(inaugColl2dfm["2013-Obama", ], min.freq = 2, random.order = FALSE, 
     colors = sample(colors()[2:128]))
dev.off()

The resulting plot shows the collocations, such as "generation's_task" and "fellow_americans", as single tokens.

The version formed with all bigrams is easier, but it results in a huge number of low-frequency bigram features. For the word cloud, I selected a larger set of texts, not just the 2013 Obama address.

### version with all bi-grams
inaugbigramsDfm <- dfm(inaugCorpusColl2, ngrams = 2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... removed 54,200 features, from 570 supplied (glob) feature types
## ... indexing features: 64,108 feature types
## ... created a 57 x 9908 sparse dfm
## ... complete. 
## Elapsed time: 3.254 seconds.

# plot the bigram wordcloud - more texts because for a single speech, 
# almost none occur more than once
png("~/Desktop/wcloud2.png", width = 800, height = 800)
plot(inaugbigramsDfm[40:57, ], min.freq = 2, random.order = FALSE, 
     colors = sample(colors()[2:128]))
dev.off()

This produces a word cloud of the bigrams that occur at least twice across the selected addresses.
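
For readers who want to see what "forming all bigrams" amounts to without any text package, here is a rough base R sketch. It is not how quanteda implements it; the helper name make_bigrams and the sample sentence are made up for illustration, and tokenisation is a naive split on anything that is not a letter or apostrophe.

```r
# Hypothetical helper: join each pair of adjacent tokens with "_".
make_bigrams <- function(txt) {
  tokens <- strsplit(tolower(txt), "[^a-z']+")[[1]]
  tokens <- tokens[nzchar(tokens)]   # drop empty strings left by leading separators
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1), sep = "_")
}

# Made-up sample text, not taken from the inaugural corpus.
txt <- "My fellow Americans, my fellow citizens, thank you."
bigram_counts <- table(make_bigrams(txt))
print(sort(bigram_counts, decreasing = TRUE))
```

Most bigrams occur only once even in this tiny example, which is the same low-frequency problem described above. The counts could be passed to wordcloud as names(bigram_counts) and as.numeric(bigram_counts), as in the first example.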

Word cloud in R with multiple words and special characters

If you have a file as you specified, with one variable name per line, there is no need to use tm. You can easily build your own word frequency table to use as input. tm splits text on spaces, so it will not keep your multi-word variable names intact.

Starting from the point where the text is loaded, create a data.frame with a frequency column set to 1, then aggregate it to get the counts. The resulting columns can be passed straight to wordcloud. Note that I adjusted the scale a bit, because long variable names might not fit on the plot; you will get a warning message when this happens.

I'm not inserting the resulting picture.

library(wordcloud)  # also attaches RColorBrewer, which provides brewer.pal

#text <- readLines("./Overview_used_series.txt")
text <- c("S & P 500 dividend yield", "S & P 500 dividend yield", "S & P 500 dividend yield", 
          "visualize ", "occurence ", "variable names", "visualize ", "occurence ", 
          "variable names")

# freq = 1 adds a column of 1s, one for every value.
my_data <- data.frame(text = text, freq = 1, stringsAsFactors = FALSE)

# aggregate the data.    
my_agr <- aggregate(freq ~ ., data = my_data, sum)

wordcloud(words = my_agr$text, freq = my_agr$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"), scale = c(2, .5))
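
As a quick check of the aggregation step, the sketch below prints the frequency table that gets handed to wordcloud. It uses only base R. The trimws call is an addition that is not in the original answer, included on the assumption that trailing spaces such as the one in "visualize " are unintended.

```r
text <- c("S & P 500 dividend yield", "S & P 500 dividend yield",
          "S & P 500 dividend yield", "visualize ", "occurence ",
          "variable names", "visualize ", "occurence ", "variable names")

# Assumption: trailing whitespace is noise, so strip it before counting.
my_data <- data.frame(text = trimws(text), freq = 1, stringsAsFactors = FALSE)

# One row per distinct phrase, with freq summed over its occurrences.
my_agr <- aggregate(freq ~ text, data = my_data, FUN = sum)
print(my_agr)
```

Each multi-word variable name ends up as a single row, which is what keeps it together in the cloud.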

R How to make a word cloud from a tally?

Using rep and colSums:

words <- rep(names(word_tally), colSums(word_tally))
words
 [1] "scarred"  "scarred"  "happy"    "cheerful" "mad"     
 [6] "mad"      "mad"      "mad"      "mad"      "curious" 
[11] "curious"  "curious"  "curious"

Or since the frequencies are the column sums, using just the data.

wordcloud(names(word_tally), freq=colSums(word_tally), min.freq = 1)
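
The answer does not show word_tally itself. As a sketch, assume it is a data frame with one column per word and 0/1 entries marking whether a response used that word; the values below are invented so that the column sums match the output shown above.

```r
# Hypothetical tally: rows are responses, columns are words (values invented).
word_tally <- data.frame(
  scarred  = c(1, 1, 0, 0, 0),
  happy    = c(1, 0, 0, 0, 0),
  cheerful = c(0, 1, 0, 0, 0),
  mad      = c(1, 1, 1, 1, 1),
  curious  = c(1, 1, 1, 1, 0)
)

# Column sums are the per-word frequencies.
freqs <- colSums(word_tally)

# rep() repeats each word name by its frequency, giving one element per occurrence.
words <- rep(names(word_tally), freqs)
print(table(words)[names(freqs)])
```

Either form works as input: the expanded words vector for functions that count occurrences themselves, or names(word_tally) with freqs when the function accepts frequencies directly.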

Plotting sentences in Wordcloud in R

To plot all the sentences, you would need to reduce the scale values, which set the range of font sizes, so that the long strings fit on the plot.

library(wordcloud)

wordcloud(data$grp, data$freq, scale=c(.5,.3), random.order=TRUE, 
          colors="black", vfont=c("sans serif","plain"))

Removing Words from word cloud in R

You could simply filter word_freqs before constructing the data.frame:

word_freqs <- word_freqs[word_freqs > 2]
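
A small sketch of that filter, assuming word_freqs is a named numeric vector of counts (the question's actual object is not shown, and the values here are invented):

```r
# Hypothetical frequencies; in practice these would come from e.g. a term matrix.
word_freqs <- c(the = 12, cloud = 5, rare = 1, odd = 2, word = 7)

# Keep only words that occur more than twice; names are carried along.
word_freqs <- word_freqs[word_freqs > 2]

# Build the data.frame afterwards so dropped words never reach the plot.
dm <- data.frame(word = names(word_freqs), freq = unname(word_freqs))
print(dm)
```

Because the comparison is strict, a word with a count of exactly 2 is dropped as well.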
