Creating word cloud of phrases, not individual words in R
Your difficulty is that each element of df$names is being treated as a "document" by the functions of tm. For example, the document John A contains the words John and A. It sounds like you want to keep the names as they are and just count their occurrences; you can use table for that.
library(wordcloud)
df<-data.frame(theNames=c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul H C", "Paul H C"))
tb<-table(df$theNames)
wordcloud(names(tb),as.numeric(tb), scale=c(8,.3),min.freq=1,max.words=100, random.order=T, rot.per=.15, colors="black", vfont=c("sans serif","plain"))
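To make the difference concrete, here is a minimal base-R sketch comparing word-level counts with phrase-level counts. The whitespace split is only a rough stand-in for word tokenization (it is an assumption for illustration, not tm's actual tokenizer), and the sample vector simply mirrors the data above.

```r
# Sample data with the same shape as the example above
theNames <- c("John", "John", "Joseph A", "Mary A", "Mary A",
              "Paul H C", "Paul H C")

# Word-level counting: splitting on whitespace breaks each name apart,
# so "A" is counted on its own (assumed approximation of tokenizing)
word_counts <- table(unlist(strsplit(theNames, "\\s+")))

# Phrase-level counting: each element is kept whole
phrase_counts <- table(theNames)

print(word_counts)
print(phrase_counts)

# Optional plot, only attempted if the wordcloud package is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(phrase_counts), as.numeric(phrase_counts),
                       min.freq = 1)
}
```

Only the phrase-level table keeps "Mary A" and "Paul H C" as single entries, which is what the word cloud needs.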
Creating a word cloud in R
Not sure exactly what you are doing but you can create a word cloud from a vector like this.
library(wordcloud)
library(tm)
data <- structure(c("", "newest", "managers", "are", "doing", "really", "well",
"responses", "to", "client", "questions", "have", "been", "much",
"better", "than", "expected", "for", "the", "short", "time",
"they", "have", "been", "in", "their", "position", "", "trainee",
"mentioned", "they", "didnt", "feel", "like", "they", "were",
"getting", "enough", "supporthelp", "with", "the", "specific",
"things", "their", "team", "does", "the", "team", "puts", "properties"
), .Dim = c(50L,
1L), .Dimnames = list(NULL, "billing"))
wordcloud(data)
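If you would rather control the counts yourself than let wordcloud derive them, one option is to build the frequency table explicitly. The sketch below assumes a plain character vector (a shortened, hypothetical stand-in for the data above) and drops the empty strings before counting.

```r
# Hypothetical shortened vector, including empty strings like the data above
words <- c("", "newest", "managers", "the", "team", "the", "team", "they", "")

# Drop empty strings so they are not counted as a "word"
words <- words[nzchar(words)]

# Count each distinct word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# Optional plot, only attempted if the wordcloud package is installed;
# min.freq = 1 matters because wordcloud's default is min.freq = 3
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 1)
}
```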
Making a wordcloud, but with combined words?
Here's a solution using a different text package, that allows you to form multi-word expressions from either statistically detected collocations, or just by forming all bi-grams. The package is called quanteda.
library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.14’
First, the method for detecting the top 1,500 bigram collocations and replacing these collocations in the texts with their single-token versions (concatenated by the "_" character). Here I am using the package's built-in corpus of the US presidential inaugural address texts.
### for just the top 1500 collocations
# detect the collocations
colls <- collocations(inaugCorpus, n = 1500, size = 2)
# remove collocations containing stopwords
colls <- removeFeatures(colls, stopwords("SMART"))
## Removed 1,224 (81.6%) of 1,500 collocations containing one of 570 stopwords.
# replace the phrases with single-token versions
inaugCorpusColl2 <- phrasetotoken(inaugCorpus, colls)
# create the document-feature matrix
inaugColl2dfm <- dfm(inaugCorpusColl2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... indexing features: 9,741 feature types
## ... removed 430 features, from 570 supplied (glob) feature types
## ... complete.
## ... created a 57 x 9311 sparse dfm
## Elapsed time: 0.163 seconds.
# plot the wordcloud
set.seed(1000)
png("~/Desktop/wcloud1.png", width = 800, height = 800)
plot(inaugColl2dfm["2013-Obama", ], min.freq = 2, random.order = FALSE,
colors = sample(colors()[2:128]))
dev.off()
This results in the following plot. Note the collocations, such as "generation's_task" and "fellow_americans".
The version formed with all bigrams is easier, but results in a huge number of low frequency bigram features. For the word cloud, I selected a larger set of texts, not just the 2013 Obama address.
### version with all bi-grams
inaugbigramsDfm <- dfm(inaugCorpusColl2, ngrams = 2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... removed 54,200 features, from 570 supplied (glob) feature types
## ... indexing features: 64,108 feature types
## ... created a 57 x 9908 sparse dfm
## ... complete.
## Elapsed time: 3.254 seconds.
# plot the bigram wordcloud - more texts because for a single speech,
# almost none occur more than once
png("~/Desktop/wcloud2.png", width = 800, height = 800)
plot(inaugbigramsDfm[40:57, ], min.freq = 2, random.order = FALSE,
colors = sample(colors()[2:128]))
dev.off()
This produces:
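The quanteda function names above are specific to the package version shown, so as a version-independent illustration of the all-bigrams idea, here is a minimal base-R sketch. The sample sentence and the whitespace tokenization are assumptions made for the example; this is not how quanteda tokenizes internally.

```r
# Hypothetical sample text
txt <- "we the people of the united states we the people"

# Naive tokenization: lowercase, then split on whitespace
tokens <- strsplit(tolower(txt), "\\s+")[[1]]

# Pair each token with its successor and join the pair with "_"
bigrams <- paste(head(tokens, -1), tail(tokens, -1), sep = "_")

# Count the bigrams; most occur only once, even in this tiny example
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(bigram_freq)
```

Even here, most bigrams appear only once, which is why the bigram cloud above pools several speeches and uses min.freq = 2.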
Word cloud in R with multiple words and special characters
If you have a file as you specified, with one variable name per line, there is no need to use tm. You can easily create your own word frequency table to use as input. tm splits the text into words on spaces, so it will not keep your multi-word variable names intact.
Starting from the point where the text is loaded, create a data.frame in which the frequency is set to 1 for every row, then aggregate it to get the count per name. wordcloud can take its words and frequencies from the columns of that data.frame. Note that I adjusted the scale a bit, because long variable names might not fit on the plot and are then left out. You will get a warning message when this happens.
I'm not inserting the resulting picture.
library(wordcloud)
library(RColorBrewer)  # for brewer.pal()
#text <- readLines("./Overview_used_series.txt")
text <- c("S & P 500 dividend yield", "S & P 500 dividend yield", "S & P 500 dividend yield",
"visualize ", "occurence ", "variable names", "visualize ", "occurence ",
"variable names")
# freq = 1 adds a column holding a 1 for every value.
my_data <- data.frame(text = text, freq = 1, stringsAsFactors = FALSE)
# aggregate the data.
my_agr <- aggregate(freq ~ ., data = my_data, sum)
wordcloud(words = my_agr$text, freq = my_agr$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"), scale = c(2, .5))
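As a possible alternative to the data.frame plus aggregate route, the same counts can be produced with table. The sketch below also trims surrounding whitespace first, which is an added assumption: it only matters if the same name appears both with and without a trailing space. The column names text and freq are chosen to match the code above.

```r
# Same sample values as above, some with trailing spaces
text <- c("S & P 500 dividend yield", "S & P 500 dividend yield",
          "S & P 500 dividend yield", "visualize ", "occurence ",
          "variable names", "visualize ", "occurence ", "variable names")

# Trim whitespace, count each distinct phrase, and convert to a data.frame
my_freq <- as.data.frame(table(text = trimws(text)),
                         responseName = "freq",
                         stringsAsFactors = FALSE)
print(my_freq)
```

The resulting my_freq$text and my_freq$freq columns can be passed to wordcloud in the same way as my_agr$text and my_agr$freq.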
R How to make a word cloud from a tally?
Using rep and colSums:
words <- rep(names(word_tally), colSums(word_tally))
words
[1] "scarred" "scarred" "happy" "cheerful" "mad"
[6] "mad" "mad" "mad" "mad" "curious"
[11] "curious" "curious" "curious"
Or since the frequencies are the column sums, using just the data.
wordcloud(names(word_tally), freq=colSums(word_tally), min.freq = 1)
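Since word_tally comes from the question and is not defined here, the following sketch builds a hypothetical tally (one column per word, one row per respondent) whose column sums match the output shown above. The layout of the data frame is an assumption for illustration.

```r
# Hypothetical tally: 1 means the respondent used the word
word_tally <- data.frame(
  scarred  = c(1, 1, 0, 0, 0),
  happy    = c(0, 0, 1, 0, 0),
  cheerful = c(0, 1, 0, 0, 0),
  mad      = c(1, 1, 1, 1, 1),
  curious  = c(1, 1, 1, 1, 0)
)

# Column sums are the per-word frequencies
freq <- colSums(word_tally)

# Repeat each word name as many times as it was tallied
words <- rep(names(word_tally), freq)
print(words)

# Optional plot straight from the frequencies, if wordcloud is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), freq = freq, min.freq = 1)
}
```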
Plotting sentences in Wordcloud in R
To plot all the sentences you would need to reduce the scale value.
library(wordcloud)
wordcloud(data$grp, data$freq, scale=c(.5,.3), random.order=TRUE,
colors="black", vfont=c("sans serif","plain"))
Removing Words from word cloud in R
You could simply filter word_freqs before constructing the data.frame:
word_freqs <- word_freqs[word_freqs > 2]
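For illustration, here is that filter applied to a small, hypothetical named vector of counts, followed by the data.frame construction. The names word_df, word, and freq are assumptions, since the original question's code is not shown here.

```r
# Hypothetical named vector of word counts
word_freqs <- c(the = 5, cat = 2, sat = 3, mat = 1)

# Keep only words that occur more than twice (a count of exactly 2 is dropped)
word_freqs <- word_freqs[word_freqs > 2]

# Build the data.frame from the filtered counts
word_df <- data.frame(word = names(word_freqs),
                      freq = as.numeric(word_freqs),
                      stringsAsFactors = FALSE)
print(word_df)
```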