Find the Most Frequently Occurring Words in a Text in R

How do I find the most frequent words for each observation in R?

Try this:

library(tokenizers)
library(stopwords)
library(tidyverse)

# count freq of words
words_as_tokens <- setNames(
  lapply(
    sapply(dat$description,
           tokenize_words,
           stopwords = stopwords(language = "en", source = "smart")),
    function(x) as.data.frame(sort(table(x), TRUE), stringsAsFactors = FALSE)
  ),
  dat$name
)

# tidyverse's job
df <- words_as_tokens %>%
  bind_rows(.id = "name") %>%
  rename(word = x)

# output
df

# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
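
If you only need the single most frequent word per person rather than the full frequency table, a short follow-up step with dplyr could look like the sketch below (slice_max() is available in dplyr 1.0.0 and later; ties are kept on purpose):

# keep only the top word(s) per name, ties included
df %>%
  group_by(name) %>%
  slice_max(Freq, n = 1, with_ties = TRUE) %>%
  ungroup()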

Data

dat <- data.frame(name = c("John", "Alex"),
description = c("Unprecedented. The perfect word to describe Amazon. In every positive sense of that word! All because of one man - Jeff Bezos. What an entrepreneur! What a vision! This is from personal experience. Let me explain. I had given up all hope, after a horrible experience with Amazon.ae (formerly Souq.com) - due to a Herculean effort to get an order cancelled and the subsequent refund issued. I have never faced such a feedback-resistant team in my life! They were robotically answering my calls and sending me monotonous, unhelpful emails, followed by absolutely zero action!",
"Not only does Amazon have great products but their Customer Service for the most part is wonderful. Although most times you are outsourced to a different country, I personally have found that when I call it's either South Africa or Philippines and they speak so well, understand me and my NY accent and are quite nice. Let’s face it. Most times you are calling CS with a problem or issue. These agents have to listen to 8 hours of complaints so they themselves need a break. No matter how annoyed I am I try to be on my best behavior and as nice as can be because they too need a break with how nasty we as a society can be."), stringsAsFactors = F)

How do I determine the most frequent word per variable in R?

Starting with the following dataframe:

head(df)
# A tibble: 6 x 4
Date Time Sender Message
<date> <chr> <chr> <fct>
1 2020-01-01 00:00:00 Person1 C
2 2020-01-01 01:00:00 Person1 C
3 2020-01-01 02:00:00 Person1 B
4 2020-01-01 03:00:00 Person1 B
5 2020-01-01 04:00:00 Person1 C
6 2020-01-01 05:00:00 Person1 E

You can first filter for specific hours by creating a Date_Time column with the ymd_hms() function from the lubridate package, then use filter() from dplyr to keep only messages sent between 9 AM and 5 PM.

library(lubridate)
library(dplyr)
df %>%
  mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17)

# A tibble: 18 x 5
Date Time Sender Message Date_Time
<date> <chr> <chr> <fct> <dttm>
1 2020-01-01 09:00:00 Person1 C 2020-01-01 09:00:00
2 2020-01-01 10:00:00 Person1 E 2020-01-01 10:00:00
3 2020-01-01 11:00:00 Person1 C 2020-01-01 11:00:00
4 2020-01-01 12:00:00 Person1 C 2020-01-01 12:00:00
5 2020-01-01 13:00:00 Person1 A 2020-01-01 13:00:00
6 2020-01-01 14:00:00 Person1 D 2020-01-01 14:00:00
7 2020-01-01 15:00:00 Person1 A 2020-01-01 15:00:00
8 2020-01-02 16:00:00 Person1 A 2020-01-02 16:00:00
9 2020-01-02 17:00:00 Person1 E 2020-01-02 17:00:00
10 2020-01-01 09:00:00 Person2 D 2020-01-01 09:00:00
11 2020-01-01 10:00:00 Person2 E 2020-01-01 10:00:00
12 2020-01-01 11:00:00 Person2 E 2020-01-01 11:00:00
13 2020-01-01 12:00:00 Person2 C 2020-01-01 12:00:00
14 2020-01-01 13:00:00 Person2 A 2020-01-01 13:00:00
15 2020-01-01 14:00:00 Person2 B 2020-01-01 14:00:00
16 2020-01-01 15:00:00 Person2 E 2020-01-01 15:00:00
17 2020-01-02 16:00:00 Person2 E 2020-01-02 16:00:00
18 2020-01-02 17:00:00 Person2 D 2020-01-02 17:00:00

Then you can group by Sender and Message to count how often each message occurs, and filter for the maximal count per sender.

df %>%
  mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
  group_by(Sender, Message) %>%
  count() %>%
  group_by(Sender) %>%
  filter(n == max(n))

# A tibble: 3 x 3
# Groups: Sender [2]
Sender Message n
<chr> <fct> <int>
1 Person1 A 3
2 Person1 C 3
3 Person2 E 4

If you want to know the number of messages sent by each sender in a certain period of time, you can do:

df %>%
  mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
  group_by(Sender) %>%
  count()

# A tibble: 2 x 2
# Groups: Sender [2]
Sender n
<chr> <int>
1 Person1 9
2 Person2 9

Does that answer your question?

Data

structure(list(Date = structure(c(18262, 18262, 18262, 18262, 
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18262, 18262, 18263, 18263, 18263, 18263, 18263, 18263,
18263, 18263, 18263, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18263, 18263, 18263, 18263, 18263, 18263, 18263, 18263,
18263), class = "Date"), Time = c("00:00:00", "01:00:00", "02:00:00",
"03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", "08:00:00",
"09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00",
"15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", "20:00:00",
"21:00:00", "22:00:00", "23:00:00", "00:00:00", "00:00:00", "01:00:00",
"02:00:00", "03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00",
"08:00:00", "09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00",
"14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00",
"20:00:00", "21:00:00", "22:00:00", "23:00:00", "00:00:00"),
Sender = c("Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2"), Message = structure(c(3L,
3L, 2L, 2L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 3L, 3L, 1L, 4L, 1L,
1L, 5L, 3L, 2L, 2L, 1L, 3L, 4L, 1L, 3L, 5L, 4L, 2L, 5L, 1L,
1L, 2L, 3L, 4L, 5L, 5L, 3L, 1L, 2L, 5L, 5L, 4L, 5L, 2L, 1L,
1L, 3L, 1L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor")), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))

Write a function that finds the most common word in a string of text using R

Here is a function I designed. Note that I split the string on white space, remove any leading or trailing white space, remove periods, and convert everything to lower case. Finally, if there is a tie, only the first word is reported. These are assumptions you should think about for your own analysis.

# Create example string
string <- "This is a very short sentence. It has only a few words."

library(stringr)

most_common_word <- function(string){
  string1 <- str_split(string, pattern = " ")[[1]]      # Split the string
  string2 <- str_trim(string1)                          # Remove white space
  string3 <- str_replace_all(string2, fixed("."), "")   # Remove periods
  string4 <- tolower(string3)                           # Convert to lower case
  word_count <- table(string4)                          # Count each word
  return(names(word_count[which.max(word_count)][1]))   # Report the most common word
}

most_common_word(string)
[1] "a"

Using R to find top ten words in a text

To get word frequency:

> mytext = c("This","is","a","test","for","count","of","the","words","The","words","have","been","written","very","randomly","so","that","the","test","can","be","for","checking","the","count")

> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that The This very
3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
written
1

To ignore case:

> mytext = tolower(mytext)
>
> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that this very written
4 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>

For top ten words only:

> sort(table(mytext), decreasing=T)[1:10]
mytext
the count for test words a be been can checking
4 2 2 2 2 1 1 1 1 1
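
If you are starting from a single string rather than a pre-split word vector, you can build mytext yourself first; a simple sketch (whitespace split, punctuation stripped, lower-cased) might be:

text   <- "This is a test for count of the words. The words have been written very randomly."
mytext <- unlist(strsplit(text, "\\s+"))             # split on white space
mytext <- tolower(gsub("[[:punct:]]", "", mytext))   # strip punctuation, lower-case
sort(table(mytext), decreasing = TRUE)[1:10]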

How to calculate the most frequently occurring terms/words in a document collection/corpus using R?

You can use

sorted.sums[sorted.sums > 5][1:4]

But if at least four values are greater than 5, simply using sorted.sums[1:4] should work as well.

To get the words themselves, you can use names():

names(sorted.sums[sorted.sums > 5][1:4])
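
For context, sorted.sums is typically built by summing term counts over a document-term matrix. A minimal sketch with a toy tm corpus (the documents and the threshold below are placeholders, not the original data) could be:

library(tm)

docs <- VCorpus(VectorSource(c("the quick brown fox",
                               "the lazy dog",
                               "the quick dog")))
dtm <- DocumentTermMatrix(docs)

term.sums   <- colSums(as.matrix(dtm))       # total count of each term across all documents
sorted.sums <- sort(term.sums, decreasing = TRUE)

names(sorted.sums[sorted.sums > 1][1:2])     # the two most frequent terms occurring more than once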

R: find most frequent group of words in corpus

If I remember correctly, you can construct a TermDocumentMatrix of bigrams (pairs of adjacent words) using RWeka, and then process it as needed:

library("tm") #text mining
library("RWeka") # for tokenization algorithms more complicated than single-word

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# process tdm
# findFreqTerms(tdm, lowfreq=3, highfreq=Inf)
# ...

tdm <- removeSparseTerms(tdm, 0.99)
print("----")
print("tdm properties")
str(tdm)
tdm_top_N_percent = tdm$nrow / 100 * topN_percentage_wanted  # topN_percentage_wanted must be set by the user

Alternatively,

# n-grams from 1 word up to 5 words long
wmin <- 1
wmax <- 5

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = wmin, max = wmax))

Sometimes it helps to perform word stemming first in order to get "better" word groups.
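
Putting the pieces together, a self-contained sketch might look like the following; the toy corpus is made up for illustration, and stemming via tm_map()/stemDocument() requires the SnowballC package:

library(tm)
library(RWeka)
library(SnowballC)

corpus <- VCorpus(VectorSource(c("good customer service and fast delivery",
                                 "fast delivery and good customer service",
                                 "slow delivery but good customer service")))

# optional: stem the words first so related word forms collapse together
corpus <- tm_map(corpus, stemDocument)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# bigrams that occur at least twice in total across the corpus
findFreqTerms(tdm, lowfreq = 2)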

How to get the most common phrases or words in Python or R

If you plan to use R, I would recommend this vignette: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html
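
If you do go the udpipe route, a minimal sketch of the approach from that vignette could look like this; the example sentences are placeholders, and RAKE keyword extraction is just one of the several options the vignette covers:

library(udpipe)

model <- udpipe_download_model(language = "english")
ud    <- udpipe_load_model(model$file_model)

x <- udpipe_annotate(ud, x = c("great customer service", "customer service was slow"))
x <- as.data.frame(x)

# most common lemmas, restricted to nouns and adjectives
freq <- txt_freq(subset(x, upos %in% c("NOUN", "ADJ"))$lemma)
head(freq)

# multi-word key phrases via RAKE
phrases <- keywords_rake(x, term = "lemma", group = "doc_id",
                         relevant = x$upos %in% c("NOUN", "ADJ"))
head(phrases)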


