Find the Most Frequently Occurring Words in a Text in R

How do I find the most frequent words for each observation in R?

Try this:

library(tokenizers)
library(stopwords)
library(tidyverse)

# count freq of words
words_as_tokens <- setNames(
  lapply(
    sapply(dat$description,
           tokenize_words,
           stopwords = stopwords(language = "en", source = "smart")),
    function(x) as.data.frame(sort(table(x), TRUE), stringsAsFactors = FALSE)
  ),
  dat$name
)

# tidyverse's job
df <- words_as_tokens %>%
  bind_rows(.id = "name") %>%
  rename(word = x)

# output
df

# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
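
If you only need the single most frequent word per person rather than the full frequency table, a short follow-up step with dplyr could look like the sketch below (slice_max() is available in dplyr 1.0.0 and later; ties are kept on purpose):

# keep only the top word(s) per name, ties included
df %>%
  group_by(name) %>%
  slice_max(Freq, n = 1, with_ties = TRUE) %>%
  ungroup()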

Data

dat <- data.frame(name = c("John", "Alex"),
description = c("Unprecedented. The perfect word to describe Amazon. In every positive sense of that word! All because of one man - Jeff Bezos. What an entrepreneur! What a vision! This is from personal experience. Let me explain. I had given up all hope, after a horrible experience with Amazon.ae (formerly Souq.com) - due to a Herculean effort to get an order cancelled and the subsequent refund issued. I have never faced such a feedback-resistant team in my life! They were robotically answering my calls and sending me monotonous, unhelpful emails, followed by absolutely zero action!",
"Not only does Amazon have great products but their Customer Service for the most part is wonderful. Although most times you are outsourced to a different country, I personally have found that when I call it's either South Africa or Philippines and they speak so well, understand me and my NY accent and are quite nice. Let’s face it. Most times you are calling CS with a problem or issue. These agents have to listen to 8 hours of complaints so they themselves need a break. No matter how annoyed I am I try to be on my best behavior and as nice as can be because they too need a break with how nasty we as a society can be."), stringsAsFactors = F)

How do I determine the most frequent word per variable in R?

Starting with the following dataframe:

head(df)
# A tibble: 6 x 4
Date Time Sender Message
<date> <chr> <chr> <fct>
1 2020-01-01 00:00:00 Person1 C
2 2020-01-01 01:00:00 Person1 C
3 2020-01-01 02:00:00 Person1 B
4 2020-01-01 03:00:00 Person1 B
5 2020-01-01 04:00:00 Person1 C
6 2020-01-01 05:00:00 Person1 E

You can first filter for specific hours by creating a Date_Time column with the ymd_hms() function from the lubridate package, then use filter() from dplyr to keep only messages sent between 9 AM and 5 PM.

library(lubridate)
library(dplyr)
df %>%
  mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17)

# A tibble: 18 x 5
Date Time Sender Message Date_Time
<date> <chr> <chr> <fct> <dttm>
1 2020-01-01 09:00:00 Person1 C 2020-01-01 09:00:00
2 2020-01-01 10:00:00 Person1 E 2020-01-01 10:00:00
3 2020-01-01 11:00:00 Person1 C 2020-01-01 11:00:00
4 2020-01-01 12:00:00 Person1 C 2020-01-01 12:00:00
5 2020-01-01 13:00:00 Person1 A 2020-01-01 13:00:00
6 2020-01-01 14:00:00 Person1 D 2020-01-01 14:00:00
7 2020-01-01 15:00:00 Person1 A 2020-01-01 15:00:00
8 2020-01-02 16:00:00 Person1 A 2020-01-02 16:00:00
9 2020-01-02 17:00:00 Person1 E 2020-01-02 17:00:00
10 2020-01-01 09:00:00 Person2 D 2020-01-01 09:00:00
11 2020-01-01 10:00:00 Person2 E 2020-01-01 10:00:00
12 2020-01-01 11:00:00 Person2 E 2020-01-01 11:00:00
13 2020-01-01 12:00:00 Person2 C 2020-01-01 12:00:00
14 2020-01-01 13:00:00 Person2 A 2020-01-01 13:00:00
15 2020-01-01 14:00:00 Person2 B 2020-01-01 14:00:00
16 2020-01-01 15:00:00 Person2 E 2020-01-01 15:00:00
17 2020-01-02 16:00:00 Person2 E 2020-01-02 16:00:00
18 2020-01-02 17:00:00 Person2 D 2020-01-02 17:00:00

Then you can group by Sender and Message to count how often each message occurs, and filter for the maximal count per sender.

df %>%
  mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
  group_by(Sender, Message) %>%
  count() %>%
  group_by(Sender) %>%
  filter(n == max(n))

# A tibble: 3 x 3
# Groups: Sender [2]
Sender Message n
<chr> <fct> <int>
1 Person1 A 3
2 Person1 C 3
3 Person2 E 4

If you want to know the number of messages sent by each sender in a certain period of time, you can do:

df %>%
  mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
  group_by(Sender) %>%
  count()

# A tibble: 2 x 2
# Groups: Sender [2]
Sender n
<chr> <int>
1 Person1 9
2 Person2 9

Does that answer your question?

Data

structure(list(Date = structure(c(18262, 18262, 18262, 18262, 
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18262, 18262, 18263, 18263, 18263, 18263, 18263, 18263,
18263, 18263, 18263, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18263, 18263, 18263, 18263, 18263, 18263, 18263, 18263,
18263), class = "Date"), Time = c("00:00:00", "01:00:00", "02:00:00",
"03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", "08:00:00",
"09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00",
"15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", "20:00:00",
"21:00:00", "22:00:00", "23:00:00", "00:00:00", "00:00:00", "01:00:00",
"02:00:00", "03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00",
"08:00:00", "09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00",
"14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00",
"20:00:00", "21:00:00", "22:00:00", "23:00:00", "00:00:00"),
Sender = c("Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2"), Message = structure(c(3L,
3L, 2L, 2L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 3L, 3L, 1L, 4L, 1L,
1L, 5L, 3L, 2L, 2L, 1L, 3L, 4L, 1L, 3L, 5L, 4L, 2L, 5L, 1L,
1L, 2L, 3L, 4L, 5L, 5L, 3L, 1L, 2L, 5L, 5L, 4L, 5L, 2L, 1L,
1L, 3L, 1L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor")), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))

Write a function that finds the most common word in a string of text using R

Here is a function I designed. Note that I split the string on white space, remove any leading or trailing white space, remove periods, and convert everything to lower case. Finally, if there is a tie, only the first word is reported. These are assumptions you should think about for your own analysis.

# Create example string
string <- "This is a very short sentence. It has only a few words."

library(stringr)

most_common_word <- function(string){
  string1 <- str_split(string, pattern = " ")[[1]]      # Split the string
  string2 <- str_trim(string1)                          # Remove white space
  string3 <- str_replace_all(string2, fixed("."), "")   # Remove periods
  string4 <- tolower(string3)                           # Convert to lower case
  word_count <- table(string4)                          # Count each word
  return(names(word_count[which.max(word_count)][1]))   # Report the most common word
}

most_common_word(string)
[1] "a"

Using R to find top ten words in a text

To get word frequency:

> mytext = c("This","is","a","test","for","count","of","the","words","The","words","have","been","written","very","randomly","so","that","the","test","can","be","for","checking","the","count")

> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that The This very
3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
written
1

To ignore case:

> mytext = tolower(mytext)
>
> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that this very written
4 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>

For top ten words only:

> sort(table(mytext), decreasing=T)[1:10]
mytext
the count for test words a be been can checking
4 2 2 2 2 1 1 1 1 1
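
If you are starting from a single string rather than a pre-split word vector, you can build mytext yourself first; a simple sketch (whitespace split, punctuation stripped, lower-cased) might be:

text   <- "This is a test for count of the words. The words have been written very randomly."
mytext <- unlist(strsplit(text, "\\s+"))             # split on white space
mytext <- tolower(gsub("[[:punct:]]", "", mytext))   # strip punctuation, lower-case
sort(table(mytext), decreasing = TRUE)[1:10]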

How to calculate the most frequently occurring terms/words in a document collection/corpus using R?

You can use

sorted.sums[sorted.sums > 5][1:4]

But if at least four values are greater than 5, simply using sorted.sums[1:4] should work as well.

To get the words themselves, you can use names():

names(sorted.sums[sorted.sums > 5][1:4])
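
For context, sorted.sums is typically built by summing term counts over a document-term matrix. A minimal sketch with a toy tm corpus (the documents and the threshold below are placeholders, not the original data) could be:

library(tm)

docs <- VCorpus(VectorSource(c("the quick brown fox",
                               "the lazy dog",
                               "the quick dog")))
dtm <- DocumentTermMatrix(docs)

term.sums   <- colSums(as.matrix(dtm))       # total count of each term across all documents
sorted.sums <- sort(term.sums, decreasing = TRUE)

names(sorted.sums[sorted.sums > 1][1:2])     # the two most frequent terms occurring more than once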

R: find most frequent group of words in corpus

If I remember correctly, you can construct a TermDocumentMatrix of bigrams (pairs of adjacent words) using RWeka, and then process it as needed:

library("tm") #text mining
library("RWeka") # for tokenization algorithms more complicated than single-word

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# process tdm
# findFreqTerms(tdm, lowfreq=3, highfreq=Inf)
# ...

tdm <- removeSparseTerms(tdm, 0.99)
print("----")
print("tdm properties")
str(tdm)
tdm_top_N_percent = tdm$nrow / 100 * topN_percentage_wanted  # topN_percentage_wanted must be set by the user

Alternatively,

# n-grams from 1 word up to 5 words long
wmin <- 1
wmax <- 5

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = wmin, max = wmax))

Sometimes it helps to perform word stemming first in order to get "better" word groups.
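
Putting the pieces together, a self-contained sketch might look like the following; the toy corpus is made up for illustration, and stemming via tm_map()/stemDocument() requires the SnowballC package:

library(tm)
library(RWeka)
library(SnowballC)

corpus <- VCorpus(VectorSource(c("good customer service and fast delivery",
                                 "fast delivery and good customer service",
                                 "slow delivery but good customer service")))

# optional: stem the words first so related word forms collapse together
corpus <- tm_map(corpus, stemDocument)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# bigrams that occur at least twice in total across the corpus
findFreqTerms(tdm, lowfreq = 2)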

How to get the most common phrases or words in Python or R

If you plan to use R, I would recommend this vignette: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html
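
If you do go the udpipe route, a minimal sketch of the approach from that vignette could look like this; the example sentences are placeholders, and RAKE keyword extraction is just one of the several options the vignette covers:

library(udpipe)

model <- udpipe_download_model(language = "english")
ud    <- udpipe_load_model(model$file_model)

x <- udpipe_annotate(ud, x = c("great customer service", "customer service was slow"))
x <- as.data.frame(x)

# most common lemmas, restricted to nouns and adjectives
freq <- txt_freq(subset(x, upos %in% c("NOUN", "ADJ"))$lemma)
head(freq)

# multi-word key phrases via RAKE
phrases <- keywords_rake(x, term = "lemma", group = "doc_id",
                         relevant = x$upos %in% c("NOUN", "ADJ"))
head(phrases)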


