How do I find most frequent words by each observation in R?
Try this
library(tokenizers)
library(stopwords)
library(tidyverse)
# count freq of words
words_as_tokens <- setNames(
  lapply(sapply(dat$description,
                tokenize_words,
                stopwords = stopwords(language = "en", source = "smart")),
         function(x) as.data.frame(sort(table(x), decreasing = TRUE),
                                   stringsAsFactors = FALSE)),
  dat$name)
# tidyverse's job
df <- words_as_tokens %>%
  bind_rows(.id = "name") %>%
  rename(word = x)
# output
df
# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
Data
dat <- data.frame(name = c("John", "Alex"),
description = c("Unprecedented. The perfect word to describe Amazon. In every positive sense of that word! All because of one man - Jeff Bezos. What an entrepreneur! What a vision! This is from personal experience. Let me explain. I had given up all hope, after a horrible experience with Amazon.ae (formerly Souq.com) - due to a Herculean effort to get an order cancelled and the subsequent refund issued. I have never faced such a feedback-resistant team in my life! They were robotically answering my calls and sending me monotonous, unhelpful emails, followed by absolutely zero action!",
"Not only does Amazon have great products but their Customer Service for the most part is wonderful. Although most times you are outsourced to a different country, I personally have found that when I call it's either South Africa or Philippines and they speak so well, understand me and my NY accent and are quite nice. Let’s face it. Most times you are calling CS with a problem or issue. These agents have to listen to 8 hours of complaints so they themselves need a break. No matter how annoyed I am I try to be on my best behavior and as nice as can be because they too need a break with how nasty we as a society can be."), stringsAsFactors = F)
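The same per-person word counts can also be sketched with tidytext's `unnest_tokens()` (assuming the tidytext and stopwords packages are installed); the toy data frame below stands in for the `dat` above:

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for the `dat` data frame above
toy <- data.frame(
  name        = c("John", "Alex"),
  description = c("amazon refund amazon experience", "agents accent agents break"),
  stringsAsFactors = FALSE
)

freqs <- toy %>%
  unnest_tokens(word, description) %>%                    # one row per word
  anti_join(get_stopwords(source = "smart"), by = "word") %>%  # drop stopwords
  count(name, word, sort = TRUE)                          # frequency per person

freqs
```

Here `unnest_tokens()` replaces the `tokenize_words()` + `table()` + `bind_rows()` chain in one pipeline.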
How do I determine the most frequent word per variable in R?
Starting with the following dataframe:
head(df)
# A tibble: 6 x 4
Date Time Sender Message
<date> <chr> <chr> <fct>
1 2020-01-01 00:00:00 Person1 C
2 2020-01-01 01:00:00 Person1 C
3 2020-01-01 02:00:00 Person1 B
4 2020-01-01 03:00:00 Person1 B
5 2020-01-01 04:00:00 Person1 C
6 2020-01-01 05:00:00 Person1 E
You can first build a Date_Time column with the function ymd_hms from the lubridate package, then use the filter function from dplyr to keep only messages sent between 9 AM and 5 PM.
library(lubridate)
library(dplyr)
df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17)
# A tibble: 18 x 5
Date Time Sender Message Date_Time
<date> <chr> <chr> <fct> <dttm>
1 2020-01-01 09:00:00 Person1 C 2020-01-01 09:00:00
2 2020-01-01 10:00:00 Person1 E 2020-01-01 10:00:00
3 2020-01-01 11:00:00 Person1 C 2020-01-01 11:00:00
4 2020-01-01 12:00:00 Person1 C 2020-01-01 12:00:00
5 2020-01-01 13:00:00 Person1 A 2020-01-01 13:00:00
6 2020-01-01 14:00:00 Person1 D 2020-01-01 14:00:00
7 2020-01-01 15:00:00 Person1 A 2020-01-01 15:00:00
8 2020-01-02 16:00:00 Person1 A 2020-01-02 16:00:00
9 2020-01-02 17:00:00 Person1 E 2020-01-02 17:00:00
10 2020-01-01 09:00:00 Person2 D 2020-01-01 09:00:00
11 2020-01-01 10:00:00 Person2 E 2020-01-01 10:00:00
12 2020-01-01 11:00:00 Person2 E 2020-01-01 11:00:00
13 2020-01-01 12:00:00 Person2 C 2020-01-01 12:00:00
14 2020-01-01 13:00:00 Person2 A 2020-01-01 13:00:00
15 2020-01-01 14:00:00 Person2 B 2020-01-01 14:00:00
16 2020-01-01 15:00:00 Person2 E 2020-01-01 15:00:00
17 2020-01-02 16:00:00 Person2 E 2020-01-02 16:00:00
18 2020-01-02 17:00:00 Person2 D 2020-01-02 17:00:00
Then, you can group_by each sender and message to count how often each message occurs, and filter for the maximal frequency per sender.
df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
group_by(Sender, Message) %>% count() %>%
group_by(Sender) %>%
filter(n == max(n))
# A tibble: 3 x 3
# Groups: Sender [2]
Sender Message n
<chr> <fct> <int>
1 Person1 A 3
2 Person1 C 3
3 Person2 E 4
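With dplyr 1.0 or later, `slice_max()` expresses the "keep the rows with the maximal n" step more directly than `filter(n == max(n))`; a sketch on stand-in counts (the tibble below is made up to mimic the grouped result above):

```r
library(dplyr)

# Made-up per-sender message counts, mimicking the grouped result above
counts <- tibble(
  Sender  = c("Person1", "Person1", "Person1", "Person2", "Person2"),
  Message = c("A", "B", "C", "D", "E"),
  n       = c(3L, 1L, 3L, 2L, 4L)
)

counts %>%
  group_by(Sender) %>%
  slice_max(n, with_ties = TRUE)  # keeps both A and C for Person1
```

`with_ties = TRUE` (the default) matches the tie behavior of `filter(n == max(n))`.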
If you want to know the number of messages sent by each sender in a certain period of time, you can do:
df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
group_by(Sender) %>% count()
# A tibble: 2 x 2
# Groups: Sender [2]
Sender n
<chr> <int>
1 Person1 9
2 Person2 9
Does that answer your question?
Data
structure(list(Date = structure(c(18262, 18262, 18262, 18262,
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18262, 18262, 18263, 18263, 18263, 18263, 18263, 18263,
18263, 18263, 18263, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18263, 18263, 18263, 18263, 18263, 18263, 18263, 18263,
18263), class = "Date"), Time = c("00:00:00", "01:00:00", "02:00:00",
"03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", "08:00:00",
"09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00",
"15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", "20:00:00",
"21:00:00", "22:00:00", "23:00:00", "00:00:00", "00:00:00", "01:00:00",
"02:00:00", "03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00",
"08:00:00", "09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00",
"14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00",
"20:00:00", "21:00:00", "22:00:00", "23:00:00", "00:00:00"),
Sender = c("Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2"), Message = structure(c(3L,
3L, 2L, 2L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 3L, 3L, 1L, 4L, 1L,
1L, 5L, 3L, 2L, 2L, 1L, 3L, 4L, 1L, 3L, 5L, 4L, 2L, 5L, 1L,
1L, 2L, 3L, 4L, 5L, 5L, 3L, 1L, 2L, 5L, 5L, 4L, 5L, 2L, 1L,
1L, 3L, 1L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor")), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))
Write a function that finds the most common word in a string of text using R
Here is a function I designed. Note that I split the string on white space, removed any leading or trailing white space, removed ".", and converted everything to lower case. Finally, if there is a tie, the first word is reported. These are assumptions you should think about for your own analysis.
# Create example string
string <- "This is a very short sentence. It has only a few words."
library(stringr)
most_common_word <- function(string){
string1 <- str_split(string, pattern = " ")[[1]] # Split the string
string2 <- str_trim(string1) # Remove white space
string3 <- str_replace_all(string2, fixed("."), "") # Remove dot
string4 <- tolower(string3) # Convert to lower case
word_count <- table(string4) # Count the word number
return(names(which.max(word_count))) # Report the most common word (first on ties)
}
most_common_word(string)
[1] "a"
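If you would rather report every word tied for the maximum instead of only the first, a base-R variant under the same splitting and lower-casing assumptions:

```r
most_common_words <- function(string) {
  # Split on whitespace, strip dots, lower-case
  words  <- tolower(gsub(".", "", strsplit(string, "\\s+")[[1]], fixed = TRUE))
  counts <- table(words)
  names(counts)[counts == max(counts)]  # all words sharing the top count
}

most_common_words("the cat and the dog and")
# [1] "and" "the"
```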
Using R to find top ten words in a text
To get word frequency:
> mytext = c("This","is","a","test","for","count","of","the","words","The","words","have","been","written","very","randomly","so","that","the","test","can","be","for","checking","the","count")
> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that The This very
3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
written
1
To ignore case:
> mytext = tolower(mytext)
>
> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that this very written
4 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
For top ten words only:
> sort(table(mytext), decreasing=T)[1:10]
mytext
the count for test words a be been can checking
4 2 2 2 2 1 1 1 1 1
How to calculate most frequent occurring terms/words in a document collection/corpus using R?
You can use
sorted.sums[sorted.sums > 5][1:4]
But if there are at least 4 values greater than 5, plain sorted.sums[1:4] works as well.
To get the words themselves, use names:
names(sorted.sums[sorted.sums > 5][1:4])
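For a self-contained illustration of that pattern (`sorted.sums` here is rebuilt from a made-up word vector; in the original question it came from summing the rows of a term-document matrix):

```r
# Made-up word vector standing in for a tokenized corpus
words <- c("apple", "banana", "apple", "banana", "apple", "cherry")
sorted.sums <- sort(table(words), decreasing = TRUE)

sorted.sums[sorted.sums > 2]         # terms occurring more than twice
names(sorted.sums[sorted.sums > 2])  # just the words: "apple"
```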
R: find most frequent group of words in corpus
If I remember correctly, you can construct a TermDocumentMatrix of bigrams (pairs of adjacent words) using RWeka, and then process it as needed
library("tm") #text mining
library("RWeka") # for tokenization algorithms more complicated than single-word
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
# process tdm
# findFreqTerms(tdm, lowfreq=3, highfreq=Inf)
# ...
tdm <- removeSparseTerms(tdm, 0.99)
print("----")
print("tdm properties")
str(tdm)
tdm_top_N_percent = tdm$nrow / 100 * topN_percentage_wanted
Alternatively,
# n-grams from single words up to combinations of 5 adjacent words
wmin <- 1
wmax <- 5
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = wmin, max = wmax))
Sometimes it helps to perform word stemming first in order to get "better" word groups.
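If RWeka is not an option (it requires a Java installation), adjacent word pairs can also be counted in base R; a small sketch on a made-up sentence:

```r
text <- "the quick brown fox and the quick red fox"
w <- strsplit(tolower(text), "\\s+")[[1]]

# Pair each word with its successor to form bigrams
bigrams <- paste(head(w, -1), tail(w, -1))
sort(table(bigrams), decreasing = TRUE)
# "the quick" occurs twice; all other pairs once
```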
how to get most common phrases or words in python or R
I would advise this if you plan to use R: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html