List of Word Frequencies Using R

Extract total frequency of words from vector in R

posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players.  they have private message boards where it appears most of their work goes on.  i would bet they are posting more there than in jita speakers corner.  i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold.  its sort of like ccp used to post here on the forums then they stopped.  so they got a csm to represent players and use jita park forum to interact.  now the csm no longer posts there as they have their internal forums where they hash things out.  perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
posts <- gsub("[[:punct:]]", '', posts) # remove punctuations
posts <- gsub("[[:digit:]]", '', posts) # remove numbers
word_counts <- as.data.frame(table(unlist( strsplit(posts, "\ ") ))) # split vector by space
word_counts <- with(word_counts, word_counts[ Var1 != "", ] ) # remove empty characters
head(word_counts)
#       Var1 Freq
# 2        a    8
# 3    about    3
# 4   allows    1
# 5 although    1
# 6       am    1
# 7       an    1
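
If you want to see the most frequent words first, a quick follow-up is to order the data frame by Freq (a minimal sketch using the word_counts object built above):

# sort by descending frequency and look at the top words
head(word_counts[order(-word_counts$Freq), ], 10)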

Counting Specific Word Frequency In R

You can group_by the word column, count the occurrences of each unique word, and then subset the ones you want.

library(tidyverse)
data <- data.frame(word = c("rna",
"synthesis",
"resembles",
"copy",
"choice",
"rna",
"recombination",
"process",
"nascent",
"rna"))

counts <- data %>%
group_by(word) %>%
count()

counts[which(counts$word == "rna"),]

# A tibble: 1 x 2
# Groups:   word [1]
#   word      n
#   <fct> <int>
# 1 rna       3

or using dplyr subsetting:

counts %>% filter(word == "rna")
# A tibble: 1 x 2
# Groups:   word [1]
#   word      n
#   <fct> <int>
# 1 rna       3

Piping it all through at once:

data %>%
  group_by(word) %>%
  count() %>%
  filter(word == "rna")
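
As a side note, dplyr's count() accepts the grouping variable directly, so the explicit group_by() step can be folded in; a small variation on the chain above:

# count() groups by word and tallies in one step
data %>%
  count(word) %>%
  filter(word == "rna")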

A one-liner data.table solution:

library(data.table)
setDT(data)
data[word == "rna", .N, by = word]

#    word N
# 1:  rna 3
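
If you want the frequencies of every word rather than a single one, the same data.table idiom works without the subset; a small sketch on the same data object:

# tally every word, most frequent first
data[, .N, by = word][order(-N)]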

Text mining - word frequency from a single column containing list

If all you need is a frequency count, you can do it without external packages; base R has the table function.

# split the deparsed tag vectors on the leading c(, quotes, commas and the closing parenthesis
sp <- unlist(strsplit(as.character(unlist(tags_df$tags)), '^c\\(|,|"|\\)'))
# keep only the pieces that contain actual text
inx <- sapply(sp, function(y) nchar(trimws(y)) > 0 & !is.na(y))
table(sp[inx])
#     Android        CSS3      Design      Hiring  JavaScript      NextJS
#           1           1           1           1           4           1
#      NodeJS programming Programming     ReactJS     Testing          UI
#           1           1           3           3           1           1
#          UX   WebDesign      webdev      WebDev
#           1           2           1           4

EDIT.

I have just realized that you have "programming" and "Programming", and "webdev" and "WebDev", as tags, so maybe you want a case-insensitive count. If that is the case, try instead

table(tolower(sp[inx]))
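
For reference, here is a self-contained sketch of the kind of input the code above assumes: a hypothetical tags_df whose tags column holds deparsed character vectors, which is why the regex strips the leading c(, the quotes, the commas, and the closing parenthesis.

# hypothetical input in the assumed format
tags_df <- data.frame(
  tags = c('c("JavaScript", "ReactJS")',
           'c("WebDev", "JavaScript", "CSS3")'),
  stringsAsFactors = FALSE
)

Running the split and table() lines above on this object should give a count of 2 for JavaScript and 1 for each of the other tags.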

Counting overall word frequency when each sentence is a separate row in a dataframe

You can just use table() on the unlisted strsplit() of your column:

table(unlist(strsplit(df$Words, " ")))

#      Luke     Luker       Sky Skywalker     Syker      Walk
#         3         1         1         1         1         2

and if you need it sorted

sort(table(unlist(strsplit(df$Words, " "))), decreasing = TRUE)

#      Luke      Walk     Luker       Sky Skywalker     Syker
#         3         2         1         1         1         1

where df$Words is your column of interest.
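
For a self-contained run, a hypothetical df along these lines reproduces the counts shown above:

# hypothetical input; each row holds one sentence
df <- data.frame(
  Words = c("Luke Skywalker", "Luke Walk Syker", "Luke Luker Sky Walk"),
  stringsAsFactors = FALSE
)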

Frequency of each word in a set of strings

Pipes do the job.

df <- data.frame(column_x = c("hello world", "hello morning hello", "bye bye world"),
                 stringsAsFactors = FALSE)

require(dplyr)

df$column_x %>%
  na.omit() %>%
  tolower() %>%
  strsplit(split = " ") %>% # or strsplit(split = "\\W")
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)
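
With the df defined above, the result should come out roughly as follows (ties keep table()'s alphabetical order):

#   hello     bye   world morning
#       3       2       2       1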

Grouping word frequency

I think you can do this with a simple dplyr call. For example

library(dplyr)
dd %>%
  group_by(Word) %>%
  summarize(Count = n_distinct(ID))

#   Word  Count
#   <fct> <int>
# 1 cat       3
# 2 dog       1
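
For context, a hypothetical dd that would produce that output could look like this, with Count being the number of distinct IDs each word appears under:

# hypothetical input: one row per (ID, Word) mention
dd <- data.frame(ID   = c(1, 1, 2, 3, 4),
                 Word = c("cat", "cat", "cat", "cat", "dog"))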

Count word frequencies in list-of-lists-of-words

OK, here is one way to do it; tell me how it works for you.

Using this data:

O <- structure(list(text.1 = list(character(0), c("access", "access", 
"access", "access")), text.2 = list(character(0), c("report",
"access", "access", "access")), text.3 = list(character(0), c("access",
"access", "access", "access")), text.4 = list(character(0), c("access",
"access", "access", "access")), text.5 = list(character(0), "access"),
text.6 = list(character(0), character(0)), text.7 = list(
character(0), c("report", "report", "access", "access",
"report", "report", "report", "report", "report", "report",
"data", "data", "report", "access", "report", "report"
)), text.8 = list(character(0), c("report", "access",
"access")), text.9 = list(character(0), "report"), text.10 = list(
NULL, c("report", "access", "access", "access", "report",
"access"))), .Names = c("text.1", "text.2", "text.3",
"text.4", "text.5", "text.6", "text.7", "text.8", "text.9", "text.10"
))

Since the words always seem to be in the second element of each text.x list, we'll take those words and put them in a new list. We'll also turn them into factors with fixed levels so we can tabulate them into a data frame later on.

newlist <- list()

# collect the word vectors, coercing each to a factor with fixed levels
for (item in O) {
  newlist[[length(newlist) + 1]] <- factor(item[[2]], levels = c("access", "data", "report"))
}

# tabulate each factor; keep only the Freq columns and transpose
dd <- data.frame(lapply(newlist, table))
dd <- t(as.matrix(dd[, c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)]))

rownames(dd) <- paste0("Text.",1:10)
colnames(dd) <- c("access", "data", "report")

dd

#         access data report
# Text.1       4    0      0
# Text.2       3    0      1
# Text.3       4    0      0
# Text.4       4    0      0
# Text.5       1    0      0
# Text.6       0    0      0
# Text.7       3    2     11
# Text.8       2    0      1
# Text.9       0    0      1
# Text.10      4    0      2
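
A more compact way to get the same counts, skipping the loop and the even-column indexing, is to tabulate the second element of each list directly; a sketch on the same O object:

# tabulate each word vector against fixed levels, then transpose
lv <- c("access", "data", "report")
dd2 <- t(sapply(O, function(item) table(factor(item[[2]], levels = lv))))
dd2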

Word frequency per document in R

library(dplyr)
library(tidyr)
library(stringi)

word__date <-
  tibble(
    comments = c("i want to hear that", "lets get started", "i want to get started"),
    date = c("2010-11-01", "2008-03-25", "2007-03-14") %>% as.Date) %>%
  mutate(word = comments %>% stri_split_fixed(pattern = " ")) %>%
  unnest(word) %>%
  group_by(word, date) %>%
  summarize(count = n())

word <-
  word__date %>%
  summarize(count = sum(count))
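
To look at the per-word totals, the result can then be sorted, for example:

# most frequent words first
word %>% arrange(desc(count))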

