counting occurrence of particular letter in vector of words in r
Another posibility:
myvec <- c("A", "KILLS", "PASS", "JUMP", "BANANA", "AALU", "KPAL")
sapply(gregexpr("A", myvec, fixed = TRUE), function(x) sum(x > -1))
## [1] 1 0 1 0 3 2 1
EDIT This was begging for a benchmark:
library(stringr); library(stringi); library(microbenchmark); library(qdapDictionaries)
myvec <- toupper(GradyAugmented)
GREGEXPR <- function() sapply(gregexpr("A", myvec, fixed = TRUE), function(x) sum(x > -1))
GSUB <- function() nchar(gsub("[^A]", "", myvec))
STRSPLIT <- function() sapply(strsplit(myvec,""), function(x) sum(x=='A'))
STRINGR <- function() str_count(myvec, "A")
STRINGI <- function() stri_count(myvec, fixed="A")
VAPPLY_STRSPLIT <- function() vapply(strsplit(myvec,""), function(x) sum(x=='A'), integer(1))
(op <- microbenchmark(
GREGEXPR(),
GSUB(),
STRINGI(),
STRINGR(),
STRSPLIT(),
VAPPLY_STRSPLIT(),
times=50L))
## Unit: milliseconds
## expr min lq mean median uq max neval
## GREGEXPR() 477.278895 631.009023 688.845407 705.878827 745.73596 906.83006 50
## GSUB() 197.127403 202.313022 209.485179 205.538073 208.90271 270.19368 50
## STRINGI() 7.854174 8.354631 8.944488 8.663362 9.32927 11.19397 50
## STRINGR() 618.161777 679.103777 797.905086 787.554886 906.48192 1115.59032 50
## STRSPLIT() 244.721701 273.979330 331.281478 294.944321 348.07895 516.47833 50
## VAPPLY_STRSPLIT() 184.042451 206.049820 253.430502 219.107882 251.80117 595.02417 50
boxplot(op)
And stringi whooping some major tail. The vapply
+ strsplit
was a nice approach as was the simple gsub
approach. Interesting results for sure.
How to calculate the number of occurrence of a given character in each row of a column of strings?
The stringr package provides the str_count
function which seems to do what you're interested in
# Load your example data
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = F)
library(stringr)
# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
count occurrences among a set of words in r
We can use str_count
after paste
ing the vector
of 'words'
library(stringr)
df1$Scores <- str_count(df1$Col1, paste(words, collapse="|"))
df1$Scores
#[1] 3 3 3 2 0
Or another option is gregexpr
from base R
res <- gregexpr(paste0(words, collapse="|"), df1$Col1)
df1$Scores <- lengths(res) * !sapply(res, function(x) -1 %in% x)
data
words <- c("Mon", "Tues", "Wed")
df1 <- structure(list(Col1 = c("Mon,Tues,Wed,Thurs,Fri", "Mon,Tues,Wed,Thurs",
"Mon,Tues,Wed", "Mon,Tues", "Thurs")), .Names = "Col1",
class = "data.frame", row.names = c(NA,
-5L))
count the number of occurrences of ( in a string
(
is a special character. You need to escape it:
str_count(s,"\\(")
# [1] 3
Alternatively, given that you're using stringr
, you can use the coll
function:
str_count(s,coll("("))
# [1] 3
Count word occurrences in R
Let's for the moment assume you wanted the number of element containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)
Count occurrences of specific words from a dataframe row in R
I Assume this is what you require
Sample data
id <- c(1:4)
text <- c('I have a Dataset with 2 columns a',
'nd multiple rows. first column ID', 'second column the text which',
'n the text which belongs to it.')
dataset <- data.frame(id,text)
Function to find count
library(stringr)
getCount <- function(data,keyword)
{
wcount <- str_count(dataset$text, keyword)
return(data.frame(data,wcount))
}
Calling getCount should give the updated dataset
> getCount(dataset,'second')
id text wcount
1 I have a Dataset with 2 columns a 0
2 nd multiple rows. first column ID 0
3 second column the text which 1
4 n the text which belongs to it. 0
Count the number of all words in a string
You can use strsplit
and sapply
functions
sapply(strsplit(str1, " "), length)
Count occurrences of words in a string according to a category in R
Here is a base R method to get the count across types.
dataset$wcnt <- rowSums(sapply(c("dog|wolf", "cat|lion"),
function(x) grepl(x, dataset$text)))
Here, sapply
runs through the regular expressions of each type and feeds it to grepl
. This returns a matrix, where the columns are logical vectors indicating if a particular type (eg, "dog|wolf") was found. rowSums
sums the logicals along the rows to get the type variety count.
This returns
dataset
id text wcnt
1 1 saw a cat 1
2 2 found a dog 1
3 3 saw a cat by a dog 2
4 4 There was a lion 1
5 5 Huge wolf 1
If you want the intermediary step, returning logical vectors as variables in your data.frame, you would probably want to set your values up in a named vector and then do cbind
with the result.
# construct named vector
myTypes <- c("canine"="dog|wolf", "feline"="cat|lion")
# cbind sapply results of logicals to original data.frame
dataset <- cbind(dataset, sapply(myTypes, function(x) grepl(x, dataset$text)))
This returns
dataset
id text canine feline
1 1 saw a cat FALSE TRUE
2 2 found a dog TRUE FALSE
3 3 saw a cat by a dog TRUE TRUE
4 4 There was a lion FALSE TRUE
5 5 Huge wolf TRUE FALSE
Related Topics
How to Rbind All the Data.Frames in Your Working Environment
Web Scraping of Key Stats in Yahoo! Finance with R
Parsing Iso8601 Date and Time Format in R
Q-Q Plot with Ggplot2::Stat_Qq, Colours, Single Group
Can't Read an .Rdata Fileinput
Group Rows in Data Frame Based on Time Difference Between Consecutive Rows
Lm(): What Is Qraux Returned by Qr Decomposition in Linpack/Lapack
R Shiny Ggplot Bar and Line Charts with Dynamic Variable Selection and Y Axis to Be Percentages
Filling Under the a Curve with Ggplot Graphs
Download Plotly Using Downloadhandler
Extract Hyperlink from Excel File in R
Row-Wise Sum of Values Grouped by Columns with Same Name
Back-To-Back Barplot with Independent Axes R
Are Eigenvectors Returned by R Function Eigen() Wrong
X^(1/3)' Behaves Differently for Negative Scalar 'X' and Vector 'X' with Negative Values