Extract English Words from a Text in R

You could use qdapDictionaries, a package I maintain (the parent package qdap does not need to be installed). The idea is to see where a known word list (see ?GradyAugmented) intersects with your words. If your data is more complex, you may first need to normalize it with tools like tolower. Here are two very similar approaches; the first is likely slightly faster, depending on the data:

vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")

library(qdapDictionaries)
vector[vector %in% GradyAugmented]

## [1] "picture" "carpet"  "lamp"

intersect(vector, GradyAugmented)

## [1] "picture" "carpet"  "lamp"
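
For messier input, the tolower step mentioned above can be combined with a simple split into tokens. The following is only a sketch: extract_known_words and toy_dictionary are hypothetical names, and the small stand-in word list is used so the example runs without any package installed. In practice you would pass GradyAugmented as the dictionary.

```r
# Hypothetical helper: lower-case the text, split on anything that is not
# an ASCII letter, and keep only the tokens found in the supplied word list.
extract_known_words <- function(text, dictionary) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[nzchar(tokens)]  # drop empty tokens from leading separators
  tokens[tokens %in% dictionary]
}

# Small stand-in word list (an assumption for illustration only);
# with the package loaded you would pass qdapDictionaries::GradyAugmented.
toy_dictionary <- c("the", "picture", "carpet", "lamp", "is", "on")

result <- extract_known_words("The Picture is on the carpet, notaword!", toy_dictionary)
result
# returns: the, picture, is, on, the, carpet
```

Duplicates are kept because the filter uses %in%; wrap the result in unique() if each word should appear only once.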

As for the error you get when installing qdap, it sounds like @Ben Bolker is correct: you need a newer version of data.table (I'd suggest the latest); use packageVersion("data.table") to check what you have. Not requiring a minimum version of data.table was an oversight on my part. I thought setDT (a function in the data.table package) had always been available, but it appears to be missing from your version. To solve this particular problem, though, you don't need the parent qdap package at all, just qdapDictionaries.

How to split English letters, numbers and Chinese characters in R?

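To extract only the Chinese words, we could use str_extract with "[:alpha:]+". Note that [:alpha:] matches letters in any script, Latin included; it returns the Chinese text here only because str_extract takes the first run of letters, and in these strings that run is Chinese: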

library(stringr)

string <- c("123-321-中文.jpg", "001-123你好.png")

str_extract(string, "[:alpha:]+")

output:

[1] "中文" "你好"
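
If Latin letters can appear before the Chinese text, a script-specific class is likely more robust. This is a sketch, assuming stringr's default ICU regex engine, which supports Unicode script properties such as \p{Han}; the third input string is a hypothetical case added for illustration.

```r
library(stringr)

# The third string is a hypothetical case where Latin letters come first
string <- c("123-321-中文.jpg", "001-123你好.png", "abc-中文.jpg")

# \p{Han} is a Unicode script property, so only Han characters can match
han <- str_extract(string, "\\p{Han}+")
han
# returns: "中文" "你好" "中文"
```

With "[:alpha:]+" the third string would be expected to yield its leading Latin letters instead, which is why the script property is the safer choice for mixed input.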

Extract only words containing ASCII characters from vector of strings

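Extract the runs of ASCII letters with str_extract_all, then use sapply with paste to join the matches for each string:

library(stringr)

b <- str_extract_all(c('hello ringпрг', 'trust'), regex("[a-z]+", ignore_case = TRUE))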

sapply(b, paste, collapse = " ")

## [1] "hello ring" "trust"

stringr: extract words containing a specific word

You seem to want to remove all words containing WIFF and the trailing ; if there is any. Use

> library(stringr)
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:

(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
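
For comparison, the same filtering can be sketched in base R without a lookahead, by splitting on ; and keeping the pieces that contain WIFF. This is an alternative to the regex above rather than a restatement of it; keep_wiff is a hypothetical helper name.

```r
text <- c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF')

# Hypothetical helper: split one string on ";", keep the pieces that
# contain WIFF (case-insensitively), and join them back together.
keep_wiff <- function(s) {
  parts <- strsplit(s, ";", fixed = TRUE)[[1]]
  paste(parts[grepl("WIFF", parts, ignore.case = TRUE)], collapse = ";")
}

filtered <- vapply(text, keep_wiff, character(1), USE.NAMES = FALSE)
filtered
# returns: "WIFF200;WIFF12" "WIFF2;BIGWIFF"
```

This version is longer but may be easier to adapt if the keep/drop rule later becomes more involved than a single substring test.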

If for some reason you want to use str_extract, note that your regex could not work because \bWIFF\b matches only the whole word WIFF, and your data frame contains no such word. You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any word with WIFF inside (case insensitively), use str_extract_all to get multiple occurrences, and then join the matches into a single string:

> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12" 

[[2]]
[1] "WIFF2"   "BIGWIFF"

> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

You may "shrink" the code by placing str_extract_all inside the sapply call; I separated them for better visibility.
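
A minimal sketch of that shrunk form, reusing the same data and pattern:

```r
library(stringr)

df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))

# Extract all WIFF-containing words and join them per row in one step
df$text <- sapply(str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b"),
                  paste, collapse = ";")
df$text
# returns: "WIFF200;WIFF12" "WIFF2;BIGWIFF"
```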

regex: extract segments of a string containing a word, between symbols

With stringr, you can use a lookbehind to start the match just after a , or ; and extend it through keep:

library(stringr)
library(dplyr)

dataframe %>% 
   mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
  text               
  <chr>              
1 some words to keep 
2 other stuff to keep

Created on 2022-02-01 by the reprex package (v2.0.1)
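
The input data frame is not shown above, so here is a self-contained sketch with made-up strings (an assumption, chosen only so that the result matches the output shown). It uses plain assignment instead of mutate to avoid the dplyr dependency.

```r
library(stringr)

# Hypothetical input: these strings are assumptions, not the original data
dataframe <- data.frame(text = c("drop this; some words to keep, and more",
                                 "ignore, other stuff to keep; trailing"))

# (?<=[,;]) starts the match right after the first , or ;
# .*keep then extends it through the last "keep" in the string
dataframe$text <- trimws(str_extract(dataframe$text, "(?<=[,;]).*keep"))
dataframe$text
# returns: "some words to keep" "other stuff to keep"
```

Because .* is greedy and the lookbehind matches at the first separator, a string with several separators before the wanted segment would return more than that segment; a pattern such as "[^,;]*keep" may suit that case better.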

Extract letters from a string in R

You can try:

sub("^([[:alpha:]]*).*", "\\1", x)
[1] "AB"  "GF"  "ABC"
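
The vector x is not shown above; a runnable sketch with a hypothetical x (an assumption, picked to reproduce the output shown) looks like this:

```r
# Hypothetical input vector; the original x is not given
x <- c("AB123", "GF45x", "ABC-1")

# ^([[:alpha:]]*) captures the leading letters; .* consumes the rest,
# and the whole string is replaced by the captured group
letters_only <- sub("^([[:alpha:]]*).*", "\\1", x)
letters_only
# returns: "AB" "GF" "ABC"
```

Only the leading run of letters is kept, so the trailing x in "GF45x" is dropped.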

Extract string up to a different word in each row - R

Loop over the 'words' column and, for each word, use grep to get the first matching 'stringlist' value. Then use sub to capture everything up to and including the word and replace the whole string with the backreference (\\1) to that captured group:

df$new_words <- sapply(df$words, function(x)
  sub(sprintf("(.*%s).*", x), "\\1",
      grep(x, stringlist, value = TRUE)[1]))

-output

> df
   words                  new_words
1  apple      eukaryote;plant;apple
2  plant            eukaryote;plant
3 banana     eukaryote;plant;banana
4 animal           eukaryote;animal
5    fly       eukaryote;insect;fly
6  ecoli prokaryote;bacterium;ecoli

data

df <- structure(list(words = c("apple", "plant", "banana", "animal", 
"fly", "ecoli")), class = "data.frame", row.names = c(NA, -6L
))

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana", 
"eukaryote;animal;dog", 
"eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly", 
"prokaryote;bacterium;ecoli")
