Remove words in one column present in another column in R
One solution is to use mapply along with strsplit.
The trick is to split df$A into separate words, collapse those words with | as the separator, and use the result as the pattern in gsub, replacing each match with "".
lst <- strsplit(df$A, split = " ")
df$B <- mapply(function(x, y) gsub(paste0(x, collapse = "|"), "", df$B[y]), lst, seq_along(lst))
df
# A B
# 1 John Smith is a very highly smart guy
# 2 Red Shirt We tried the tea but didn't enjoy it at all
# 3 Family values are better is very important as it gives you
Another option is to build the pattern directly by replacing the spaces in df$A with |:
df$B <- mapply(function(x,y)gsub(x,"",y) ,gsub(" ", "|",df$A),df$B)
Data:
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy", "We tried the tea but didn't enjoy it at all", "Family is very important as it gives you values")
df <- data.frame(A, B, stringsAsFactors = FALSE)
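Both versions above leave stray spaces behind and would also match inside longer words. A minimal sketch of a variant that adds word boundaries and tidies the whitespace is shown here; the helper name remove_words and the column B_clean are made up for illustration, and the sketch assumes the words in A contain no regex metacharacters.

```r
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy",
       "We tried the tea but didn't enjoy it at all",
       "Family is very important as it gives you values")
df <- data.frame(A, B, stringsAsFactors = FALSE)

# Hypothetical helper: remove every whole word of `words` from `text`
remove_words <- function(text, words) {
  # "John Smith" becomes "\\b(John|Smith)\\b"
  pattern <- paste0("\\b(", gsub(" +", "|", trimws(words)), ")\\b")
  out <- gsub(pattern, "", text, perl = TRUE)
  # collapse repeated spaces and trim the ends
  trimws(gsub(" +", " ", out))
}

df$B_clean <- mapply(remove_words, df$B, df$A, USE.NAMES = FALSE)
df$B_clean
```

The result is kept in a separate column so the original B stays available for comparison.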
How can I remove parts of string based on other column in R?
Replace empty patterns with ^$ (a regex that matches only an empty string):
dt$ToRemove[dt$ToRemove == ''] <- '^$'
and then use stringr::str_remove, which is vectorised over both the string and the pattern.
dt$result <- stringr::str_remove(dt$SomeText, dt$ToRemove)
dt
# SomeText ToRemove result
#1 ABCDEF A BCDEF
#2 ABCDEF CDE ABF
#3 ABCDEF ^$ ABCDEF
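If stringr is not available, a base R sketch of the same idea is to pair each text with its own pattern through mapply and skip empty patterns explicitly. The helper name remove_first is an assumption for illustration, and fixed = TRUE means the pattern is treated as literal text rather than as a regex.

```r
dt <- data.frame(SomeText = rep("ABCDEF", 3),
                 ToRemove = c("A", "CDE", ""),
                 stringsAsFactors = FALSE)

# Hypothetical helper: drop the first literal occurrence of `pat`,
# leaving the text untouched when the pattern is empty
remove_first <- function(text, pat) {
  if (nzchar(pat)) sub(pat, "", text, fixed = TRUE) else text
}

dt$result <- mapply(remove_first, dt$SomeText, dt$ToRemove, USE.NAMES = FALSE)
dt$result
```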
Matching an exact word from a column to a string in another column and removing the matching word from the string in the other column
You can use str_extract and str_replace. I also added a scenario in row 4 where the column 1 value doesn't match anything in column 2. The final output is in column 3.
library(stringr)
library(dplyr)
col1 <- c("STELLARN714WPUR", "STELLARN714WRED", "STELLARN814WRED", "AB")
col2 <- c("STELLARN594WPUR,STELLARN714WPUR,STELLARN814WPUR", "STELLARN594WRED,STELLARN814WRED,STELLARN714WRED", "STELLARN594WRED,STELLARN714WRED,STELLARN814WRED", "STELLARN594WPUR,STELLARN714WPUR,STELLARN814WPUR")
df <- data.frame(column1 = col1, Column2 = col2, stringsAsFactors = FALSE)
df
column1 Column2
1 STELLARN714WPUR STELLARN594WPUR,STELLARN714WPUR,STELLARN814WPUR
2 STELLARN714WRED STELLARN594WRED,STELLARN814WRED,STELLARN714WRED
3 STELLARN814WRED STELLARN594WRED,STELLARN714WRED,STELLARN814WRED
4 AB STELLARN594WPUR,STELLARN714WPUR,STELLARN814WPUR
df %>%
  mutate(match_val = str_extract(Column2, column1),
         # remove the match together with its trailing or leading comma
         Column3 = ifelse(is.na(match_val), Column2,
                          str_replace(Column2, paste0(match_val, ",|,", match_val), "")))
column1 Column2 match_val
1 STELLARN714WPUR STELLARN594WPUR,STELLARN714WPUR,STELLARN814WPUR STELLARN714WPUR
2 STELLARN714WRED STELLARN594WRED,STELLARN814WRED,STELLARN714WRED STELLARN714WRED
3 STELLARN814WRED STELLARN594WRED,STELLARN714WRED,STELLARN814WRED STELLARN814WRED
4 AB STELLARN594WPUR,STELLARN714WPUR,STELLARN814WPUR <NA>
Column3
1 STELLARN594WPUR,STELLARN814WPUR
2 STELLARN594WRED,STELLARN814WRED
3 STELLARN594WRED,STELLARN714WRED
4 STELLARN594WPUR,STELLARN714WPUR,STELLARN814WPUR
How to remove all unspecified words from dataframe column in R
Here is another base R option:
df$bio <- sapply(lapply(strsplit(df$bio, "\\s"), intersect, termstokeep),
paste, collapse = " ")
Output
account bio
1 38374 love much life proud liberal
2 45673 can just get
3 94928 conserv christian mom proud
4 11204 women
5 37465 former state coach now proud
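One caveat worth knowing: intersect treats its inputs as sets, so a word that occurs twice in a bio is kept only once. A minimal sketch on made-up data (bio and keep below are illustrative, not the question's data) compares it with %in% filtering, which preserves repeats:

```r
bio  <- "proud mom and proud coach"
keep <- c("coach", "proud", "mom")

words <- strsplit(bio, "\\s")[[1]]

# Set semantics: the second "proud" is dropped
via_intersect <- paste(intersect(words, keep), collapse = " ")

# Logical filtering: every kept word survives, in its original position
via_in <- paste(words[words %in% keep], collapse = " ")

via_intersect
via_in
```

Which behaviour is preferable depends on whether word counts matter downstream, for example in a term-frequency analysis.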
Data (thanks @RuiBarradas!)
df <- structure(list(account = c(38374L, 45673L, 94928L, 11204L, 37465L
), bio = c("i love candy as much as life itself proud liberal",
"can all just get along", "conserv christian mom and proud pro trump veteran maga",
"professor of women and gender studies at wesleyan university blacklivesmatter",
"former ohio state football coach now a proud papa to seven grandchildren"
)), class = "data.frame", row.names = c(NA, -5L))
termstokeep <- c("love", "life", "follow", "live", "just", "like", "music",
"regist", "trademark", "make", "fan", "one", "copyright", "lover",
"thing", "world", "time", "god", "can", "get", "design", "peopl",
"artist", "girl", "univers", "writer", "will", "student", "work",
"busi", "good", "new", "know", "friend", "famili", "best", "day",
"account", "market", "sport", "art", "game", "manag", "want",
"book", "enthusiast", "person", "alway", "travel", "never", "free",
"real", "help", "dream", "servic", "mom", "husband", "profession",
"beauti", "offici", "wife", "now", "news", "social", "food",
"come", "father", "heart", "educ", "develop", "need", "anim",
"everyth", "proud", "tri", "year", "happi", "also", "media",
"way", "man", "team", "produc", "look", "state", "take", "back",
"support", "director", "home", "find", "call", "engin", "learn",
"provid", "photograph", "great", "author", "video", "guy", "communiti",
"coach", "name", "big", "passion", "see", "teacher", "school",
"product", "sinc", "gamer", "enjoy", "keep", "player", "better",
"let", "believ", "mother", "think", "mind", "dog", "futur", "give",
"colleg", "say", "owner", "jesus", "fun", "got", "littl", "chang",
"founder", "boy", "use", "first", "liberal", "write", "footbal",
"kid", "fuck", "event", "polit", "consult", "care", "conserv",
"much", "health", "technolog", "tech", "opinion", "stay", "everi",
"right", "full", "former", "member", "special", "well", "young",
"high", "creat", "snap", "entrepreneur", "movi", "feel", "view",
"compani", "coffe", "cat", "citi", "human", "digit", "show",
"singer", "sometim", "interest", "dad", "watch", "scienc", "creativ",
"blogger", "base", "addict", "fit", "read", "bless", "fashion",
"part", "noth", "run", "forev", "editor", "born", "hard", "die",
"around", "onlin", "nerd", "class", "web", "musician", "made",
"stuff", "leader", "ever", "inspir", "still", "christian", "place",
"current", "public", "danc", "pleas", "geek", "talk", "film",
"realli", "babi", "someth", "page", "rock", "lot", "women", "lead",
"two")
R gsub remove words in column y from words in column x
Normally gsub takes three arguments: 1) the pattern, 2) the replacement and 3) the vector in which to replace values.
The pattern must be a single string, and so must the replacement. The only argument that accepts multiple values is the vector, which is why we say gsub is vectorized over it.
gsub(df$x, "", df$y) # doesn't do what we want: only the first element of df$x is used (with a warning)
The pattern argument is not vectorized, but we can use mapply to complete the task.
mapply and gsub (bffs)
x = c("a","b","c")
y = c("asometext", "some, a b text", "c a text")
repl = ""
#We do
mapply(gsub, x, repl, y)
#On the inside
gsub(x[[1]], repl[[1]], y[[1]])
gsub(x[[2]], repl[[2]], y[[2]])
gsub(x[[3]], repl[[3]], y[[3]])
You may be asking: I only have one repl, so how do repl[[2]] and repl[[3]] work? mapply noticed that for us and recycled repl until its length equaled that of the others.
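As a runnable check of the recycling described above, the sketch below passes the single replacement "" directly and sets USE.NAMES = FALSE so the result comes back as a plain character vector (by default mapply would name the elements after x). The variable name res is only for illustration.

```r
x <- c("a", "b", "c")
y <- c("asometext", "some, a b text", "c a text")

# Each pattern in x is applied to the matching element of y;
# the length-one replacement "" is recycled for all three calls
res <- mapply(gsub, x, "", y, USE.NAMES = FALSE)
res
```

Note that removing a word this way leaves the surrounding spaces in place, so "some, a b text" ends up with two consecutive spaces.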
Remove certain words in string from column in dataframe in R
We can use the 'tm' package:
library(tm)
stopwords <- readLines('stopwords.txt') # Your stop words file
x <- df$company # Company column data
x <- removeWords(x, stopwords) # Remove stopwords
df$company_new <- x # Add the cleaned vector as a new column and check
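If tm is not available, a rough base R sketch of the same idea is to join the stop words into one alternation wrapped in word boundaries. The stop words and company names below are made up for illustration, and the sketch assumes the words contain no regex metacharacters.

```r
stop_words <- c("inc", "ltd", "the")
company <- c("the acme inc", "globex ltd", "initech")

# "\\b(inc|ltd|the)\\b" matches each stop word only as a whole word
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

cleaned <- gsub(pattern, "", company, perl = TRUE)
# collapse the whitespace left behind and trim the ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```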
Delete a Value if it is not contained in another Column, in R
Here are a few ways:
Data$To[!Data$To %in% Data$From] <- NA
Data$To <- ifelse(Data$To %in% Data$From, Data$To, NA)
Data <- within(Data, To[!To %in% From] <- NA)
Data <- transform(Data, To = ifelse(To %in% From, To, NA))
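To see the effect on a small example, the sketch below uses a made-up Data frame with character columns (the values are assumptions for illustration) and applies the first variant.

```r
Data <- data.frame(From = c("a", "b", "c"),
                   To   = c("b", "x", "a"),
                   stringsAsFactors = FALSE)

# TRUE where the To value also appears somewhere in From
keep <- Data$To %in% Data$From

# Blank out the values that have no counterpart in From
Data$To[!keep] <- NA
Data$To
```

If To is a factor rather than a character column, the ifelse variants return its integer codes, so the indexing form is the safer choice there.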