Filtering Observations in Dplyr in Combination with Grepl

I didn't understand your second regex, but this more basic regex seems to do the trick:

df1 %>% filter(!grepl("^x|xx$", fruit))
###
    fruit group
1   apple     A
2  orange     B
3 banxana     A
4 appxxle     B

And I assume you know this, but you don't have to use dplyr here at all:

df1[!grepl("^x|xx$", df1$fruit), ]
###
    fruit group
1   apple     A
2  orange     B
7 banxana     A
8 appxxle     B

The regex looks for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and end of the string, respectively, and | is the OR operator. We negate the result of grepl() with !, so we keep the strings that don't match the pattern.
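
To see the anchors in action, here is a quick check on a few toy strings (made up for illustration, not the OP's data):

x <- c("xapple", "apple", "applexx", "applex")
grepl("^x|xx$", x)   # TRUE FALSE  TRUE FALSE
!grepl("^x|xx$", x)  # FALSE  TRUE FALSE  TRUE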

using grepl * to return NA values in R dplyr

I ended up creating a custom function to do this:

# Keep rows whose `var` column matches the regex `reg`;
# the default pattern "*" is meant to keep every row, including NAs
greplna <- function(data, reg = "*", var = "Discount code") {
  if (reg == "*") {
    tmp <- grepl("*", as.list(data[var])[[1]]) | is.na(as.list(data[var])[[1]])
  } else {
    tmp <- grepl(reg, as.list(data[var])[[1]])
  }
  return(tmp)
}
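
Since data[var] is a one-column data frame, as.list(data[var])[[1]] is just the column vector, so a slightly more compact equivalent could look like this (a sketch under that assumption, with a hypothetical name, not the original answer's code):

greplna2 <- function(data, reg = "*", var = "Discount code") {
  x <- data[[var]]  # extract the column as a plain vector
  if (reg == "*") grepl("*", x) | is.na(x) else grepl(reg, x)
}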

You can then use this in a dplyr statement:

df %>% filter(greplna(., search, "Discount code"))

But don't use it after group_by(): the . passes the whole dataset to the function, not the grouped subsets.

Filtering according to combination of matching data across variables in R

You can filter by the Group column after this; a short example follows the output below.

df <- as.data.frame(df)

# build a key that is identical for both orderings of a word pair
# by sorting Word1/Word2 within each row before pasting them together
df$v <- sapply(seq(df[, 1]), function(x)
  paste(sort(c(df[x, 1], df[x, 2])), collapse = ""))

# assign one Group label per unique key, merge it back, and drop the key
l <- data.frame(v = unique(df$v),
                Group = paste0("Group", seq(unique(df$v))))
df <- merge(df, l, by = "v")[, -1]

df

  Word1 Word2 distance speaker session  Group
1 WordA WordX     1.40      JB       1 Group1
2 WordX WordA     0.23      JB       1 Group1
3 WordB WordY     2.10      JB       1 Group2
4 WordY WordB     2.30      JB       1 Group2
5 WordC WordZ     4.70      JB       1 Group3
6 WordZ WordC     0.51      JB       1 Group3
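
From here, filtering on the new Group column is straightforward, e.g. (an illustrative follow-up, not part of the original answer):

subset(df, Group == "Group1")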

Conditional filtering using grepl and relative row position in group

Try this (take the last Journal_value per Journal_ref/Journal_type combination, then drop a 'Rev' row whenever its Journal_ref also has another Journal_type):

library(dplyr)

Dataset %>%
  group_by(Journal_ref, Journal_type) %>%
  summarise(Journal_value = last(Journal_value)) %>%
  ungroup() %>%
  group_by(Journal_ref) %>%
  filter(!(n() > 1 & Journal_type == "Rev"))

Output:

  Journal_ref Journal_type Journal_value
  <fct>       <fct>                <dbl>
1 1111        Adj                     90
2 2222        Adj                  12000
3 3333        Rev                    500
4 4444        Adj                   2500
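
For reference, the summarise(last(...)) step can also be written with slice_tail(), assuming dplyr >= 1.0.0 and that Dataset has no other columns to worry about (a sketch, not the answer above):

Dataset %>%
  group_by(Journal_ref, Journal_type) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  group_by(Journal_ref) %>%
  filter(!(n() > 1 & Journal_type == "Rev")) %>%
  ungroup()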

filtering some strings but some of them not! with grepl

The OP has requested to filter out 'AxxBy' strings but wants to keep strings like 'AxxByy' (where 'x' and 'y' denote digits).

Often it is easier to specify what to keep than what to remove. To keep strings which obey the pattern 'AxxByy', the regular expression

"^A\\d{2}B\\d{2}$"

can be used, where ^ denotes the beginning of the string, \\d{2} a sequence of exactly two digits, and $ the end of the string. A and B stand for themselves.

With this regular expression, dplyr, and grepl() can be used to filter the input data frame DF:

library(dplyr)
#which rows are kept?
kept <- DF %>%
  filter(grepl("^A\\d{2}B\\d{2}$", pair))
kept
# pair
#1 A10B33
#2 A11B44

# which rows are removed?
removed <- DF %>%
  filter(!grepl("^A\\d{2}B\\d{2}$", pair))
removed
# pair
#1 A1B2
#2 A2B3
#3 A3B4
#4 A4B22
#5 AB
#6 A
#7 B
#8 A1
#9 A12
#10 B1
#11 B12
#12 AA12B34
#13 A12BB34

Note that I've added some edge cases for demonstration.


BTW: dplyr is not required if only the vector pair needs to be filtered. So, in base R the alternative expressions

pair[grepl("^A\\d{2}B\\d{2}$", pair)]
grep("^A\\d{2}B\\d{2}$", pair, value = TRUE)

both return the strings to keep:

[1] "A10B33" "A11B44"

while

pair[!grepl("^A\\d{2}B\\d{2}$", pair)]

returns the removed strings:

 [1] "A1B2"    "A2B3"    "A3B4"    "A4B22"   "AB"      "A"       "B"       "A1"     
[9] "A12" "B1" "B12" "AA12B34" "A12BB34"

Data

As given by the OP but with some edge cases appended:

# create vector of test patterns using paste0() instead of paste(..., sep = "")
pair <- paste0("A", c(1:4, 10, 11), "B", c(2, 3, 4, 22, 33, 44))
# alternatively, use sprintf()
pair <- sprintf("A%iB%i", c(1:4, 10, 11), c(2, 3, 4, 22, 33, 44))
# add some edge cases
pair <- append(pair, c("AB", "A", "B", "A1", "A12", "B1", "B12", "AA12B34", "A12BB34"))
# create data frame
DF <- data.frame(pair)
DF
# pair
#1 A1B2
#2 A2B3
#3 A3B4
#4 A4B22
#5 A10B33
#6 A11B44
#7 AB
#8 A
#9 B
#10 A1
#11 A12
#12 B1
#13 B12
#14 AA12B34
#15 A12BB34

Filtering multiple string columns based on 2 different criteria - questions about grepl and starts_with

We can use filter with a rowwise grouping, where we loop over the columns using c_across, specifying the column-name match with the select helper starts_with. We get a logical output with grepl, checking for either "C18" or (|) a value that starts with (^) 153, and wrap it in any() so a row is kept when at least one of those columns matches.

library(dplyr) # 1.0.0
library(stringr)
df %>%
  # // do a rowwise grouping
  rowwise() %>%
  # // subset the columns that start with 'DGN' within c_across
  # // apply the grepl condition on the subset
  # // wrap with any() so a row is kept if any selected column meets the condition
  filter(any(grepl("C18|^153", c_across(starts_with("DGN")))))

Or with filter_at:

df %>%
  # // apply any_vars along with grepl in filter_at
  filter_at(vars(starts_with("DGN")), any_vars(grepl("C18|^153", .)))
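
filter_at() has since been superseded; assuming dplyr >= 1.0.4, the same filter can be written with if_any() (my addition, not part of the original answer):

df %>%
  filter(if_any(starts_with("DGN"), ~ grepl("C18|^153", .)))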

data

df <- data.frame(ID = 1:3,
                 DGN1 = c("2_C18", 32, "1532"),
                 DGN2 = c("24", "C18_2", "23"))

Find a specific string with grepl across all columns in R dplyr

Use if_any to match a row if any of the columns (i.e., at least one among all) matches the pattern. With if_all, every column would have to match the pattern.

library(dplyr)
library(ggplot2) # provides the mpg dataset

mpg |>
  filter(if_any(.cols = everything(), ~ grepl("audi", .)))
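
For contrast, a sketch of the if_all() form (with a pattern like "audi" it will typically return zero rows, because every single column would have to contain the match):

mpg |>
  filter(if_all(.cols = everything(), ~ grepl("audi", .)))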

dplyr slice ifelse grepl filter in r: unexpected outcome

After grouping by 'ID', keep every row of a group whose 'Commnets' are either all 'Audited' or (|) all 'Unaudited'; for mixed groups, return only the first 'Audited' row.

library(dplyr)
df %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(ID, Commnets, desc(Date)) %>%
  group_by(ID = trimws(ID)) %>%
  mutate(flag = all(grepl('\\bAudited', Commnets)) |
                all(grepl('\\bUnaudited', Commnets))) %>%
  filter(flag | (!flag & grepl('\\bAudited', Commnets))) %>%
  filter(if(all(!flag)) row_number() == 1 else TRUE) %>%
  ungroup %>%
  select(-flag)
# A tibble: 7 x 4
#   ID    rating Commnets     Date
#   <chr> <chr>  <chr>        <date>
# 1 H2    D      Audited      2018-11-10
# 2 H3    C+     Unaudited    2018-10-02
# 3 H1    C      Audited      2018-12-10
# 4 H2    C      Audited      2018-11-10
# 5 H3    C+     Unaudited    2018-10-02
# 6 H3    C      Unaudited Co 2018-10-10
# 7 H4    C      Audited      2020-09-03

Or, if we wanted to keep all the 'Audited' rows, just remove the second filter:

df %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(ID, Commnets, desc(Date)) %>%
  group_by(ID = trimws(ID)) %>%
  mutate(flag = all(grepl('\\bAudited', Commnets)) |
                all(grepl('\\bUnaudited', Commnets))) %>%
  filter(flag | (!flag & grepl('\\bAudited', Commnets))) %>%
  ungroup %>%
  select(-flag)
# A tibble: 8 x 4
#   ID    rating Commnets     Date
#   <chr> <chr>  <chr>        <date>
# 1 H2    D      Audited      2018-11-10
# 2 H3    C+     Unaudited    2018-10-02
# 3 H1    C      Audited      2018-12-10
# 4 H1    C      Audited Co   2018-12-10
# 5 H2    C      Audited      2018-11-10
# 6 H3    C+     Unaudited    2018-10-02
# 7 H3    C      Unaudited Co 2018-10-10
# 8 H4    C      Audited      2020-09-03

