Select Columns Based on Multiple Strings with dplyr contains()

select columns based on multiple strings with dplyr contains()

You can use matches(), which takes a regular expression, so several alternatives can be combined with |:

library(dplyr)

mtcars %>%
  select(matches('m|ar')) %>%
  head(2)
#               mpg am gear carb
# Mazda RX4      21  1    4    4
# Mazda RX4 Wag  21  1    4    4

According to the ?select documentation:

‘matches(x, ignore.case = TRUE)’: selects all variables whose
name matches the regular expression ‘x’
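
The ignore.case argument in that signature is easy to overlook. As a small sketch on the built-in iris data (chosen here only for illustration), a lower-case pattern still finds the Sepal columns by default, while case-sensitive matching finds none:

```r
library(dplyr)

# Default ignore.case = TRUE: the lower-case pattern still matches "Sepal.*"
insensitive <- iris %>% select(matches("sepal"))
names(insensitive)

# Case-sensitive matching: no column name contains a lower-case "sepal"
sensitive <- iris %>% select(matches("sepal", ignore.case = FALSE))
ncol(sensitive)
```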

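contains(), by contrast, matches a literal string rather than a regular expression, so 'm|ar' would not act as an alternation there. With a single string it works as expected:

mtcars %>%
  select(contains('m'))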
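
In recent versions of tidyselect (1.0.0 and later, to my knowledge), contains() also accepts a character vector and keeps every column that matches any of the strings. A minimal sketch comparing it with the matches() form, assuming such a version is installed:

```r
library(dplyr)

# A character vector of literal substrings; a column matching either one is kept
by_contains <- mtcars %>% select(contains(c("m", "ar")))

# The regular-expression form, for comparison
by_matches <- mtcars %>% select(matches("m|ar"))

# Both approaches should pick the same set of columns
setequal(names(by_contains), names(by_matches))
```

The vector form avoids having to escape regex metacharacters when the strings themselves contain characters such as . or |.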

Dplyr select based on multiple strings in a column

To select variables whose names contain both a and c, in either order, we could do:

library(dplyr)

df %>%
  select(matches("(a.*c)|(c.*a)"))
#   a_b_c c_b_a
# 1     1     1
# 2     2     2
# 3     3     3
# 4     4     4

Note that var a_a_e is not selected because it doesn't contain c, and var c_f_g is not selected because it doesn't contain a. A column name that repeats one of the letters but lacks the other is not selected either, as seen with var a_a_e.
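
To check the pattern without dplyr, it can be tested directly against the column names with grepl. The sketch below also tries a lookahead version, which is my own alternative rather than part of the answer; it needs perl = TRUE and extends more easily to three or more required strings:

```r
nms <- c("a_b_c", "a_a_e", "c_f_g", "c_b_a")

# Alternation: an "a" followed later by a "c", or a "c" followed later by an "a"
either_order <- grepl("(a.*c)|(c.*a)", nms)

# Lookaheads (PCRE only): an "a" somewhere and a "c" somewhere, in any order
lookahead <- grepl("^(?=.*a)(?=.*c)", nms, perl = TRUE)

nms[either_order]
identical(either_order, lookahead)
```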

We could also use str_subset:

library(dplyr)
library(stringr)

df %>%
  select(str_subset(names(df), "(a.*c)|(c.*a)"))

Data:

df <- data.frame(
  a_b_c = 1:4,
  a_a_e = 1:4,
  c_f_g = 1:4,
  c_b_a = 1:4
)

Select columns based on string match - dplyr::select

Within the dplyr world, try:

select(iris, contains("Sepal"))

See the Selection section in ?select for numerous other helpers like starts_with, ends_with, etc.
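
As a brief sketch of two of those helpers on the same iris data, starts_with matches a literal prefix and ends_with a literal suffix:

```r
library(dplyr)

# Columns whose names begin with "Petal"
petal_cols <- select(iris, starts_with("Petal"))

# Columns whose names end with "Width"
width_cols <- select(iris, ends_with("Width"))

names(petal_cols)
names(width_cols)
```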

How to choose columns based on specific names of the columns in a dataframe

You can use grep/grepl to match column names by a pattern. If your data frame is called df:

df[grepl('mean|std', names(df))]

Or in dplyr you can use select with matches:

library(dplyr)
df %>% select(matches('mean|std'))
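
The difference between grepl and grep is only in what they return. A small sketch on a made-up data frame (the column names here are hypothetical, not from the question):

```r
# Hypothetical data frame; the column names are invented for illustration
df <- data.frame(
  id         = 1:3,
  speed_mean = c(1.2, 3.4, 5.6),
  speed_std  = c(0.1, 0.2, 0.3),
  label      = c("x", "y", "z")
)

keep_lgl <- grepl("mean|std", names(df))               # logical, one entry per column
keep_idx <- grep("mean|std", names(df))                # integer positions of the matches
keep_nms <- grep("mean|std", names(df), value = TRUE)  # the matching names themselves

# All three can be used inside df[...] and select the same columns
identical(names(df[keep_lgl]), names(df[keep_idx]))
```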

dplyr select column based on string match

You can construct the column names in the order that you want with outer:

order1 <- c('start', 'middle', 'end')
order2 <- c('f', 'a')
cols <- c(t(outer(order1, order2, paste, sep = '_')))
cols
#[1] "start_f" "start_a" "middle_f" "middle_a" "end_f" "end_a"

data[cols]
#  start_f start_a middle_f middle_a end_f end_a
#1       3       1       11        9     7     5
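
The t() call matters because c() flattens a matrix column by column. A short sketch of the intermediate matrix (m is just a name used here) shows the two orderings:

```r
order1 <- c("start", "middle", "end")
order2 <- c("f", "a")

# 3 x 2 character matrix: rows follow order1, columns follow order2
m <- outer(order1, order2, paste, sep = "_")

by_column <- c(m)     # column-major: all "_f" names first, then all "_a" names
by_row    <- c(t(m))  # transposed first, so each prefix keeps its f/a pair together

by_column
by_row
```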

If not all combinations of order1 and order2 are present in the data, we can use any_of, which selects only the columns that exist in data and does not raise an error for the missing ones.

library(dplyr)
data %>% select(any_of(cols))
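
To see the difference from the strict all_of helper, here is a sketch on a hypothetical one-row data frame in which end_a is deliberately missing. The tryCatch is only there to show that all_of fails:

```r
library(dplyr)

# Hypothetical data: "end_a" is deliberately absent
data <- data.frame(start_a = 1, start_f = 3, middle_a = 9, middle_f = 11, end_f = 7)
cols <- c("start_f", "start_a", "middle_f", "middle_a", "end_f", "end_a")

# any_of() silently skips names that are not in the data
lenient <- data %>% select(any_of(cols))
names(lenient)

# all_of() is strict; the error is caught here only to show that it occurs
strict <- tryCatch(
  data %>% select(all_of(cols)),
  error = function(e) "error: missing column"
)
strict
```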

To select columns based on a pattern in their names:

order1 <- c('start', 'middle', 'end')
order2 <- c('f', 'a')
pattern <- c(t(outer(order1, order2, function(x, y) sprintf('^%s_%s.*', x, y))))
pattern
#[1] "^start_f.*" "^start_a.*" "^middle_f.*" "^middle_a.*" "^end_f.*" "^end_a.*"
cols <- names(data)

data[sapply(pattern, function(x) grep(x, cols))]

#  start_f start_a middle_f middle_a end_f end_a
#1       3       1       11        9     7     5
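
One caveat: sapply returns a plain vector of positions only when every pattern matches exactly one column. If a pattern matches none or several, it returns a list, which cannot be used to index a data frame. A more forgiving variant, sketched on hypothetical data, collects the matches with lapply and flattens them:

```r
# Hypothetical data: two columns start with "start_f", none starts with "end_a"
data <- data.frame(start_f1 = 1, start_f2 = 2, start_a = 3, end_f = 4)
cols <- names(data)
pattern <- c("^start_f.*", "^start_a.*", "^end_f.*", "^end_a.*")

# lapply always returns a list; unlist flattens it and drops the empty results
idx <- unlist(lapply(pattern, function(x) grep(x, cols)))

names(data[idx])
```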

Filtering multiple string columns based on 2 different criteria - questions about grepl and starts_with

We can use filter with rowwise and c_across. Within each row, c_across collects the values of the columns picked by a selection helper (starts_with), grepl returns a logical vector checking each value for either "C18" or (|) a value that starts with (^) 153, and any keeps the row if at least one of those values matches.

library(dplyr) # c_across() requires dplyr >= 1.0.0

df %>%
  # process the data frame one row at a time
  rowwise() %>%
  # collect the values of the columns that start with 'DGN' in the current row,
  # test each value with grepl, and keep the row if any of them matches
  filter(any(grepl("C18|^153", c_across(starts_with("DGN"))))) %>%
  # drop the rowwise grouping again
  ungroup()

Or with filter_at (superseded since dplyr 1.0.0, but still available):

df %>%
  # keep the rows in which any 'DGN' column satisfies the grepl condition
  filter_at(vars(starts_with("DGN")), any_vars(grepl("C18|^153", .)))
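
In newer dplyr (1.0.4 and later, as far as I know) the same filter can be written with if_any. The sketch below is an alternative to the answer's approach, not part of it. It uses data shaped like the example plus one hypothetical extra row (ID 4) that should be dropped, because "9153" contains 153 but does not start with it:

```r
library(dplyr) # if_any() needs dplyr >= 1.0.4

# Same shape as the example data, plus a hypothetical fourth row with no match
df <- data.frame(
  ID   = 1:4,
  DGN1 = c("2_C18", "32", "1532", "9153"),
  DGN2 = c("24", "C18_2", "23", "77")
)

# Keep a row if the condition holds in at least one of the 'DGN' columns
kept <- df %>%
  filter(if_any(starts_with("DGN"), ~ grepl("C18|^153", .x)))

kept$ID
```

Unlike the rowwise version, if_any works column by column, so it tends to be faster on large data frames.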

Data:

df <- data.frame(ID = 1:3, DGN1 = c("2_C18", 32, "1532"),
                 DGN2 = c("24", "C18_2", "23"))

Subsetting strings from a column if they match multiple strings in a different column

We need to group by species and keep only the groups in which all of the required states occur:

library(dplyr)

df1 %>%
  group_by(species) %>%
  filter(all(c('warmed', 'ambient') %in% state)) %>%
  ungroup()

Output:

# A tibble: 4 x 2
#  species state
#  <chr>   <chr>
#1 Rufl    warmed
#2 Rufl    ambient
#3 Assp    warmed
#4 Assp    ambient

Combining the two conditions with & doesn't work, because & compares element by element and the two values are never present in the same row.
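
A small sketch makes the difference visible. The df1 below is hypothetical (the original data is not shown), with one species, "Popr", that has only a single state:

```r
library(dplyr)

# Hypothetical data; "Popr" occurs with only one state
df1 <- data.frame(
  species = c("Rufl", "Rufl", "Assp", "Assp", "Popr"),
  state   = c("warmed", "ambient", "warmed", "ambient", "warmed")
)

# Group-level test: does the species have both states somewhere in its rows?
both <- df1 %>%
  group_by(species) %>%
  filter(all(c("warmed", "ambient") %in% state)) %>%
  ungroup()

# Row-level test with &: no single row can have both values at once
with_and <- df1 %>%
  filter(state == "warmed" & state == "ambient")

nrow(both)
nrow(with_and)
```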


Or using subset in base R:

subset(df1, species %in% names(which(rowSums(table(df1) > 0) == 2)))
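
Broken into steps, the one-liner counts how many distinct states each species has. This sketch uses a hypothetical df1 and intermediate names of my own. Note that it tests for "exactly two distinct states", which matches the grouped filter only when state takes no values other than warmed and ambient:

```r
# Hypothetical data; "Popr" occurs with only one state
df1 <- data.frame(
  species = c("Rufl", "Rufl", "Assp", "Assp", "Popr"),
  state   = c("warmed", "ambient", "warmed", "ambient", "warmed")
)

tab      <- table(df1)                # species x state contingency table
n_states <- rowSums(tab > 0)          # number of distinct states per species
keep     <- names(which(n_states == 2))

result <- subset(df1, species %in% keep)
result
```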

