How to Select Variables in an R Dataframe Whose Names Contain a Particular String

How do I select variables in an R dataframe whose names contain a particular string?

If you just want the variable names:

grep("^[Bb]", names(df), value=TRUE)

grep("3", names(df), value=TRUE)

If you want to select those columns, use either of the following:

df[,grep("^[Bb]", names(df), value=TRUE)]
df[,grep("^[Bb]", names(df))]

The first selects by name; the second selects by column position.
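A minimal sketch of the two forms side by side, assuming a made-up data frame (df_demo and its column names are illustrative only, not from the question). It also shows drop = FALSE, which keeps the result a data frame when only one column matches:

```r
# Hypothetical example data; the column names are made up for illustration
df_demo <- data.frame(Bat = 1:3, ball = 4:6, cat3 = 7:9)

# Names starting with "B" or "b"
b_names <- grep("^[Bb]", names(df_demo), value = TRUE)

# Positions of the same columns
b_idx <- grep("^[Bb]", names(df_demo))

# Selecting by name and by position gives the same columns
same_result <- identical(df_demo[, b_names], df_demo[, b_idx])

# With a single match, [ , j] returns a plain vector unless drop = FALSE is set
one_col <- df_demo[, grep("3", names(df_demo)), drop = FALSE]
```

Selecting by name tends to be the safer choice if the column order might change between the grep call and the subsetting step.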

Subset data to contain only columns whose names match a condition

Try grepl on the names of your data.frame. grepl matches a regular expression against a target and returns TRUE if a match is found and FALSE otherwise. The function is vectorised, so if you pass a vector of strings you get back a logical vector of the same length.

Example

# Data
df <- data.frame( ABC_1 = runif(3),
                  ABC_2 = runif(3),
                  XYZ_1 = runif(3),
                  XYZ_2 = runif(3) )

#      ABC_1     ABC_2     XYZ_1     XYZ_2
#1 0.3792645 0.3614199 0.9793573 0.7139381
#2 0.1313246 0.9746691 0.7276705 0.0126057
#3 0.7282680 0.6518444 0.9531389 0.9673290

# Use grepl
df[ , grepl( "ABC" , names( df ) ) ]
# ABC_1 ABC_2
#1 0.3792645 0.3614199
#2 0.1313246 0.9746691
#3 0.7282680 0.6518444

# grepl returns logical vector like this which is what we use to subset columns
grepl( "ABC" , names( df ) )
#[1] TRUE TRUE FALSE FALSE

To answer the second part (dropping rows in which every ABC value is 0), I'd make the subset data.frame and then build a logical vector that indexes the rows to keep, like this:

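set.seed(1)
# 'replace' is spelled out in full rather than relying on partial matching of 'repl'
df <- data.frame( ABC_1 = sample(0:1, 3, replace = TRUE),
                  ABC_2 = sample(0:1, 3, replace = TRUE),
                  XYZ_1 = sample(0:1, 3, replace = TRUE),
                  XYZ_2 = sample(0:1, 3, replace = TRUE) )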

# We will want to discard the second row because 'all' ABC values are 0:
# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0 1 1 0
#2 0 0 1 0
#3 1 1 1 0


df1 <- df[ , grepl( "ABC" , names( df ) ) ]

ind <- apply( df1 , 1 , function(x) any( x > 0 ) )

df1[ ind , ]
# ABC_1 ABC_2
#1 0 1
#3 1 1
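The row filter can also be written without apply. The sketch below is one possible alternative, not part of the original answer: it uses rowSums on a logical matrix, and the values are hard-coded to match the table above (df_fixed is a hypothetical name) so the result does not depend on the random number generator.

```r
# Fixed example data matching the table above (no random sampling)
df_fixed <- data.frame(ABC_1 = c(0, 0, 1), ABC_2 = c(1, 0, 1),
                       XYZ_1 = c(1, 1, 1), XYZ_2 = c(0, 0, 0))

# Keep only the ABC columns; drop = FALSE guards against a single match
abc <- df_fixed[, grepl("ABC", names(df_fixed)), drop = FALSE]

# TRUE for rows with at least one ABC value above 0
keep <- rowSums(abc > 0) > 0

abc_kept <- abc[keep, , drop = FALSE]   # ABC columns only
all_kept <- df_fixed[keep, ]            # same rows, all columns
```

Indexing the full data frame with the same logical vector, as in all_kept, is useful when the XYZ columns should be kept alongside the filtered rows.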

Extract Variable Names whose Values contain a specific string (R)

If every value in Mark1 and Mark2 contains a % we can check only the first row:

colnames(df)[grepl('%', df[1,])]
[1] "Mark1" "Mark2"

Otherwise, you can use apply with MARGIN = 2 to apply this function to each column and return a named logical vector:

apply(df, 2, function(x) any(grepl('%', x)))
Name Mark1 Mark2 Mark3
FALSE TRUE TRUE FALSE

If you just want the variable names, use this logical vector to subset colnames(df):

colnames(df)[apply(df, 2, function(x) any(grepl('%', x)))]
[1] "Mark1" "Mark2"
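The data frame itself is not shown in this answer, so here is a self-contained sketch with made-up values (marks and its contents are assumptions; only the column names follow the output above). It uses vapply over the columns, which avoids the coercion of the whole data frame to a character matrix that apply performs:

```r
# Hypothetical data; the original data frame is not shown in the answer
marks <- data.frame(Name  = c("Ann", "Bob"),
                    Mark1 = c("80%", "65%"),
                    Mark2 = c("71%", "90%"),
                    Mark3 = c(7, 9),
                    stringsAsFactors = FALSE)

# For each column, check whether any value contains a literal "%"
has_pct <- vapply(marks, function(x) any(grepl("%", x, fixed = TRUE)), logical(1))

# Names of the columns where at least one value matched
pct_cols <- names(marks)[has_pct]
```

Passing fixed = TRUE treats the pattern as a literal string, which is a reasonable default when the search term is not meant to be a regular expression.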

Using a string from a list to select a column in R

Try this:

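# A list of column names (called var_list so that it does not mask base::list)
var_list <- list("Var1", "Var2", "Var3")
df1 <- data.frame(Var1 = 1:2, Var2 = c(21, 15), Var3 = c(10, 9))
df2 <- data.frame(Var1 = 1, Var2 = 16, Var3 = 8)
# Sum the first listed column of df1 and df2 (df2's single value is recycled)
df1$Var4 <- df1[, var_list[[1]]] + df2[, var_list[[1]]]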

Var1 Var2 Var3 Var4
1 1 21 10 2
2 2 15 9 3
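The same idea extends to every name in the list. The following is a sketch under the assumption that all listed columns exist in both data frames; the object names a, b, col_list, and sums are illustrative only.

```r
# Hypothetical objects mirroring the example above
col_list <- list("Var1", "Var2", "Var3")
a <- data.frame(Var1 = 1:2, Var2 = c(21, 15), Var3 = c(10, 9))
b <- data.frame(Var1 = 1, Var2 = 16, Var3 = 8)

# [[ ]] accepts a column name held in a variable; $ does not
first_sum <- a[[col_list[[1]]]] + b[[col_list[[1]]]]

# Apply the same sum to every listed column
sums <- lapply(col_list, function(v) a[[v]] + b[[v]])
names(sums) <- unlist(col_list)
```

Here sums is a named list with one numeric vector per column, which can be turned into a data frame with as.data.frame(sums) if needed.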

Search dataframe for columns with values that contains certain string and output new dataframe

I used the idea you had and reshaped the data to long format with gather() from tidyr. There are three steps. First, I converted any factor columns to character (otherwise, at least for me, gather() throws a warning). Second, I gathered all columns except PATIENT_ID and EVENT_NAME. Third, I filtered to only the rows whose value contains pdf or jpg. I'm not sure if this is precisely what you need, but it might work:

library(tidyr)
library(dplyr)
mydata %>%
  mutate_if(is.factor, as.character) %>%
  gather("var_name", "file_name", -PATIENT_ID, -EVENT_NAME) %>%
  filter(grepl("pdf|jpg", file_name))
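If you would rather avoid the package dependencies, the same three steps can be sketched in base R. This is only an illustration on made-up data: mydata_demo, scan, and photo are hypothetical, and only PATIENT_ID and EVENT_NAME come from the code above.

```r
# Hypothetical data; only PATIENT_ID and EVENT_NAME come from the answer above
mydata_demo <- data.frame(PATIENT_ID = c(1, 2),
                          EVENT_NAME = c("visit1", "visit2"),
                          scan       = c("a.pdf", "none"),
                          photo      = c("b.jpg", "c.txt"),
                          stringsAsFactors = FALSE)

id_cols  <- c("PATIENT_ID", "EVENT_NAME")
val_cols <- setdiff(names(mydata_demo), id_cols)

# Stack the non-id columns into var_name / file_name pairs, one block per column;
# as.character() covers the factor-to-character step
long <- do.call(rbind, lapply(val_cols, function(v) {
  data.frame(mydata_demo[id_cols],
             var_name  = v,
             file_name = as.character(mydata_demo[[v]]),
             stringsAsFactors = FALSE)
}))

# Keep only the rows whose value mentions pdf or jpg
hits <- long[grepl("pdf|jpg", long$file_name), ]
```

The result has one row per matching cell, with var_name recording which original column the file name came from.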

Best of luck to you, I hope this helps!

Select columns based on string match - dplyr::select

Within the dplyr world, try:

select(iris,contains("Sepal"))

See the Selection section in ?select for numerous other helpers like starts_with, ends_with, etc.
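For comparison, here is a sketch of rough base R counterparts to three of those helpers, using the built-in iris data. One difference to keep in mind: the dplyr helpers ignore case by default, while the base functions below are case-sensitive.

```r
cols <- names(iris)

# Roughly contains("Sepal"): literal substring match anywhere in the name
sepal_cols <- iris[, grepl("Sepal", cols, fixed = TRUE), drop = FALSE]

# Roughly starts_with("Petal")
petal_cols <- iris[, startsWith(cols, "Petal"), drop = FALSE]

# Roughly ends_with("Width")
width_cols <- iris[, endsWith(cols, "Width"), drop = FALSE]
```

Each call returns a data frame with all 150 rows and only the matching columns.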


