Subset Data to Contain Only Columns Whose Names Match a Condition

Subset data to contain only columns whose names match a condition

Try grepl on the names of your data.frame. grepl matches a regular expression against a target and returns TRUE if a match is found and FALSE otherwise. The function is vectorised, so you can pass it a vector of strings and get a vector of logical values back.

Example

#  Data
df <- data.frame( ABC_1 = runif(3),
                  ABC_2 = runif(3),
                  XYZ_1 = runif(3),
                  XYZ_2 = runif(3) )

#      ABC_1     ABC_2     XYZ_1     XYZ_2
#1 0.3792645 0.3614199 0.9793573 0.7139381
#2 0.1313246 0.9746691 0.7276705 0.0126057
#3 0.7282680 0.6518444 0.9531389 0.9673290

# Use grepl
df[ , grepl( "ABC" , names( df ) ) ]
#      ABC_1     ABC_2
#1 0.3792645 0.3614199
#2 0.1313246 0.9746691
#3 0.7282680 0.6518444

# grepl returns a logical vector like this, which is what we use to subset the columns
grepl( "ABC" , names( df ) )
#[1] TRUE TRUE FALSE FALSE

To answer the second part, I'd make the subset data.frame and then make a vector that indexes the rows to keep (a logical vector) like this...

set.seed(1)
df <- data.frame( ABC_1 = sample(0:1, 3, repl = TRUE),
                  ABC_2 = sample(0:1, 3, repl = TRUE),
                  XYZ_1 = sample(0:1, 3, repl = TRUE),
                  XYZ_2 = sample(0:1, 3, repl = TRUE) )

# We will want to discard the second row because 'all' ABC values are 0:
#  ABC_1 ABC_2 XYZ_1 XYZ_2
#1     0     1     1     0
#2     0     0     1     0
#3     1     1     1     0


df1 <- df[ , grepl( "ABC" , names( df ) ) ]

ind <- apply( df1 , 1 , function(x) any( x > 0 ) )

df1[ ind , ]
#  ABC_1 ABC_2
#1     0     1
#3     1     1

Subset data to contain only columns whose names match multiple conditions using data.table

You can select multiple columns that match certain patterns in data.table by using patterns() in the .SDcols argument:

# turn df into data.table
setDT(df)

# select columns that contain ABC or XYZ
df[, .SD, .SDcols=patterns("ABC|XYZ")]

# or
df[, grep("ABC|XYZ", names(df)), with=FALSE]

Subsetting rows and columns at the same time

cols = grep("ABC|XYZ",  names(df))

df[rowSums(df[, ..cols]>0)>0, .SD, .SDcols=cols]

Select only columns whose names match a condition, together with a fixed column

Try this:

output$fault_template <- renderDataTable({
  fau <- fau[ , c(1, grep(input$su, names(fau))) ]
  datatable(fau[ , -1:-1], class = 'cell-border stripe')
})

By changing grepl to grep you get column indexes instead of a logical vector. Assuming that column A has index 1, add it to the selection with c(1, ...).

If column A has a column index that may change, try:

c(grep("A", names(fau)), grep(input$su, names(fau)))

Input pattern

If input$su is a character string like "ASD GHG BVG JJJ", you need to convert it into a usable regex.

Try changing:

grep(input$su, names(fau))

to

grep( gsub(" +", "|", input$su), names(fau))

This results in the pattern "ASD|GHG|BVG|JJJ". I am assuming that each three-letter group is a column name.
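As a quick check of just the conversion step (using the example string from above):

gsub(" +", "|", "ASD GHG BVG JJJ")
#[1] "ASD|GHG|BVG|JJJ"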

Subset data based on partial match of column names

You mentioned you may be looking for symbols, so for this particular example we can use [[:punct:]] as our regular expression. This will find all the strings with punctuation symbols in the column names.

d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3))
names(d) <- c("FullColName1", "FullColName2", "FullColName3",
"PartString1()","PartString2()")

d[grepl("[[:punct:]]", names(d))]
#   PartString1() PartString2()
# 1            13             1
# 2            12             1
# 3            11             1

This last part just illustrates another way to do this, using string-processing functions from stringr:

library(stringr)
d[str_detect(names(d), "[[:punct:]]")]
#   PartString1() PartString2()
# 1            13             1
# 2            12             1
# 3            11             1

Added per the OP's comment:

d[grepl("ring[12()]", names(d))]

to get either of the substrings ring1() or ring2() from the names vector.

Subset data table by all unique entries in columns whose name contain a certain substring, fill with NA for other entries
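The answers below assume a data.table along these lines (a hypothetical dt, reconstructed to be consistent with the output shown):

library(data.table)

# hypothetical dt: the pattern "_ID$" should pick up a_ID and b_ID only
dt <- data.table(a_ID  = c(1, 1, 2, 11),
                 b_ID  = c("XY", "XY", "XY", "XY"),
                 value = c(0.5, 0.2, 0.7, 0.9))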

We may get the unique elements and then replace the duplicated ones with NA:

library(data.table)
dt[, lapply(.SD, unique), .SDcols = patterns("_ID$")][,
lapply(.SD, \(x) replace(x, duplicated(x), NA))]

-output

    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

Or another option with unique

unique(dt[, .(a_ID, b_ID)])[, lapply(.SD, \(x) fcase(!duplicated(x), x))]
    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

Another option is to wrap the code in a block: take the unique values, check their lengths, and pad the shorter ones with NA to equalise the lengths.

dt[, {lst1 <- lapply(.SD, unique)
      mx <- max(lengths(lst1))
      lapply(lst1, `length<-`, mx)}, .SDcols = patterns("_ID$")]
    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

We may also use collapse: select the columns (gvr), get the unique rows (funique), loop over the columns with dapply, and replace the duplicates with NA.

library(collapse)
dapply(funique(gvr(dt, "_ID$")), MARGIN = 2,
       FUN = \(x) replace(x, duplicated(x), NA))
    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

Subset column names with specific string

To specify "ABC_" followed by a one or more digits (i.e. \\d+ or [0-9]+), you can use

df1 <- df[ , grepl("ABC_\\d+", names( df ), perl = TRUE ) ]
# df1 <- df[ , grepl("ABC_[0-9]+", names( df ), perl = TRUE ) ] # another option

To force the column names to start with "ABC_" you can add ^ to the regex to match only when "ABC_\d+" occurs at the start of the string as opposed to occurring anywhere within it.

df1 <- df[ , grepl("^ABC_\\d+", names( df ), perl = TRUE ) ]

If dplyr is more to your liking, you might try

library(dplyr)
select(df, matches("^ABC_\\d+"))

Is it possible to select columns in R based on any value in the column?

We can pass a function inside where() within select(): check whether the column is numeric and, if it is, whether any value equals 9. Alternatively, any(.x == 9) can be changed to 9 %in% .x.
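The example below assumes a df along these lines (the apple and banana values are taken from the output shown; cherry and grape are hypothetical columns added to illustrate what gets dropped):

df <- data.frame(
  apple  = c(1, 4, 6, 8, 9, 9, 2, 4, 7, 4),
  banana = c(9, 9, 4, 8, 1, 3, 6, 7, 5, 9),
  cherry = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2),  # numeric, but contains no 9
  grape  = letters[1:10]                     # not numeric
)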

library(dplyr)
df %>%
  select(where(~ is.numeric(.x) && any(.x == 9)))

-output

   apple banana
1      1      9
2      4      9
3      6      4
4      8      8
5      9      1
6      9      3
7      2      6
8      4      7
9      7      5
10     4      9

Reduce columns whose names match a pattern

The reason is that grep with value = TRUE returns only the column names, whereas we need the column values. Use .SD to subset the columns by those names:
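A hypothetical testing, reconstructed to be consistent with the printed output below:

library(data.table)

testing <- data.table(
  first_column = c("Alpha", "Beta", "Charlie", "Tango",
                   "Alpha, Beta,Alpha", "Alpha,Beta,Charlie",
                   "Tango,Tango,Tango,Tango", "Tango,Tango,Tango, Tango",
                   "Tango,Tango,Tango, Tango , Alpha,Beta,Charlie, Alpha, Alpha ,Alpha"),
  number_1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),    # doubles, hence <num>
  number_2 = c(11, 12, 13, 14, 15, 16, 17, 18, 19),
  number_3 = 2:10,                             # integers, hence <int>
  number_4 = 12:20
)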

library(data.table)
testing[, `:=` (
  "Total 1" = Reduce(`+`, .SD[, grep("number_1|number_2", names(.SD),
                                     value = TRUE), with = FALSE]),
  "Total 2" = Reduce(`+`, .SD[, grep("number_3|number_4", names(.SD),
                                     value = TRUE), with = FALSE]))]

-output

> testing
first_column number_1 number_2 number_3 number_4 Total 1 Total 2
<char> <num> <num> <int> <int> <num> <int>
1: Alpha 1 11 2 12 12 14
2: Beta 2 12 3 13 14 16
3: Charlie 3 13 4 14 16 18
4: Tango 4 14 5 15 18 20
5: Alpha, Beta,Alpha 5 15 6 16 20 22
6: Alpha,Beta,Charlie 6 16 7 17 22 24
7: Tango,Tango,Tango,Tango 7 17 8 18 24 26
8: Tango,Tango,Tango, Tango 8 18 9 19 26 28
9: Tango,Tango,Tango, Tango , Alpha,Beta,Charlie, Alpha, Alpha ,Alpha 9 19 10 20 28 30

If there are multiple sets, we may also create a named list and Filter the list elements based on the occurrence of their names in the data:

lst_names <- list(c("number_1", "number_2"),
                  c("number_3", "number_4"),
                  c("number_5", "number_6"))
names(lst_names) <- paste("Total", seq_along(lst_names))
lst_names_sub <- Filter(length, lapply(lst_names, function(x)
  intersect(x, names(testing))))
testing[, names(lst_names_sub) := lapply(lst_names_sub, function(x)
  Reduce(`+`, .SD[, x, with = FALSE]))]

Subset Columns based on partial matching of column names in the same data frame
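The answer below assumes eatable looks something like this (hypothetical, reconstructed from the output shown):

eatable <- data.frame(
  fruits_area          = c(12, 33, 660),
  fruits_production    = c(100, 250, 510),
  vegetables_area      = c(26, 40, 43),
  vegetable_production = c(324, 580, 581)
)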

You could try:

v <- unique(substr(names(eatable), 0, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])

Or using map() + select_()

library(tidyverse)
map(v, ~select_(eatable, ~matches(.)))

Which gives:

#[[1]]
#  fruits_area fruits_production
#1          12               100
#2          33               250
#3         660               510
#
#[[2]]
#  vegetables_area vegetable_production
#1              26                  324
#2              40                  580
#3              43                  581

Should you want to make it into a function:

checkExpression <- function(df, l = 5) {
  v <- unique(substr(names(df), 0, l))
  lapply(v, function(x) df[grepl(x, names(df))])
}

Then simply use:

checkExpression(eatable, 5)

