Subset data to contain only columns whose names match a condition
Try grepl
on the names of your data.frame
. grepl
matches a regular expression to a target and returns TRUE
if a match is found and FALSE
otherwise. The function is vectorised so you can pass a vector of strings to match and you will get a vector of boolean values returned.
Example
# Data
df <- data.frame( ABC_1 = runif(3),
ABC_2 = runif(3),
XYZ_1 = runif(3),
XYZ_2 = runif(3) )
# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0.3792645 0.3614199 0.9793573 0.7139381
#2 0.1313246 0.9746691 0.7276705 0.0126057
#3 0.7282680 0.6518444 0.9531389 0.9673290
# Use grepl
df[ , grepl( "ABC" , names( df ) ) ]
# ABC_1 ABC_2
#1 0.3792645 0.3614199
#2 0.1313246 0.9746691
#3 0.7282680 0.6518444
# grepl returns logical vector like this which is what we use to subset columns
grepl( "ABC" , names( df ) )
#[1] TRUE TRUE FALSE FALSE
To answer the second part, I'd make the subset data.frame and then make a vector that indexes the rows to keep (a logical vector) like this...
set.seed(1)
df <- data.frame( ABC_1 = sample(0:1,3,repl = TRUE),
ABC_2 = sample(0:1,3,repl = TRUE),
XYZ_1 = sample(0:1,3,repl = TRUE),
XYZ_2 = sample(0:1,3,repl = TRUE) )
# We will want to discard the second row because 'all' ABC values are 0:
# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0 1 1 0
#2 0 0 1 0
#3 1 1 1 0
df1 <- df[ , grepl( "ABC" , names( df ) ) ]
ind <- apply( df1 , 1 , function(x) any( x > 0 ) )
df1[ ind , ]
# ABC_1 ABC_2
#1 0 1
#3 1 1
Subset data to contain only columns whose names match multiple condition using data.table
You can select multiple columns that match certain patterns in data.table using patterns
in its .SDcols
argument:
# turn df into data.table
setDT(df)
# select columns that contain ABD or XYZ
df[, .SD, .SDcols=patterns("ABC|XYZ")]
# or
df[, grep("ABC|XYZ", names(df)), with=FALSE]
subsetting rows and columns at the same time
cols = grep("ABC|XYZ", names(df))
df[rowSums(df[, ..cols]>0)>0, .SD, .SDcols=cols]
select data that only columns whose names match a condition with a fixed column
Try this:
output$fault_template <- renderDataTable({
fau <- fau[, c(1, grep(input$su, names(fau))) ]
datatable(fau[,-1:-1],class = 'cell-border stripe')
})
By changing grepl
to grep
you'll get column indexes. Assuming that column A has index 1, then add it to the selection with c(1, ...
If column A has a column index that may change, try:
c(grep("A", names(fau)), grep(input$su, names(fau)))
input Pattern
If the input$su
is a character like: "ASD GHG BVG JJJ" you need to convert it to a useful regex
Try changing:
grep(input$su, names(fau))
to
grep( gsub(" +", "|", input$su), names(fau))
This results in a pattern: "ASD|GHG|BVG|JJJ"
. I am assuming that each three letter group is a column name
Subset data based on partial match of column names
You mentioned you may be looking for symbols, so for this particular example we can use [[:punct:]]
as our regular expression. This will find all the strings with punctuation symbols in the column names.
d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3))
names(d) <- c("FullColName1", "FullColName2", "FullColName3",
"PartString1()","PartString2()")
d[grepl("[[:punct:]]", names(d))]
# PartString1() PartString2()
# 1 13 1
# 2 12 1
# 3 11 1
This last part just illustrates another way to do this with other string processing functions from stringr
library(stringr)
d[str_detect(names(d), "[[:punct:]]")]
# PartString1() PartString2()
# 1 13 1
# 2 12 1
# 3 11 1
ADD per OPs comment
d[grepl("ring[12()]", names(d))]
to get either of the substrings ring1()
or ring2()
from the names vector
Subset data table by all unique entries in columns whose name contain a certain substring, fill with NA for other entries
We may get the unique
elements and then replace
the duplicated
with NA
library(data.table)
dt[, lapply(.SD, unique), .SDcols = patterns("_ID$")][,
lapply(.SD, \(x) replace(x, duplicated(x), NA))]
-output
a_ID b_ID
<num> <char>
1: 1 XY
2: 2 <NA>
3: 11 <NA>
Or another option with unique
unique(dt[, .(a_ID, b_ID)])[, lapply(.SD, \(x) fcase(!duplicated(x), x))]
a_ID b_ID
<num> <char>
1: 1 XY
2: 2 <NA>
3: 11 <NA>
Or another option is to block the code, check for the length
s after the unique
step and append NA
to fix the length
dt[, {lst1 <- lapply(.SD, unique)
mx <- max(lengths(lst1))
lapply(lst1, `length<-`, mx)}, .SDcols = patterns("_ID$")]
a_ID b_ID
<num> <char>
1: 1 XY
2: 2 <NA>
3: 11 <NA>
We may also use collapse
- select the columns (gvr
), get the unique rows (funique
), loop over the columns with dapply
, replace
the duplicates with NA
library(collapse)
dapply(funique(gvr(dt, "_ID$")), MARGIN = 2,
FUN = \(x) replace(x, duplicated(x), NA))
a_ID b_ID
<num> <char>
1: 1 XY
2: 2 <NA>
3: 11 <NA>
Subset column names with specific string
To specify "ABC_" followed by a one or more digits (i.e. \\d+
or [0-9]+
), you can use
df1 <- df[ , grepl("ABC_\\d+", names( df ), perl = TRUE ) ]
# df1 <- df[ , grepl("ABC_[0-9]+", names( df ), perl = TRUE ) ] # another option
To force the column names to start with "ABC_" you can add ^
to the regex to match only when "ABC_\d+" occurs at the start of the string as opposed to occurring anywhere within it.
df1 <- df[ , grepl("^ABC_\\d+", names( df ), perl = TRUE ) ]
If dplyr
is more to your liking, you might try
library(dplyr)
select(df, matches("^ABC_\\d+"))
Is it possible to select columns in r based on any value in the column?
We can pass a function in select
within where
- check whether the column is numeric and if that is numeric, check whether there are any
value equal to 9. In addition can change the any(.x ==9)
to 9 %in% .x
.
library(dplyr)
df %>%
select(where(~is.numeric(.x) && any(.x == 9)))
-output
apple banana
1 1 9
2 4 9
3 6 4
4 8 8
5 9 1
6 9 3
7 2 6
8 4 7
9 7 5
10 4 9
Reduce columns that whose names matches a pattern
The reason is that the grep
returns only the column names with value = TRUE
, we need the value of the columns, Use .SD
to subset the columns from the column names
library(data.table)
testing[,`:=` (
"Total 1" = Reduce(`+`, .SD[, grep("number_1|number_2", names(.SD),
value = TRUE), with = FALSE]),
"Total 2" = Reduce(`+`, .SD[, grep("number_3|number_4", names(.SD),
value = TRUE), with = FALSE]))]
-output
> testing
first_column number_1 number_2 number_3 number_4 Total 1 Total 2
<char> <num> <num> <int> <int> <num> <int>
1: Alpha 1 11 2 12 12 14
2: Beta 2 12 3 13 14 16
3: Charlie 3 13 4 14 16 18
4: Tango 4 14 5 15 18 20
5: Alpha, Beta,Alpha 5 15 6 16 20 22
6: Alpha,Beta,Charlie 6 16 7 17 22 24
7: Tango,Tango,Tango,Tango 7 17 8 18 24 26
8: Tango,Tango,Tango, Tango 8 18 9 19 26 28
9: Tango,Tango,Tango, Tango , Alpha,Beta,Charlie, Alpha, Alpha ,Alpha 9 19 10 20 28 30
If there are multiple sets, we may also create a named list
, Filter
the list
elements based on the occurence of names
lst_names <- list(c("number_1", "number_2"),
c("number_3", "number_4"),
c("number_5", "number_6"))
names(lst_names) <- paste("Total", seq_along(lst_names))
lst_names_sub <- Filter(length, lapply(lst_names, function(x)
intersect(x, names(testing))))
testing[, names(lst_names_sub) := lapply(lst_names_sub, function(x)
Reduce(`+`, .SD[, x, with = FALSE]))]
Subset Columns based on partial matching of column names in the same data frame
You could try:
v <- unique(substr(names(eatable), 0, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map()
+ select_()
library(tidyverse)
map(v, ~select_(eatable, ~matches(.)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 0, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
Related Topics
Overlay Histogram With Density Curve
How to Count Runs in a Sequence
Unlist Data Frame Column Preserving Information from Other Column
How to Number/Label Data-Table by Group-Number from Group_By
How to Create an R Function Programmatically
How to Use Reference Variables by Character String in a Formula
R on Macos Error: Vector Memory Exhausted (Limit Reached)
Read All Files in Directory and Apply Multiple Functions to Each Data Frame
Offline Install of R Package and Dependencies
Sum Values in a Rolling/Sliding Window
Identify Groups of Linked Episodes Which Chain Together
Scatterplot With Too Many Points
A Similar Function to R'S Rep in Matlab
Unordered Combinations of All Lengths
Better Explanation of When to Use Imports/Depends