Filter data table by dynamic column name
If your data is
a <- c(1:9)
b <- c(10:18)
# create a data.frame
df <- data.frame(a,b)
# or a data.table (needs the data.table package loaded first)
library(data.table)
dt <- data.table(a,b)
you can store your condition(s) in a variable x
x <- quote(a >= 3)
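and filter the data.frame using dplyr, unquoting x with !! (a plain df[x, ] won't work because x is an unevaluated call, and recent dplyr versions reject filter(df, x) for the same reason)
library(dplyr)
filter(df, !!x)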
or using data.table, as suggested by @Frank
library(data.table)
dt[eval(x),]
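When the column name itself arrives as a string, the same idea can be sketched roughly as below. The variables col and threshold are illustrative assumptions, not part of the original answer; .data[[ ]] is the dplyr pronoun for string lookups and get() does the equivalent job inside data.table.

```r
library(data.table)
library(dplyr)

df <- data.frame(a = 1:9, b = 10:18)
dt <- as.data.table(df)

# Hypothetical inputs: the column to test and the cut-off value.
col <- "a"
threshold <- 3

# dplyr: look the column up by its string name via the .data pronoun.
res_dplyr <- dplyr::filter(df, .data[[col]] >= threshold)

# data.table: get() resolves the string to the column inside i.
res_dt <- dt[get(col) >= threshold]
```

Both results keep the seven rows where a is at least 3.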
Filtering rows based on dynamic column count & column name in R
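It may make sense to subset and then do a semi join for filtering
library(dplyr)
finalDf <- data.frame()
for (i in seq_len(nrow(combinationDf))) {
  # keep only the columns equal to 1 in row i, then match inputDf on them
  sample <- inputDf %>%
    semi_join(combinationDf %>% slice(i) %>% select(where(~.x==1)))
  # accumulate the matches across iterations
  finalDf <- rbind(finalDf, sample)
}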
At each loop iteration we select all the columns that are 1 and then join to extract the matching rows from inputDf. This will work with any number of columns. Another way of expressing this without the loop in dplyr is
combinationDf %>%
  group_by(id=1:n()) %>%
  group_map(~.x %>%
    select(where(~.x==1)) %>%
    semi_join(inputDf, .)
  ) %>%
  bind_rows()
This may be more readable.
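To see the row-wise semi join on something concrete, here is a minimal sketch with made-up data. The contents of inputDf and combinationDf below are assumptions (flag columns A, B, C holding 0/1), and lapply() is swapped in for the loop purely for illustration.

```r
library(dplyr)

# Hypothetical toy data: A, B, C are 0/1 flag columns.
inputDf <- data.frame(id = 1:4,
                      A = c(1, 0, 1, 1),
                      B = c(1, 1, 0, 1),
                      C = c(0, 1, 1, 1))
combinationDf <- data.frame(A = c(1, 0), B = c(1, 1), C = c(0, 1))

pieces <- lapply(seq_len(nrow(combinationDf)), function(i) {
  # Columns that equal 1 in combination row i become the join keys.
  keys <- combinationDf %>% slice(i) %>% select(where(~ .x == 1))
  # Keep the inputDf rows that match on exactly those columns.
  semi_join(inputDf, keys, by = names(keys))
})
finalDf <- bind_rows(pieces)
finalDf$id
# [1] 1 4 2 4
```

Row 4 appears twice because it satisfies both combinations; wrap the result in distinct() if that is not wanted.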
Using dynamic column names in `data.table`
You should use .SDcols, especially if you have many columns and need a particular operation performed only on a subset of them (apart from the grouping variable columns).
dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]
#    condition  var1   var2    var3
# 1:       one 101.0 1001.0 10001.0
# 2:       two 104.0 1004.0 10004.0
# 3:     three 107.0 1007.0 10007.0
# 4:      four 109.5 1009.5 10009.5
You could also first store the names of all the columns you want the mean of in a variable and then pass it to .SDcols like this:
keys <- setdiff(names(dtb), "condition")
# keys = var1, var2, var3
dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]
Edit: As Matthew Dowle rightly pointed out, since you require the mean to be computed on every other column after grouping by condition, you could just do:
dtb[, lapply(.SD, mean), by=condition]
David's edit (which got rejected): Read more about .SD from this post. I find it relevant here. Thanks @David.
Edit 2: Suppose you have a data.table with 1000 rows and 301 columns (one column for grouping and 300 numeric columns):
library(data.table)
set.seed(45)
dt <- data.table(grp = sample(letters[1:15], 1000, replace=TRUE))
m <- matrix(rnorm(300*1000), ncol=300)
dt <- cbind(dt, m)
setkey(dt, "grp")
and you wanted to find the mean of only some of the columns, say, 251:300. There are a few options.
You can compute the mean of all the columns and then subset these columns (which is not very efficient, as you compute on the whole data):
dt.out <- dt[, lapply(.SD, mean), by=grp]
dim(dt.out) # 15 * 301, not efficient.
You can filter the data.table first to just these columns and then compute the mean (which is again not necessarily the best solution, as you have to create an extra subsetted data.table every time you want operations on certain columns):
dt.sub <- dt[, c(1, 251:300)]
setkey(dt.sub, "grp")
dt.out <- dt.sub[, lapply(.SD, mean), by=grp]
You can specify each of the columns one by one as you'd normally do (but this is only practical for a handful of columns):
# if you just need one or few columns
dt.out <- dt[, list(m.v251 = mean(V251)), by = grp]
So what's the best solution? The answer is .SDcols.
As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD.
This implicitly restricts the columns that are passed to .SD instead of creating a subset (as we did before), and it is very efficient and fast.
How can we do this?
By specifying either the column numbers:
dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300]
dim(dt.out) # 15 * 51 (what we expect)
Or alternatively by specifying the column names:
ids <- paste0("V", 251:300) # get column ids
dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids]
dim(dt.out) # 15 * 51 (what we expect)
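It accepts both column names and numbers as arguments. In both cases, .SD is provided only with the columns we've specified. (Note that because grp is column 1, position 251 is the column named V250, so the two calls above pick slightly different sets of 50 columns.)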
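For a smaller, self-contained illustration, the sketch below builds a made-up dtb (its values are assumptions, not the data from the question) and passes the columns to .SDcols in two ways: as a character vector held in a variable, and through patterns(), which recent data.table versions accept in .SDcols.

```r
library(data.table)

# Hypothetical toy table: one grouping column, two numeric columns,
# and one column we do not want to average.
dtb <- data.table(condition = c("one", "one", "two", "two"),
                  var1 = c(1, 3, 5, 7),
                  var2 = c(10, 30, 50, 70),
                  note = letters[1:4])

# Column names held in a variable.
cols <- c("var1", "var2")
by_name <- dtb[, lapply(.SD, mean), by = condition, .SDcols = cols]

# Column names selected by regular expression.
by_pattern <- dtb[, lapply(.SD, mean), by = condition,
                  .SDcols = patterns("^var")]

by_name$var1
# [1] 2 6
```

Either way the note column never reaches .SD, so mean() is not called on it.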
Hope this helps.
How to filter data based on a column name dynamically in R?
As the column names are stored in variables, we cannot use them directly as bare names. One way is to use the variables to subset the data frame:
aggregate(df[item], df[x_axis_column], sum)
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
Another option is to use a formula with get():
aggregate(get(item)~get(x_axis_column), df, sum)
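A third possibility, sketched here with made-up data, is to build the formula from the strings with reformulate() from base R's stats package, which keeps the real column names in the result instead of get(...) labels. The contents of df below are an assumption chosen for illustration.

```r
# Hypothetical data; item and x_axis_column hold column names as strings.
df <- data.frame(Category = c("First", "Second", "Third", "First", "Third"),
                 Frequency = c(10, 5, 30, 20, 4))
item <- "Frequency"
x_axis_column <- "Category"

# reformulate() builds the formula Frequency ~ Category from the strings.
f <- reformulate(x_axis_column, response = item)
res <- aggregate(f, data = df, FUN = sum)
res
#   Category Frequency
# 1    First        30
# 2   Second         5
# 3    Third        34
```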
Filter data.table with another data.table with different column names
If I'm understanding the question correctly, this is a merge of dt with dt_score_1 with the conditions area = zone, cluster = cluster_mode.
dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]
#     record area score cluster i.score cluster_pct cluster_freq record_freq
#  1:      1    A     1       X       1   100.00000            2           2
#  2:      2    A     1       X       1   100.00000            2           2
#  3:      7    B     1       X       1    66.66667            2           3
#  4:      8    B     1       X       1    66.66667            2           3
#  5:     11    C     2       X       1   100.00000            1           1
#  6:     12    C     1       X       1   100.00000            1           1
#  7:     14    D     1       Z       1    80.00000            4           5
#  8:     15    D     1       Z       1    80.00000            4           5
#  9:     16    D     1       Z       1    80.00000            4           5
# 10:     17    D     1       Z       1    80.00000            4           5
# 11:     20    D     3       Z       1    80.00000            4           5
For a more detailed explanation of join-as-filter, see the question linked by @Frank: "Perform a semi-join with data.table".
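If the goal is only to filter dt, without carrying over the columns of the second table, one possible sketch is to ask the join for matching row numbers with which = TRUE and subset on those. The two small tables below are invented for illustration, with lookup standing in for dt_score_1.

```r
library(data.table)

# Hypothetical stand-ins for dt and dt_score_1.
dt <- data.table(record = 1:6,
                 area = c("A", "A", "B", "B", "C", "C"),
                 cluster = c("X", "Y", "X", "Z", "X", "Y"))
lookup <- data.table(zone = c("A", "B", "C"),
                     cluster_mode = c("X", "X", "Y"))

# which = TRUE returns the row numbers of dt that match lookup;
# nomatch = NULL drops lookup rows without a match.
idx <- dt[lookup, on = .(area = zone, cluster = cluster_mode),
          which = TRUE, nomatch = NULL]

# Subset once, in original row order, without duplicates.
filtered <- dt[sort(unique(idx))]
filtered$record
# [1] 1 3 6
```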
subsetting data tables by dynamic column names
Let's look at your expression in i:
grep(i,colnames(mm2myModuleByYear),value=TRUE)
# [1] "module1997"
Therefore the expression:
grep(i,colnames(mm2myModuleByYear),value=TRUE)==mId
# [1] FALSE
would return FALSE (of course "module1997" != 37). What you intend here is to fetch the column named by your grep() expression. To do that, you can use get() from base R.
with(mm2myModuleByYear, get(grep(i,colnames(mm2myModuleByYear),value=TRUE)))
# [1] 1428 669 37 NA NA NA
In short, you're missing a get() in your i-expression.
mm2myModuleByYear[get(grep(i,colnames(mm2myModuleByYear),value=TRUE))==mId, authId]
# [1] 2270
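If the column name follows a fixed pattern, it may be simpler to build it with paste0() instead of searching with grep(). The sketch below uses an invented table (modules, with made-up values) rather than the data from the question.

```r
library(data.table)

# Hypothetical table: one "module<year>" column per year.
modules <- data.table(authId = c(10L, 20L, 30L),
                      module1997 = c(1428L, 669L, 37L),
                      module1998 = c(5L, 37L, NA))

year <- 1997
mId <- 37

# Build the column name as a string, then resolve it with get() inside i.
col <- paste0("module", year)
hit <- modules[get(col) == mId, authId]
hit
# [1] 30
```

Rows where the comparison yields NA are dropped by data.table, so missing module ids do not need special handling here.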
Filter data.table using inequalities and variable column names
Use get(mycol), because you want the argument to dt[ to be the contents of the column whose name is stored in mycol. I believe dt[mycol ...] looks for a column called "mycol" in the data.table itself; finding none, it falls back to the character variable, so the comparison is made against the string rather than against the column.
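A minimal sketch of the difference, using a made-up table: get(mycol) resolves the string to the column, and dt[[mycol]] is an alternative that avoids get() altogether. The data and the cut-off of 2 are assumptions for illustration.

```r
library(data.table)

dt <- data.table(x = 1:5, y = c(5, 4, 3, 2, 1))
mycol <- "y"

# get() looks up the column named by the string and compares its values.
via_get <- dt[get(mycol) > 2]

# Equivalent: extract the column as a vector first, then use it in i.
via_brackets <- dt[dt[[mycol]] > 2]

via_get$x
# [1] 1 2 3
```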
Filter conditions based on a list of column in data.table
Here is one option using rowSums and .I to extract the matching row numbers before subsetting:
cmts <- grep("^CMT_", names(dt), value=TRUE)
dt[dt[, .I[rowSums(.SD!="") > 1L], .SDcols=cmts]]
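To make the two steps visible, here is the same idea on a small invented table (the CMT_ values are assumptions): the inner call returns row numbers, and the outer subset uses them.

```r
library(data.table)

# Hypothetical table: empty strings mean "no comment".
dt <- data.table(id = 1:4,
                 CMT_A = c("x", "", "y", ""),
                 CMT_B = c("z", "", "", "w"),
                 CMT_C = c("", "", "q", "r"))

cmts <- grep("^CMT_", names(dt), value = TRUE)

# Step 1: row numbers where more than one CMT_ column is non-empty.
keep <- dt[, .I[rowSums(.SD != "") > 1L], .SDcols = cmts]

# Step 2: subset the full table by those row numbers.
res <- dt[keep]
res$id
# [1] 1 3 4
```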