How to filter a table's row based on an external vector?
Use the %in% operator.
#Sample data
dat <- data.frame(patients = 1:5, treatment = letters[1:5],
hospital = c("yyy", "yyy", "zzz", "www", "uuu"), response = rnorm(5))
#List of hospitals we want to do further analysis on
goodHosp <- c("yyy", "uuu")
You can either index directly into your data.frame object:
dat[dat$hospital %in% goodHosp ,]
or use the subset command:
subset(dat, hospital %in% goodHosp)
How to filter R datatable based on external column vector
A couple more options:
# extended example
DT <- rbind(DT,DT)
select <- c(select,rev(select))
expected <- c(4,8,3,1,8,6)
# create a new column with by
DT[, V1 := .SD[[select]], by = select]$V1
# or use ave
ave( seq(nrow(DT)), select, FUN = function(ii) DT[[ select[ii][1] ]][ii] )
These are both basically doing the same thing: for each value v in select, grab the corresponding vector, DT[[v]], and subset it to where select == v.
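As a minimal base-R sketch of that idea (the small table and the select vector below are made-up illustration data, not the ones from the question, and a plain loop stands in for the by/ave grouping):

```r
# Hypothetical data: three columns, and one column name chosen per row
DT <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
select <- c("b", "a", "c")

result <- numeric(nrow(DT))
for (v in unique(select)) {
  rows <- select == v            # rows that ask for column v
  result[rows] <- DT[[v]][rows]  # take column v, keep only those rows
}
result
```

Each row ends up with the value from the column named in select for that row.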
How to Filter Data Table Rows with condition on column of Type list() in R
You can use the sapply function to check, for each row, whether any of the values in vals is in Product:
vals = c("UG12210","UG10000-WISD")
dt[Period %chin% "2018-Q1" & sapply(Product, function(v) any(vals %chin% v))]
# Id Period Product
# 1: 1000797366 2018-Q1 UG10000-WISD
# 2: 1000797366 2018-Q1 NX11100,UG10000-WISD,UG12210
# 3: 1000797366 2018-Q1 UG10000-WISD,UG12210
# 4: 1000797366 2018-Q1 UG10000-WISD,UG12210
# 5: 1000797366 2018-Q1 UG12210
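The same row-wise check can be sketched in base R on a plain data.frame with a list column. The three-row table here is invented for illustration, and vapply is used instead of sapply only so that the result is guaranteed to be a logical vector:

```r
# Hypothetical data with a list column called Product
df <- data.frame(Id = 1:3,
                 Period = c("2018-Q1", "2018-Q1", "2018-Q2"),
                 stringsAsFactors = FALSE)
df$Product <- list("UG12210", "NX11100", c("UG10000-WISD", "UG12210"))

vals <- c("UG12210", "UG10000-WISD")

# One TRUE/FALSE per row: does this row's Product contain any of vals?
has_val <- vapply(df$Product, function(v) any(vals %in% v), logical(1))

kept <- df[df$Period == "2018-Q1" & has_val, ]
kept
```

Only the first row survives: the second has no matching product and the third is in a different period.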
Filtering Data Table Row Vector That Lies Between 2 Numeric Vectors
Your data:
df <- read.table(header=TRUE, text='
ID AltID Crit1 Crit2 Crit3
1 1 1 5 10
1 2 3 7 15
1 3 2 6 11')
minCutoff = c(0, 5, 10)
maxCutoff = c(4, 7, 12)
TL;DR:
df[rowSums(mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)) >= 3,]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11
Having a variable number of Crit columns is easily handled with a function to apply to each in turn, and then aggregate the results. If you are already using the dplyr package, then you already have dplyr::between, but if not then here is an acceptable replacement:
between <- function(x, low, hi) low <= x & x <= hi
I'll walk you through the work:
isbetween <- mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)
isbetween
# Crit1 Crit2 Crit3
# [1,] TRUE TRUE TRUE
# [2,] TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE
- df[grepl("Crit", colnames(df))] is one way (of several) of looking at just the columns that are of interest to you;
- mapply applies a function (between, in this case) to the first value of each of the lists/vectors, then to the second, and so on. It is effectively the same as:
between(df[3], minCutoff[1], maxCutoff[1])
between(df[4], minCutoff[2], maxCutoff[2])
...
Now that we have a logical matrix of individual values within their respective cutoffs, we can look at each row to check whether it meets your filter requirement of 3 or more. Unfortunately, your listed expected output is not compatible with your rules, so I'll offer some alternatives:
- "where any 3 columns fall outside the range", meaning if 3 or more columns are FALSE, then the row should be removed:
rowSums(!isbetween) >= 3
# [1] FALSE FALSE FALSE
- "where at least 3 columns fall inside the range", which is what your expected output suggests:
rowSums(isbetween) >= 3
# [1] TRUE FALSE TRUE
Regardless of which you choose, take this logical vector and subset the rows, such as
df[rowSums(isbetween) >= 3,]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11
(The biggest difference between Rui's answer and this one is that Rui's answer uses apply on a data.frame for row-wise operations, implicitly converting the involved columns into a matrix. My answer works column-wise (the natural operation with frames), so no conversion is done. Other than this conversion, if the frame is not huge then the performance of row-wise versus column-wise should be roughly the same. If it is largely asymmetric (e.g., many more rows than columns), then it might be a little faster to work column-wise. Vectorized work in R is almost always much faster than iterative.)
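To make the row-wise versus column-wise distinction concrete, here is a small sketch that computes the same filter both ways on the data from this answer. The row-wise version is my own reconstruction of the general apply approach, not a copy of the other answer's code:

```r
df <- read.table(header = TRUE, text = "
ID AltID Crit1 Crit2 Crit3
1 1 1 5 10
1 2 3 7 15
1 3 2 6 11")
minCutoff <- c(0, 5, 10)
maxCutoff <- c(4, 7, 12)
between <- function(x, low, hi) low <= x & x <= hi
crit <- grep("Crit", colnames(df))

# Column-wise: one vectorized comparison per Crit column
col_wise <- rowSums(mapply(between, df[crit], minCutoff, maxCutoff)) >= 3

# Row-wise: apply() converts the selected columns to a matrix first,
# then compares each row against the full cutoff vectors
row_wise <- apply(df[crit], 1,
                  function(r) sum(between(r, minCutoff, maxCutoff))) >= 3

df[col_wise, ]
```

Both logical vectors select the same rows; only the direction of iteration differs.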
R select rows in dataframe by external vector as index
It is easier to filter by the gene names if you keep them as a column instead of making them rownames.
The following changes to your code will get you the result you are looking for.
library(tidyverse)
df <-data.frame("Names" = c("TIGIT", "ABCB1", "CD8B", "CD8A", "CD1C", "F2RL1", "LCP1", "LAG3", "ABL1", "CD2", "IL12A", "PSEN2", "CD3G", "CD28", "PSEN1", "ITGA1"),"1S" = c("5", "6", "8", "99", "5", "0", "1", "3", "15", "15", "34", "62", "54", "6", "8", "9"), "1T" = c("6", "4", "6", "9", "5", "11", "33", "7", "8", "24", "34", "62", "66", "4", "78", "44"))
genes_to_select <- c("TIGIT", "CD8B", "CD8A", "CD1C", "F2RL1", "LCP1", "LAG3", "CD2", "PSEN2", "CD3G", "CD28", "PSEN1") # genes I want to select
df <-
df %>%
filter(Names %in% genes_to_select) %>%
column_to_rownames("Names") %>%
mutate(across(everything(), as.numeric)) %>%
as.matrix()
df
#> X1S X1T
#> [1,] 5 6
#> [2,] 8 6
#> [3,] 99 9
#> [4,] 5 5
#> [5,] 0 11
#> [6,] 1 33
#> [7,] 3 7
#> [8,] 15 24
#> [9,] 62 62
#> [10,] 54 66
#> [11,] 6 4
#> [12,] 8 78
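If the gene names are still needed as row labels on the final matrix, one possible base-R sketch is shown below. It uses only the first three genes from the data above, and the column names S1 and T1 are simplified stand-ins chosen for this illustration:

```r
# Subset of the example data; S1/T1 are hypothetical column names
df <- data.frame(Names = c("TIGIT", "ABCB1", "CD8B"),
                 S1 = c("5", "6", "8"),
                 T1 = c("6", "4", "6"),
                 stringsAsFactors = FALSE)
genes_to_select <- c("TIGIT", "CD8B")

kept <- df[df$Names %in% genes_to_select, ]   # filter while Names is a column

mat <- as.matrix(kept[c("S1", "T1")])         # character matrix at this point
storage.mode(mat) <- "double"                 # convert the values to numeric
rownames(mat) <- kept$Names                   # label rows with the gene names
mat
```

Rows can then be looked up by gene, for example mat["CD8B", "T1"].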
Filtering a data frame on a vector
You can use the %in% operator:
> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
id x
1 A 1
2 B 2
5 E 5
27 A 27
28 B 28
31 E 31
If your IDs are unique, you can use match():
> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
id x
1 A 1
2 B 2
5 E 5
or make them the rownames of your dataframe and extract by row:
> rownames(df) <- df$id
> df[L, ]
id x
A A 1
B B 2
E E 5
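One practical difference between these approaches is row order, which the short sketch below illustrates (the lookup vector is deliberately given out of order for the demonstration): %in% keeps the rows in the data frame's order, while match() returns them in the order of the lookup vector.

```r
df <- data.frame(id = LETTERS, x = 1:26, stringsAsFactors = FALSE)
L <- c("E", "A", "B")   # deliberately not in alphabetical order

by_in    <- df[df$id %in% L, ]       # rows in the order they appear in df
by_match <- df[match(L, df$id), ]    # rows in the order given by L

by_in$id
by_match$id
```

Note also that match() returns NA for a value that is not present, which would produce a row of NAs rather than silently dropping it.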
Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.
dplyr: Filter based on a vector
This comes down to the structure of the dataset: with a data.frame, if we use [, col], it uses drop = TRUE and coerces the result to a vector, while for a data.table or tibble the default is drop = FALSE, thus returning the tibble itself with a single column. The documentation can be found in ?Extract. A safe option is [[, which has the same behavior for all of them and extracts the column as a vector:
vector_df1 <- df[[3]]
According to ?Extract, the default usage is
x[i, j, ... , drop = TRUE]
and the drop argument is described as
For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.
The documentation for tibble can be found in ?"tbl_df-class":
df[, j] returns a tibble; it does not automatically extract the column inside. df[, j, drop = FALSE] is the default. Read more in subsetting.
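The base data.frame side of this can be checked directly. The sketch below uses a small made-up data frame and only base R, so the tibble behaviour is described in a comment rather than executed:

```r
df <- data.frame(a = 1:3, b = letters[1:3], c = c(0.5, 1.5, 2.5))

v1 <- df[, 3]                # drop = TRUE by default: a plain numeric vector
v2 <- df[, 3, drop = FALSE]  # stays a one-column data.frame
v3 <- df[[3]]                # always the column itself, as a vector

class(v1)
class(v2)
identical(v1, v3)

# With a tibble, the equivalent of v1 would stay a one-column tibble,
# which is why [[ is the safer choice when a vector is required.
```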
Subset rows in a data frame based on a vector of values
This will give you what you want:
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]
The error in your second attempt is because you forgot the comma.
In general, for convenience, the specification object[index] subsets columns for a 2d object. If you want to subset rows and keep all columns you have to use the specification object[index_rows, index_columns], where index_columns can be left blank, which will use all columns by default.
However, you still need to include the comma to indicate that you want to get a subset of rows instead of a subset of columns.
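A small sketch of the difference, using an invented four-row data frame (eg and missing_ids are placeholder names, not the objects from the question):

```r
eg <- data.frame(ID = 1:4, score = c(10, 20, 30, 40))
missing_ids <- c(2, 4)

keep <- !eg$ID %in% missing_ids   # one TRUE/FALSE per row

rows <- eg[keep, ]                # with the comma: subsets rows, keeps all columns

# Without the comma the index is applied to columns. A four-element logical
# index does not fit a two-column data frame, so this fails.
no_comma <- tryCatch(eg[keep], error = function(e) e)

rows
inherits(no_comma, "error")
```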
subset a column in data frame based on another data frame/list
We can use %in% to get a logical vector and subset the rows of 'table1' based on that.
subset(table1, gene_ID %in% accessions40$V1)
A better option would be data.table
library(data.table)
setDT(table1)[gene_ID %chin% accessions40$V1]
Or use filter from dplyr
library(dplyr)
table1 %>%
filter(gene_ID %in% accessions40$V1)
How do you filter with a Dataframe, list, vector etc. to a table in a database in R?
In general joining or merging tables requires them to share the same environment. Hence, there are three general options here:
- Load the remote table into R's local workspace
- Load the CSV table into the database and use a semi-join.
- 'Smuggle' the list of IDs in the CSV into the database
Let's consider each in turn:
Option 1
This is probably the simplest option but it requires that the remote/ODBC table is small enough to fit in R's working memory. If so, you can call local_table = collect(remote_table) to load the database table.
Option 2
dbplyr includes a command copy_to that lets you copy local tables via odbc to a database/remote connection. You will need to have permission to create tables in the remote environment.
This approach makes use of the DBI package. At the time of writing, v1.0.0 of DBI on CRAN has some limitations when writing to non-default schemas, so you may need to upgrade to the development version on GitHub.
Your code will look something like:
DBI::dbWriteTable(db_connection,
                  DBI::Id(schema = "schema", table = "name"),
                  r_table_name)
Option 3
Smuggle the list of IDs into the database via the table definition. This is the same idea as here, and works best if the list of IDs is short.
Remote tables are essentially defined by the code/query that fetches their results. Hence the list of IDs can appear in the code that defines your remote table. Consider the following example:
library(dplyr)
library(dbplyr)
data(mtcars)
list_of_ids = c(1,2,3,4)
df = tbl_lazy(mtcars, con = simulate_mssql())
df %>% filter(ID %in% list_of_ids ) %>% show_query()
show_query() renders the code that defines the current version of the remote table. In the example above it returns the following. Note that the list of IDs now appears in the code.
<SQL>
SELECT *
FROM `df`
WHERE (`ID` IN (1.0, 2.0, 3.0, 4.0))
If the list of IDs is very long, the size of this query will become a problem. Hence there is a limit on the number of IDs you can filter on using this approach (I have not tested this approach to find the limit; I seldom use the IN clause for a list of more than 10).
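To get a rough feel for how the query text grows with the number of IDs, here is a base-R sketch. The helper build_in_query is hypothetical and written only for this illustration; it is not how dbplyr generates SQL, and pasting values into SQL like this is only reasonable for trusted numeric IDs:

```r
# Hypothetical helper: paste numeric IDs into an IN clause
build_in_query <- function(ids) {
  paste0("SELECT * FROM df WHERE ID IN (", paste(ids, collapse = ", "), ")")
}

short_query <- build_in_query(1:4)
long_query  <- build_in_query(1:10000)

short_query
nchar(short_query)
nchar(long_query)   # grows roughly linearly with the number of IDs
```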