How to Filter a Table's Rows Based on an External Vector

How to filter a table's rows based on an external vector?

Use the %in% operator.

# Sample data
dat <- data.frame(patients = 1:5, treatment = letters[1:5],
                  hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
                  response = rnorm(5))

# List of hospitals we want to do further analysis on
goodHosp <- c("yyy", "uuu")

You can either index directly into your data.frame object:

dat[dat$hospital %in% goodHosp, ]

or use the subset command:

subset(dat, hospital %in% goodHosp)
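
For completeness, the same filter can also be written with dplyr (used in later answers on this page as well), assuming the package is installed:

library(dplyr)
filter(dat, hospital %in% goodHosp)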

How to filter an R data.table based on an external column vector

A couple more options:
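
The question's DT and select are not reproduced above. A setup consistent with the expected values below (my reconstruction, not the original data) would be a table whose column names are the entries of select:

library(data.table)
DT <- data.table(V1 = 1:3, V2 = 4:6, V3 = 7:9)  # reconstructed sample data
select <- c("V2", "V3", "V1")                   # one column name per row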

# extended example
DT <- rbind(DT,DT)
select <- c(select,rev(select))
expected <- c(4,8,3,1,8,6)

# create a new column with by
DT[, V1 := .SD[[select]], by = select]$V1

# or use ave
ave( seq(nrow(DT)), select, FUN = function(ii) DT[[ select[ii][1] ]][ii] )

These are both basically doing the same thing: for each value v in select, grab the corresponding column, DT[[v]], and subset it to the rows where select == v.
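
Spelled out as a plain loop (my paraphrase of that idea, not part of the original answer):

# for each row i, pick the value of the column named select[i]
sapply(seq_along(select), function(i) DT[[ select[i] ]][i])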

How to Filter Data Table Rows with condition on column of Type list() in R

You can use sapply function to check if any of the values in vals is in Product for each row:

vals = c("UG12210","UG10000-WISD")

dt[Period %chin% "2018-Q1" & sapply(Product, function(v) any(vals %chin% v))]

# Id Period Product
# 1: 1000797366 2018-Q1 UG10000-WISD
# 2: 1000797366 2018-Q1 NX11100,UG10000-WISD,UG12210
# 3: 1000797366 2018-Q1 UG10000-WISD,UG12210
# 4: 1000797366 2018-Q1 UG10000-WISD,UG12210
# 5: 1000797366 2018-Q1 UG12210
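
The question's dt is not shown above; a sketch of a table with a list-type Product column (values guessed from the printed output, plus one non-matching row) could be built like this:

library(data.table)
dt <- data.table(
  Id = 1000797366,
  Period = c(rep("2018-Q1", 5), "2017-Q4"),
  Product = list(
    "UG10000-WISD",
    c("NX11100", "UG10000-WISD", "UG12210"),
    c("UG10000-WISD", "UG12210"),
    c("UG10000-WISD", "UG12210"),
    "UG12210",
    "NX11100"                      # this row is dropped by the filter
  )
)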

Filtering Data Table Row Vector That Lies Between 2 Numeric Vectors

Your data:

df <- read.table(header=TRUE, text='
ID AltID Crit1 Crit2 Crit3
1 1 1 5 10
1 2 3 7 15
1 3 2 6 11')
minCutoff = c(0, 5, 10)
maxCutoff = c(4, 7, 12)

TL;DR:

df[rowSums(mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)) >= 3,]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11

Having a variable number of Crit columns is easily handled with a function applied to each in turn, then aggregating the results. If you are already using the dplyr package, then you already have dplyr::between, but if not then here is an acceptable replacement:

between <- function(x, low, hi) low <= x & x <= hi

I'll walk you through the work:

isbetween <- mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)
isbetween
# Crit1 Crit2 Crit3
# [1,] TRUE TRUE TRUE
# [2,] TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE
  • df[grep("Crit", colnames(df))] is one way (of several) of looking at just the columns that are of interest to you;

  • mapply applies a function (between, in this case) to the first element of each of the other lists/vectors, then to the second elements, and so on. It is effectively the same as:

    between(df[3], minCutoff[1], maxCutoff[1])
    between(df[4], minCutoff[2], maxCutoff[2])
    ...

Now that we have a logical matrix of individual values within their respective cutoffs, we can look at each row to check whether it meets your filter requirement of 3 or more. Unfortunately, your listed expected output is not compatible with your rules, so I'll offer some alternatives:

  • "where any 3 columns fall outside the range", meaning if 3 or more columns are FALSE, then the row should be removed

    rowSums(!isbetween) >= 3
    # [1] FALSE FALSE FALSE
  • "where at least 3 columns fall inside the range", which is what your expected output suggests:

    rowSums(isbetween) >= 3
    # [1] TRUE FALSE TRUE

Regardless of which you choose, take this logical vector and subset the rows, such as

df[rowSums(isbetween) >= 3,]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11

(The biggest difference between Rui's answer and this is that that answer uses apply on a data.frame for row-wise operations, implicitly converting the involved columns into a matrix. My answer works column-wise (a natural operation with frames), so no conversion is done. Other than this conversion, if the frame is not huge then the performance of row-wise versus column-wise should be roughly the same. If it is highly asymmetric (e.g., many more rows than columns), then it might be a little faster to work column-wise. Vectorized work in R is almost always much faster than iterative.)
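
For comparison, a row-wise sketch in the spirit of that apply-based approach (my sketch, not Rui's exact code, reusing the between helper and cutoffs above):

crit_cols <- grep("Crit", colnames(df))
keep <- apply(df[crit_cols], 1, function(r) sum(between(r, minCutoff, maxCutoff)) >= 3)
df[keep, ]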

R select rows in dataframe by external vector as index

It is easier to filter by the gene names if you keep them as a column instead of making them rownames.

The following changes to your code will get you the result you are looking for.

library(tidyverse)

df <-data.frame("Names" = c("TIGIT", "ABCB1", "CD8B", "CD8A", "CD1C", "F2RL1", "LCP1", "LAG3", "ABL1", "CD2", "IL12A", "PSEN2", "CD3G", "CD28", "PSEN1", "ITGA1"),"1S" = c("5", "6", "8", "99", "5", "0", "1", "3", "15", "15", "34", "62", "54", "6", "8", "9"), "1T" = c("6", "4", "6", "9", "5", "11", "33", "7", "8", "24", "34", "62", "66", "4", "78", "44"))

genes_to_select <- c("TIGIT", "CD8B", "CD8A", "CD1C", "F2RL1", "LCP1", "LAG3", "CD2", "PSEN2", "CD3G", "CD28", "PSEN1") # genes I want to select

df <-
  df %>%
  filter(Names %in% genes_to_select) %>%
  column_to_rownames("Names") %>%
  mutate(across(.fns = as.numeric)) %>%
  as.matrix()

df
#> X1S X1T
#> [1,] 5 6
#> [2,] 8 6
#> [3,] 99 9
#> [4,] 5 5
#> [5,] 0 11
#> [6,] 1 33
#> [7,] 3 7
#> [8,] 15 24
#> [9,] 62 62
#> [10,] 54 66
#> [11,] 6 4
#> [12,] 8 78

Filtering a data frame on a vector

You can use the %in% operator:

> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
id x
1 A 1
2 B 2
5 E 5
27 A 27
28 B 28
31 E 31

If your IDs are unique, you can use match():

> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
id x
1 A 1
2 B 2
5 E 5

or make them the rownames of your dataframe and extract by row:

> rownames(df) <- df$id
> df[L, ]
id x
A A 1
B B 2
E E 5

Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.
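
For instance, a minimal data.table version of the same filter (assuming the df and L defined above) would be:

library(data.table)
dt <- as.data.table(df)
dt[id %in% L]    # same rows as subset(df, id %in% L)

For very large tables, setting a key on id with setkey() and joining on the vector is usually faster still.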

dplyr: Filter based on a vector

This comes down to how the dataset is structured: with a data.frame, [, col] uses drop = TRUE and coerces the result to a vector, while for a data.table or tibble the default is drop = FALSE, so a single-column subset comes back as a tibble rather than a vector. The documentation can be found in ?Extract. The safe option is [[, which behaves the same for all of these classes and extracts the column as a vector.

vector_df1 <- df[[3]]

According to ?Extract, the default usage is

x[i, j, ... , drop = TRUE]

and it is specified as

For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.

The documentation for tibble can be found in ?"tbl_df-class"

df[, j] returns a tibble; it does not automatically extract the column inside. df[, j, drop = FALSE] is the default. Read more in the subsetting section of the tibble documentation.
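
A small illustration of the difference (toy data, not from the original question):

library(tibble)
d  <- data.frame(a = 1:3, b = 4:6)
tb <- as_tibble(d)

d[, 2]     # data.frame: drop = TRUE, returns the vector 4:6
tb[, 2]    # tibble: stays a one-column tibble
d[[2]]     # [[ returns the vector for both
tb[[2]]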

Subset rows in a data frame based on a vector of values

This will give you what you want:

eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]

The error in your second attempt is because you forgot the ,

In general, for convenience, the specification object[index] subsets columns for a 2d object. If you want to subset rows and keep all columns you have to use the specification
object[index_rows, index_columns], where index_columns can be left blank to use all columns by default.

However, you still need to include the , to indicate that you want to get a subset of rows instead of a subset of columns.
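
For instance (toy data):

d <- data.frame(ID = 1:3, x = c("a", "b", "c"))
d[1]     # no comma: the first *column*, as a one-column data.frame
d[1, ]   # with comma: the first *row*, all columns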

subset a column in data frame based on another data frame/list

We can use %in% to get a logical vector and subset the rows of 'table1' based on that.

subset(table1, gene_ID %in% accessions40$V1)

A better option would be data.table

library(data.table)
setDT(table1)[gene_ID %chin% accessions40$V1]

Or use filter from dplyr

library(dplyr)
table1 %>%
filter(gene_ID %in% accessions40$V1)

How do you filter with a Dataframe, list, vector etc. to a table in a database in R?

In general, joining or merging tables requires them to share the same environment (both local to R, or both in the database). Hence, there are three general options here:

  1. Load the remote table into R's local workspace
  2. Load the CSV table into the database and use a semi-join.
  3. 'Smuggle' the list of IDs in the CSV into the database

Let's consider each in turn:

Option 1

This is probably the simplest option but it requires that the remote/ODBC table is small enough to fit in R's working memory. If so, you can call local_table = collect(remote_table) to load the database table.
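
For example (remote_table, csv_df, and the ID column are placeholder names, assuming the CSV has already been read into csv_df):

local_table <- collect(remote_table)                      # pull the whole table into R
filtered <- local_table[local_table$ID %in% csv_df$ID, ]  # then filter locally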

Option 2

dbplyr includes a command copy_to that lets you copy local tables via odbc to a database/remote connection. You will need to have permission to create tables in the remote environment.

This approach makes use of the DBI package. At the time of writing, v1.0.0 of DBI on CRAN has some limitations when writing to non-default schemas, so you may need to upgrade to the development version on GitHub.

Your code will look something like:

DBI::dbWriteTable(db_connection,
                  DBI::Id(schema = "schema", table = "name"),
                  r_table_name)
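
Once the IDs are in the database, a semi-join keeps only the matching rows of the remote table. Roughly, assuming the CSV was read into csv_df and both tables share an ID column (all names here are placeholders):

ids_remote <- dplyr::copy_to(db_connection, csv_df, name = "tmp_ids", temporary = TRUE)
remote_table %>%
  dplyr::semi_join(ids_remote, by = "ID")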

Option 3

Smuggle the list of IDs into the database via the table definition. This works best if the list of IDs is short.

Remote tables are essentially defined by the code/query that fetches their results. Hence the list of IDs can appear in the code that defines your remote table. Consider the following example:

library(dplyr)
library(dbplyr)
data(mtcars)

list_of_ids = c(1,2,3,4)
# mtcars is just a stand-in here; the lazy table is never evaluated, only its SQL is rendered
df = tbl_lazy(mtcars, con = simulate_mssql())
df %>% filter(ID %in% list_of_ids) %>% show_query()

show_query() renders the code that defines the current version of the remote table. In the example above it returns the following - note that the list of IDs now appears in the code.

<SQL>
SELECT *
FROM `df`
WHERE (`ID` IN (1.0, 2.0, 3.0, 4.0))

If the list of IDs is very long, the size of this query will become a problem. Hence there is a limit on the number of IDs you can filter on using this approach (I have not tested this approach to find the limit - I seldom use the IN clause for a list of more than 10).


