Filter Data Frame Rows Based on Values in Vector

Filter data frame rows based on values in vector

nzcoops is spot on with his suggestion. I posed this question in the R Chat a while back and Paul Teetor suggested defining a new function:

`%notin%` <- function(x,y) !(x %in% y)

Which can then be used as follows:

foo <- letters[1:6]

> foo[foo %notin% c("a", "c", "e")]
[1] "b" "d" "f"

Needless to say, this little gem is now in my R profile and gets used quite often.
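
The same operator carries over to data frame rows. Here is a minimal sketch, assuming a hypothetical data frame `df` with a `letter` column and a vector `drop_these` (none of these names come from the original answer). It also shows that base R's `Negate()` can build an equivalent operator; `%nin%` is just an illustrative name.

```r
`%notin%` <- function(x, y) !(x %in% y)

# Hypothetical example data, not from the original answer
df <- data.frame(letter = letters[1:6], value = 1:6)
drop_these <- c("a", "c", "e")

# Keep only the rows whose letter is NOT in drop_these
kept <- df[df$letter %notin% drop_these, ]
print(kept)

# Negate() wraps %in% in a function that returns the opposite result
`%nin%` <- Negate(`%in%`)
identical(df$letter %notin% drop_these, df$letter %nin% drop_these)  # TRUE
```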

Select rows from a data frame based on values in a vector

Have a look at ?"%in%".

dt[dt$fct %in% vc,]
   fct X
1    a 2
3    c 3
5    c 5
7    a 7
9    c 9
10   a 1
12   c 2
14   c 4

You could also use ?is.element:

dt[is.element(dt$fct, vc),]
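
Neither `dt` nor `vc` is defined in the answer above, so the sketch below makes up small stand-ins to check that the two forms select the same rows. The data values are assumptions for illustration only.

```r
# Hypothetical stand-ins for the answer's dt and vc
dt <- data.frame(fct = c("a", "b", "c", "d", "a"), X = 1:5)
vc <- c("a", "c")

by_in      <- dt[dt$fct %in% vc, ]
by_element <- dt[is.element(dt$fct, vc), ]

# Both %in% and is.element() are thin wrappers around match()
identical(by_in, by_element)  # TRUE
```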

Filter data frame matching all values of a vector

Here's another dplyr solution without ever leaving the pipe:

library(dplyr)

ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')

x <- data.frame(ID, Hour)

testVector <- c('0','2','5')

x %>%
  group_by(ID) %>%
  mutate(contains = Hour %in% testVector) %>%  # flag hours found in testVector
  summarise(all = sum(contains)) %>%           # count the matches per ID
  filter(all > 2) %>%                          # keep IDs that matched all three values
  select(-all) %>%
  inner_join(x, by = "ID")                     # recover the original rows

##       ID   Hour
##   <fctr> <fctr>
## 1      A      0
## 2      A      2
## 3      A      5
## 4      A      6
## 5      A      9
## 6      B      0
## 7      B      2
## 8      B      5
## 9      B      6
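
For comparison, the same idea can be sketched in base R. Instead of counting matches against a hard-coded threshold, this version asks directly whether every element of `testVector` occurs within each ID, which should also hold up if an ID repeats an hour. The names `has_all` and `result` are illustrative.

```r
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
x <- data.frame(ID, Hour)
testVector <- c('0','2','5')

# For each ID, check that every value of testVector occurs among its hours
has_all <- tapply(as.character(x$Hour), as.character(x$ID),
                  function(h) all(testVector %in% h))

# Keep the rows belonging to IDs that passed the check
result <- x[x$ID %in% names(has_all)[has_all], ]
print(result)
```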

Filter values in a list of data frames based on a vector, and add rows for vector values not contained in the data frames

1. Create a data.frame with all the rows you want:
data.frame(province=vector)
2. Merge this with the data frame you do have, setting all.x=TRUE (so every row from point 1 is retained, and filled with NA if necessary):
merge(data.frame(province=vector), df1, all.x=TRUE)
Done!

> merge(data.frame(province=vector), df1, all.x=TRUE)
  province value value2
1    prov1    23     25
2    prov2    NA     NA
3    prov3    56     57
4    prov4    NA     NA
5    prov5    93     83
6    prov6    NA     NA

Bonus 1: you can trivially loop this with lapply
lapply(list_df, function(df) merge(data.frame(province=vector), df, all.x=TRUE))
(if you have a lot of data frames you want to apply this to, you will probably want to avoid re-building the vector data frame anonymously each time but create it as a named data frame instead)
Bonus 2: all base-r with no dependencies whatsoever
Bonus 3: you did say it doesn't matter, but the rows are in order as in vector
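
To make the lapply variant concrete, here is a self-contained sketch. The inputs are assumptions chosen to match the printed output above, and `template`, `list_df`, and `filled` are illustrative names. The template data frame is built once and reused, as suggested under Bonus 1.

```r
# Hypothetical inputs, chosen to match the printed output above
vector <- paste0("prov", 1:6)
df1 <- data.frame(province = c("prov1", "prov3", "prov5"),
                  value    = c(23, 56, 93),
                  value2   = c(25, 57, 83))

# Build the template once and reuse it for every data frame in the list
template <- data.frame(province = vector)
list_df  <- list(first = df1, second = df1[1:2, ])

# all.x = TRUE keeps every template row and fills the gaps with NA
filled <- lapply(list_df, function(df) merge(template, df, all.x = TRUE))
print(filled$second)
```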

How to filter a table's row based on an external vector?

Use the %in% operator.

#Sample data
dat <- data.frame(patients = 1:5, treatment = letters[1:5],
  hospital = c("yyy", "yyy", "zzz", "www", "uuu"), response = rnorm(5))

#List of hospitals we want to do further analysis on
goodHosp <- c("yyy", "uuu")

You can either index directly into your data.frame object:

dat[dat$hospital %in% goodHosp ,]

or use the subset command:

subset(dat, hospital %in% goodHosp)

Filtering a data frame on a vector

You can use the %in% operator:

> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
   id  x
1   A  1
2   B  2
5   E  5
27  A 27
28  B 28
31  E 31

If your IDs are unique, you can use match():

> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
  id x
1  A 1
2  B 2
5  E 5

or make them the rownames of your dataframe and extract by row:

> rownames(df) <- df$id
> df[L, ]
  id x
A  A 1
B  B 2
E  E 5
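
One thing to watch with match(): a value that is missing from the data frame yields NA, and indexing with NA produces a row of NAs. The sketch below uses a made-up ID ("zz") to show this, and uses the `nomatch = 0` argument to drop the miss. Note that match() returns rows in the order of `L`, whereas %in% keeps the order of the data frame.

```r
df <- data.frame(id = LETTERS, x = 1:26)
L <- c("B", "A", "zz")   # "zz" is a made-up ID that is not in df

# match() returns NA for "zz", which indexes an all-NA row
with_na <- df[match(L, df$id), ]

# nomatch = 0 turns the miss into index 0, which selects nothing
clean <- df[match(L, df$id, nomatch = 0), ]
print(clean)
```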

Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.

dplyr: Filter based on a vector

This comes down to the structure of the dataset. With a data.frame, df[, col] uses drop = TRUE and coerces a single column to a vector, whereas a data.table or tibble defaults to drop = FALSE and returns an object with a single column. The documentation can be found in ?Extract. The safe option is [[, which extracts the column as a vector in all of these cases:

vector_df1 <- df[[3]]

According to ?Extract, the default usage is

x[i, j, ... , drop = TRUE]

and it is specified as

For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.

The documentation for tibble can be found in ?"tbl_df-class"

df[, j] returns a tibble; it does not automatically extract the column inside. df[, j, drop = FALSE] is the default. Read more in subsetting.
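
The base data.frame side of this can be checked directly. A minimal sketch with a hypothetical three-column data frame (the original `df` is not shown, so the columns here are assumptions):

```r
# Hypothetical data frame; the original df is not shown
df <- data.frame(a = 1:3, b = letters[1:3], c = c(2.5, 3.5, 4.5))

as_vector <- df[, 3]                # drop = TRUE by default for one column
as_frame  <- df[, 3, drop = FALSE]  # stays a one-column data frame
by_double <- df[[3]]                # always the column itself

is.data.frame(as_vector)            # FALSE
is.data.frame(as_frame)             # TRUE
identical(as_vector, by_double)     # TRUE
```

Because `[[` does not depend on the drop default, it is the form to prefer when a vector is needed for a later %in% test.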

Subset dataframe rows based on character vector when %in% and which are not working

(Just adding my comment as an answer since it was posted before the other ones)

The problem is that in vec you have dots, whereas in df$Specimen.Label you have hyphens, so your first commands do not return anything. If you write instead

df[df$Specimen.Label %in% gsub("\\.", "-", vec),]

you obtain

#     PCC Participant.ID                    Specimen.Label
# 3  PNNL        01CO008    8cc7e656-0152-4359-8566-0581c3
# 6  PNNL        05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
# 8  PNNL        11CO051 b3696374-c6c0-49dd-833e-596e26_D2
# 10 PNNL        11CO053 e1cd3d70-132b-452f-ba10-026721_D2

Another base R option is to use the function subset

subset(df, Specimen.Label %in% gsub("\\.", "-", vec))
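
When %in% unexpectedly returns nothing, setdiff() is a quick way to see which values fail to match exactly. The sketch below uses shortened, made-up labels rather than the real data from the question, and passes `fixed = TRUE` to gsub() as an alternative to escaping the dot.

```r
# Hypothetical, shortened labels; the real data is in the question
df <- data.frame(Specimen.Label = c("8cc7-0152", "f635-0046_D2", "aaaa-1111"))
vec <- c("8cc7.0152", "f635.0046_D2")

# Values in vec with no exact counterpart in the data frame
setdiff(vec, df$Specimen.Label)

# With fixed = TRUE the "." is a literal dot, so no escaping is needed
fixed <- gsub(".", "-", vec, fixed = TRUE)
setdiff(fixed, df$Specimen.Label)   # character(0): every value is now found

matched <- df[df$Specimen.Label %in% fixed, , drop = FALSE]
print(matched)
```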

Filtering rows based on partial matching between a data frame and a vector

We can paste the elements of 'vector' into a single string collapsed by | and use that in grepl or str_detect to filter the rows

library(dplyr)
library(stringr)
df %>% 
   filter(str_detect(nam, str_c(vector, collapse="|")))
#           nam    aa
#1 mmu_mir-1-3p 12854
#2 mmu_mir-1-5p    36
#3 mmu-mir-3-5p  5489
#4 mmu-mir-6-3p  2563

In base R, this can be done with subset/grepl

subset(df, grepl(paste(vector, collapse= "|"), nam))
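
Collapsing with | treats every element of the vector as a regular expression, so characters such as "." or "+" take on their special meaning. If the values should be matched literally, one option is to test each one with `fixed = TRUE` and combine the results. This is a sketch on made-up data modelled on the output above; `hits` and `result` are illustrative names.

```r
# Hypothetical data modelled on the output above
df <- data.frame(nam = c("mmu_mir-1-3p", "mmu_mir-1-5p", "mmu-let-7a", "mmu-mir-3-5p"),
                 aa  = c(12854, 36, 100, 5489))
vector <- c("mir-1", "mir-3")

# Test each pattern as a literal substring, then combine the results with OR
hits <- Reduce(`|`, lapply(vector, grepl, x = df$nam, fixed = TRUE))
result <- df[hits, ]
print(result)
```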

How to filter a dataframe using a preset vector in R

Use %in%:

df %>% 
  filter(code %in% x)
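
It may help to see why %in% is used here rather than ==. The base R sketch below uses hypothetical data (only the names `code` and `x` mirror the snippet above): == compares position by position and recycles the shorter vector, while %in% tests membership.

```r
# Hypothetical data; `code` and `x` mirror the names used above
df <- data.frame(code = c("a", "b", "c", "d"), n = 1:4)
x <- c("b", "a")

# == recycles x and compares pairwise: a==b, b==a, c==b, d==a
df$code == x      # FALSE FALSE FALSE FALSE

# %in% asks, for each code, whether it appears anywhere in x
df$code %in% x    # TRUE TRUE FALSE FALSE

kept <- df[df$code %in% x, ]
print(kept)
```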
