Filter Data Frame Rows Based on Values in Vector

nzcoops is spot on with his suggestion. I posed this question in the R Chat a while back and Paul Teetor suggested defining a new function:

`%notin%` <- function(x,y) !(x %in% y) 

Which can then be used as follows:

foo <- letters[1:6]

> foo[foo %notin% c("a", "c", "e")]
[1] "b" "d" "f"

Needless to say, this little gem is now in my R profile and gets used quite often.
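
Since the question is about data frame rows, the same operator can be applied to a column. A minimal sketch, assuming a hypothetical data frame `df` with columns `grp` and `val` (neither appears in the original answer):

```r
# Negated %in%, as defined above
`%notin%` <- function(x, y) !(x %in% y)

# Hypothetical example data, not from the original answer
df <- data.frame(grp = letters[1:6], val = 1:6, stringsAsFactors = FALSE)

# Keep only the rows whose grp is NOT in the exclusion vector
kept <- df[df$grp %notin% c("a", "c", "e"), ]
kept$grp
# [1] "b" "d" "f"
```

An equivalent definition that avoids writing the wrapper by hand is `Negate(`%in%`)`.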

Select rows from a data frame based on values in a vector

Have a look at ?"%in%".

dt[dt$fct %in% vc,]
   fct X
1    a 2
3    c 3
5    c 5
7    a 7
9    c 9
10   a 1
12   c 2
14   c 4

You could also use ?is.element:

dt[is.element(dt$fct, vc),]
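
The answer does not show how `dt` and `vc` were built. A small sketch with made-up data (the values below are assumptions, not the asker's data) suggests that the two forms select the same rows:

```r
# Hypothetical data; the original dt and vc are not shown in the answer
dt <- data.frame(fct = c("a", "b", "c", "b", "c"), X = 1:5,
                 stringsAsFactors = FALSE)
vc <- c("a", "c")

by_in      <- dt[dt$fct %in% vc, ]
by_element <- dt[is.element(dt$fct, vc), ]

# Both forms rely on match() internally, so they pick the same rows
identical(by_in, by_element)
# [1] TRUE
```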

Filter data frame matching all values of a vector

Here's another dplyr solution without ever leaving the pipe:

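library(dplyr)

ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')

x <- data.frame(ID, Hour)

testVector <- c('0','2','5')

x %>%
  group_by(ID) %>%
  mutate(contains = Hour %in% testVector) %>%  # flag rows whose Hour is in testVector
  summarise(all = sum(contains)) %>%           # count flagged rows per ID
  filter(all > 2) %>%                          # keep IDs where all three values occur
  select(-all) %>%
  inner_join(x, by = "ID")                     # bring back every row of the kept IDs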

##       ID   Hour
##   <fctr> <fctr>
## 1      A      0
## 2      A      2
## 3      A      5
## 4      A      6
## 5      A      9
## 6      B      0
## 7      B      2
## 8      B      5
## 9      B      6
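
Counting matches with `sum(contains)` relies on each `Hour` appearing at most once per `ID`. A base R sketch of the same idea that asks directly whether every element of `testVector` occurs within the group (the names `has_all` and `res` are made up for illustration):

```r
ID   <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
x <- data.frame(ID, Hour, stringsAsFactors = FALSE)
testVector <- c('0','2','5')

# For each ID, TRUE when the group's Hour values cover all of testVector
has_all <- tapply(x$Hour, x$ID, function(h) all(testVector %in% h))

# Keep the rows of every ID that passed the test
res <- x[x$ID %in% names(has_all)[has_all], ]
unique(res$ID)
# [1] "A" "B"
```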

filter values in a list of dataframes based on a vector, and add rows for vector values not contained in dataframes

  1. Create a data.frame with all the rows you want

    data.frame(province=vector)

  2. Merge this with the data frame you do have, setting all.x=TRUE (so every row from point 1 is retained, and filled with NA if necessary)

    merge(data.frame(province=vector), df1, all.x=TRUE)

  3. Done!

> merge(data.frame(province=vector), df1, all.x=TRUE)
  province value value2
1    prov1    23     25
2    prov2    NA     NA
3    prov3    56     57
4    prov4    NA     NA
5    prov5    93     83
6    prov6    NA     NA

  • Bonus 1: you can trivially loop this with lapply

    lapply(list_df, function(df) merge(data.frame(province=vector), df, all.x=TRUE))

    (if you have a lot of data frames you want to apply this to, you will probably want to avoid re-building the vector data frame anonymously each time but create it as a named data frame instead)

  • Bonus 2: all base-r with no dependencies whatsoever

  • Bonus 3: you did say it doesn't matter, but the rows come out sorted by province (merge sorts on the key by default), which here matches the order in vector
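
If the order of `vector` matters and `vector` is not already sorted, the merged rows can be put back in that order with `match()`. A sketch with made-up data (`provinces` and `df1` below are hypothetical stand-ins for the asker's objects):

```r
# Hypothetical lookup vector, deliberately not in sorted order
provinces <- c("prov3", "prov1", "prov2")
df1 <- data.frame(province = c("prov1", "prov3"), value = c(23, 56),
                  stringsAsFactors = FALSE)

# merge() sorts the result on the key column by default
res <- merge(data.frame(province = provinces, stringsAsFactors = FALSE),
             df1, all.x = TRUE)
res$province
# [1] "prov1" "prov2" "prov3"

# Reorder the merged rows to follow the lookup vector
res <- res[match(provinces, res$province), ]
res$province
# [1] "prov3" "prov1" "prov2"
```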

How to filter a table's row based on an external vector?

Use the %in% operator.

#Sample data
dat <- data.frame(patients = 1:5, treatment = letters[1:5],
hospital = c("yyy", "yyy", "zzz", "www", "uuu"), response = rnorm(5))

#List of hospitals we want to do further analysis on
goodHosp <- c("yyy", "uuu")

You can either index directly into your data.frame object:

dat[dat$hospital %in% goodHosp ,]

or use the subset command:

subset(dat, hospital %in% goodHosp)

Filtering a data frame on a vector

You can use the %in% operator:

> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
   id  x
1   A  1
2   B  2
5   E  5
27  A 27
28  B 28
31  E 31

If your IDs are unique, you can use match():

> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
  id x
1  A 1
2  B 2
5  E 5

or make them the rownames of your dataframe and extract by row:

> rownames(df) <- df$id
> df[L, ]
  id x
A  A 1
B  B 2
E  E 5
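
The `%in%` and `match()` approaches differ in two ways worth knowing: `%in%` keeps the data frame's row order and every matching row, while `match()` follows the order of the lookup vector and returns only the first match for each value. A small sketch (the reordered vector `L2` is introduced here for illustration):

```r
df <- data.frame(id = c(LETTERS, LETTERS), x = 1:52, stringsAsFactors = FALSE)
L2 <- c("E", "A")  # hypothetical lookup vector, in a different order than df

# %in%: all matching rows, in data frame order
subset(df, id %in% L2)$x
# [1]  1  5 27 31

# match(): first match only, in the order of L2
df[match(L2, df$id), "x"]
# [1] 5 1
```

A value in the lookup vector that is absent from `df$id` makes `match()` return NA, which yields a row of NAs rather than being silently dropped.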

Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.

dplyr: Filter based on a vector

This comes down to the structure of the dataset. With a data.frame, [, col] uses drop = TRUE by default and coerces a single column to a vector, whereas for a data.table or tibble the default is drop = FALSE, so you get back a one-column tibble. The documentation can be found in ?Extract. The safe option is [[, which extracts the column as a vector in both cases

vector_df1 <- df[[3]]

According to ?Extract, the default usage is

x[i, j, ... , drop = TRUE]

and the drop argument is documented as

drop: For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.

The documentation for tibble can be found in ?"tbl_df-class"

df[, j] returns a tibble; it does not automatically extract the column inside. df[, j, drop = FALSE] is the default. Read more in subsetting.
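
The data.frame side of this is easy to check without loading tibble. A minimal sketch using a made-up `df` (not the asker's data):

```r
# Hypothetical three-column data frame
df <- data.frame(a = 1:3, b = letters[1:3], c = c(2.5, 3.5, 4.5))

is.data.frame(df[, 3])                # FALSE: drop = TRUE gives a plain vector
is.data.frame(df[, 3, drop = FALSE])  # TRUE: still a one-column data frame
identical(df[[3]], df[, 3])           # TRUE: [[ extracts the column itself
```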

Subset dataframe rows based on character vector when %in% and which are not working

(Just adding my comment as an answer since it was posted before the other ones)

The problem is that in vec you have dots, whereas in df$Specimen.Label you have hyphens, so your first commands do not return anything. If you write instead

df[df$Specimen.Label %in% gsub("\\.", "-", vec),]

you obtain

#     PCC Participant.ID                    Specimen.Label
# 3  PNNL        01CO008    8cc7e656-0152-4359-8566-0581c3
# 6  PNNL        05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
# 8  PNNL        11CO051 b3696374-c6c0-49dd-833e-596e26_D2
# 10 PNNL        11CO053 e1cd3d70-132b-452f-ba10-026721_D2

Another base R option is to use the function subset

subset(df, Specimen.Label %in% gsub("\\.", "-", vec))
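
The mismatch is easy to reproduce. A sketch with shortened, made-up labels (not the asker's data); `fixed = TRUE` is shown as an alternative to escaping the dot as `\\.`:

```r
# Hypothetical labels: hyphens in the data frame, dots in the lookup vector
df  <- data.frame(Specimen.Label = c("8cc7-0152", "f635-0046", "b369-c6c0"),
                  stringsAsFactors = FALSE)
vec <- c("8cc7.0152", "b369.c6c0")

# Dots never equal hyphens, so nothing matches
sum(df$Specimen.Label %in% vec)
# [1] 0

# Replace the literal dots first; fixed = TRUE treats "." as a plain character
fixed_vec <- gsub(".", "-", vec, fixed = TRUE)
matched <- df$Specimen.Label[df$Specimen.Label %in% fixed_vec]
matched
# [1] "8cc7-0152" "b369-c6c0"
```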

Filtering rows based on partial matching between a data frame and a vector

We can paste the elements of 'vector' into a single string collapsed by | and use that in grepl or str_detect to filter the rows

library(dplyr)
library(stringr)
df %>%
  filter(str_detect(nam, str_c(vector, collapse="|")))
#           nam    aa
#1 mmu_mir-1-3p 12854
#2 mmu_mir-1-5p    36
#3 mmu-mir-3-5p  5489
#4 mmu-mir-6-3p  2563

In base R, this can be done with subset/grepl

subset(df, grepl(paste(vector, collapse= "|"), nam))
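
Collapsing with `|` treats each element as a regular expression, so elements containing characters such as `.` or `+` can match more than intended. A base R sketch that matches each element literally instead, using made-up data (`df`, `nam` and `vector` mirror the names above, but the values are assumptions):

```r
# Hypothetical data, not the asker's
df <- data.frame(nam = c("mmu_mir-1-3p", "mmu_mir-10-5p", "mmu-mir-3-5p"),
                 aa  = c(12854, 36, 5489), stringsAsFactors = FALSE)
vector <- c("mir-1-", "mir-3-")

# One logical vector per pattern, matched as fixed strings, then combined with OR
hits <- Reduce(`|`, lapply(vector, function(p) grepl(p, df$nam, fixed = TRUE)))
df$nam[hits]
# [1] "mmu_mir-1-3p" "mmu-mir-3-5p"
```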

How to filter a dataframe using a preset vector in R

Use %in%:

df %>% 
filter(code %in% x)

