Filter data frame rows based on values in vector
nzcoops is spot on with his suggestion. I posed this question in the R Chat a while back and Paul Teetor suggested defining a new function:
`%notin%` <- function(x,y) !(x %in% y)
Which can then be used as follows:
foo <- letters[1:6]
> foo[foo %notin% c("a", "c", "e")]
[1] "b" "d" "f"
Needless to say, this little gem is now in my R profile and gets used quite often.
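The same operator carries over to data frame rows. A minimal sketch, assuming a made-up data frame `df` and exclusion vector `drop_ids` (neither appears in the original answer):

```r
# Negated %in%, as defined above
`%notin%` <- function(x, y) !(x %in% y)

# Hypothetical data, for illustration only
df <- data.frame(id = letters[1:6], value = 1:6, stringsAsFactors = FALSE)
drop_ids <- c("a", "c", "e")

# Keep only the rows whose id is NOT in drop_ids
kept <- df[df$id %notin% drop_ids, ]
kept$id
# [1] "b" "d" "f"
```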
Select rows from a data frame based on values in a vector
Have a look at ?"%in%".
dt[dt$fct %in% vc,]
fct X
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
You could also use ?is.element:
dt[is.element(dt$fct, vc),]
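The answer does not show how `dt` and `vc` were built, so here is a small self-contained sketch with made-up values (an assumption, not the original data) showing that both calls select the same rows:

```r
# Hypothetical stand-ins for dt and vc
dt <- data.frame(fct = c("a", "b", "c", "b", "a"), X = 1:5,
                 stringsAsFactors = FALSE)
vc <- c("a", "c")

by_in      <- dt[dt$fct %in% vc, ]          # rows 1, 3, 5
by_element <- dt[is.element(dt$fct, vc), ]  # same rows

identical(by_in, by_element)
# [1] TRUE
```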
Filter data frame matching all values of a vector
Here's another dplyr solution without ever leaving the pipe:
library(dplyr)
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
x <- data.frame(ID, Hour)
testVector <- c('0','2','5')
x %>%
  group_by(ID) %>%
  mutate(contains = Hour %in% testVector) %>%
  summarise(all = sum(contains)) %>%
  filter(all > 2) %>%
  select(-all) %>%
  inner_join(x)
## ID Hour
## <fctr> <fctr>
## 1 A 0
## 2 A 2
## 3 A 5
## 4 A 6
## 5 A 9
## 6 B 0
## 7 B 2
## 8 B 5
## 9 B 6
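Note that the pipeline above counts matching rows, so a group that repeats an hour could reach the threshold without containing every value. A base R sketch of a stricter check, which tests directly whether every element of testVector occurs in each group (the helper names `has_all` and `keep_ids` are mine, not from the answer):

```r
ID   <- c("A","A","A","A","A","B","B","B","B","C","C")
Hour <- c("0","2","5","6","9","0","2","5","6","0","2")
x <- data.frame(ID, Hour, stringsAsFactors = FALSE)
testVector <- c("0", "2", "5")

# For each ID: TRUE only if all values of testVector appear in that group's Hour
has_all  <- tapply(x$Hour, x$ID, function(h) all(testVector %in% h))
keep_ids <- names(has_all)[has_all]

# Keep every row belonging to a qualifying ID
result <- x[x$ID %in% keep_ids, ]
unique(result$ID)
# [1] "A" "B"
```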
filter values in a list of dataframes based on a vector, and add rows for vector values not contained in dataframes
1. Create a data.frame with all the rows you want:
data.frame(province=vector)
2. Merge this with the data frame you do have, setting all.x=TRUE (so every row from point 1 is retained, and filled with NA if necessary):
merge(data.frame(province=vector), df1, all.x=TRUE)
3. Done!
> merge(data.frame(province=vector), df1, all.x=TRUE)
province value value2
1 prov1 23 25
2 prov2 NA NA
3 prov3 56 57
4 prov4 NA NA
5 prov5 93 83
6 prov6 NA NA
Bonus 1: you can trivially loop this with lapply:
lapply(list_df, function(df) merge(data.frame(province=vector), df, all.x=TRUE))
(if you have a lot of data frames you want to apply this to, you will probably want to avoid re-building the vector data frame anonymously each time but create it as a named data frame instead)
Bonus 2: all base R with no dependencies whatsoever
Bonus 3: you did say it doesn't matter, but the rows are in order as in vector
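Putting the steps and Bonus 1 together, here is a sketch that builds the template data frame once and reuses it. The names `provinces`, `template` and the contents of `list_df` are made up for illustration; the question's real data is not shown here.

```r
# Hypothetical vector of wanted provinces and a named template built once
provinces <- paste0("prov", 1:6)
template  <- data.frame(province = provinces, stringsAsFactors = FALSE)

# Hypothetical list of data frames; "prov9" is not in the vector
list_df <- list(
  first  = data.frame(province = c("prov1", "prov3", "prov5"),
                      value = c(23, 56, 93), stringsAsFactors = FALSE),
  second = data.frame(province = c("prov2", "prov9"),
                      value = c(10, 20), stringsAsFactors = FALSE)
)

# all.x = TRUE keeps every template row (NA where missing);
# rows whose province is not in the template are dropped
filled <- lapply(list_df, function(d) merge(template, d, all.x = TRUE))
filled$second
```

Each element of `filled` should have one row per entry of `provinces`, with NA in the value column where the original data frame had no matching province.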
How to filter a table's row based on an external vector?
Use the %in% operator.
#Sample data
dat <- data.frame(patients = 1:5, treatment = letters[1:5],
hospital = c("yyy", "yyy", "zzz", "www", "uuu"), response = rnorm(5))
#List of hospitals we want to do further analysis on
goodHosp <- c("yyy", "uuu")
You can either index directly into your data.frame object:
dat[dat$hospital %in% goodHosp ,]
or use the subset command:
subset(dat, hospital %in% goodHosp)
Filtering a data frame on a vector
You can use the %in% operator:
> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
id x
1 A 1
2 B 2
5 E 5
27 A 27
28 B 28
31 E 31
If your IDs are unique, you can use match():
> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
id x
1 A 1
2 B 2
5 E 5
or make them the rownames of your dataframe and extract by row:
> rownames(df) <- df$id
> df[L, ]
id x
A A 1
B B 2
E E 5
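One practical difference between the two approaches: %in% returns rows in data frame order, while match() returns them in the order of the lookup vector and gives NA for values that are not found. A small sketch, assuming a lookup vector with a deliberately missing value ("zz" is made up):

```r
df <- data.frame(id = LETTERS, x = 1:26, stringsAsFactors = FALSE)
L  <- c("E", "A", "zz")   # "zz" does not occur in df$id

idx <- match(L, df$id)    # 5 1 NA

# Drop the NA positions before indexing, otherwise an all-NA row appears
by_match <- df[idx[!is.na(idx)], ]
by_in    <- df[df$id %in% L, ]

by_match$id   # order of L:  "E" "A"
by_in$id      # order of df: "A" "E"
```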
Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.
dplyr: Filter based on a vector
This comes down to the structure of the dataset. With a data.frame, [, col] uses drop = TRUE and coerces the result to a vector, while for a data.table or tibble the default is drop = FALSE, so the result is the tibble itself with a single column. The documentation can be found in ?Extract. The safe option is [[, which extracts a column as a vector in all of these cases:
vector_df1 <- df[[3]]
According to ?Extract, the default usage is
x[i, j, ... , drop = TRUE]
and the drop argument is described as
For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.
The documentation for tibble can be found in ?"tbl_df-class":
df[, j] returns a tibble; it does not automatically extract the column inside. df[, j, drop = FALSE] is the default. Read more in subsetting.
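The data.frame side of this can be checked in base R without loading tibble. A minimal sketch with a made-up three-column data frame:

```r
# Hypothetical data frame, for illustration only
df <- data.frame(a = 1:3, b = letters[1:3], c = c(2.5, 3.5, 4.5))

v1 <- df[, 3]                 # drop = TRUE by default: plain numeric vector
d1 <- df[, 3, drop = FALSE]   # stays a one-column data.frame
v2 <- df[[3]]                 # always extracts the column as a vector

is.data.frame(d1)   # TRUE
is.numeric(v1)      # TRUE
identical(v1, v2)   # TRUE
```

Because [[ behaves the same way on a tibble, using it for column extraction avoids depending on which class the object happens to have.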
Subset dataframe rows based on character vector when %in% and which are not working
(Just adding my comment as an answer since it was posted before the other ones)
The problem is that vec contains dots, whereas df$Specimen.Label contains hyphens, so your first commands do not return anything. If you write instead
df[df$Specimen.Label %in% gsub("\\.", "-", vec),]
you obtain
# PCC Participant.ID Specimen.Label
# 3 PNNL 01CO008 8cc7e656-0152-4359-8566-0581c3
# 6 PNNL 05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
# 8 PNNL 11CO051 b3696374-c6c0-49dd-833e-596e26_D2
# 10 PNNL 11CO053 e1cd3d70-132b-452f-ba10-026721_D2
Another base R option is to use the function subset:
subset(df, Specimen.Label %in% gsub("\\.", "-", vec))
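When %in% unexpectedly returns nothing, setdiff() is a quick way to see which lookup values have no counterpart in the column. A sketch with shortened, made-up labels (not the real specimen IDs from the question):

```r
# Hypothetical labels: hyphens in the data frame, dots in the lookup vector
df  <- data.frame(Specimen.Label = c("8cc7-0152", "f635-0046", "b369-c6c0"),
                  stringsAsFactors = FALSE)
vec <- c("8cc7.0152", "b369.c6c0")

# Values of vec that are absent from the column: here, all of them
unmatched_before <- setdiff(vec, df$Specimen.Label)

# fixed = TRUE treats "." literally, so no regex escaping is needed
fixed_vec <- gsub(".", "-", vec, fixed = TRUE)
unmatched_after <- setdiff(fixed_vec, df$Specimen.Label)

length(unmatched_before)   # 2
length(unmatched_after)    # 0
```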
Filtering rows based on partial matching between a data frame and a vector
We can paste the elements of 'vector' into a single string collapsed by | and use that in grepl or str_detect to filter the rows:
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(nam, str_c(vector, collapse="|")))
# nam aa
#1 mmu_mir-1-3p 12854
#2 mmu_mir-1-5p 36
#3 mmu-mir-3-5p 5489
#4 mmu-mir-6-3p 2563
In base R, this can be done with subset/grepl:
subset(df, grepl(paste(vector, collapse= "|"), nam))
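One caveat with the collapsed pattern: any regex metacharacters in the vector (such as .) are interpreted as pattern syntax. If the elements are meant as literal substrings, a possible alternative is to test each one with fixed = TRUE and combine the results. The data and the name `keys` below are made up for illustration:

```r
# Hypothetical data; the last two names differ only in "-" vs "."
df <- data.frame(nam = c("mmu_mir-1-3p", "mmu-mir-3-5p", "mmu-mir-3.5p"),
                 aa  = c(12854, 5489, 2563), stringsAsFactors = FALSE)
keys <- c("mir-1", "3.5p")

# Regex version: "." matches any character, so "3-5p" also matches
regex_hits <- grepl(paste(keys, collapse = "|"), df$nam)

# Literal version: one fixed-string test per key, OR-ed together
# (assumes keys is non-empty)
fixed_hits <- Reduce(`|`, lapply(keys, function(k) grepl(k, df$nam, fixed = TRUE)))

which(regex_hits)   # 1 2 3
which(fixed_hits)   # 1 3
```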
How to filter a dataframe using a preset vector in R
Use %in%:
df %>%
  filter(code %in% x)