Subset rows in a data frame based on a vector of values
This will give you what you want:
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]
The error in your second attempt occurs because you forgot the comma.
For a data frame, object[index] with a single index selects columns. If you want to subset rows and keep all columns, you have to use object[index_rows, index_columns], where index_columns can be left blank, in which case all columns are returned.
You still need to include the comma to indicate that you want a subset of rows instead of a subset of columns.
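To make the difference concrete, here is a minimal sketch. The data frame df and the vector drop_ids are made up for illustration and are not the data from the question.

```r
# Hypothetical data, for illustration only
df <- data.frame(ID = 1:4, score = c(10, 20, 30, 40))
drop_ids <- c(2, 4)

# Logical vector with one entry per row: TRUE FALSE TRUE FALSE
keep <- !df$ID %in% drop_ids

# With the comma: rows are filtered and all columns are kept
rows_kept <- df[keep, ]

# Without the comma the same vector is used as a column index.
# It is longer than the number of columns, so R signals an error;
# tryCatch() captures the message instead of stopping the script.
no_comma <- tryCatch(df[keep], error = function(e) conditionMessage(e))

print(rows_kept)
print(no_comma)
```

The second call should fail with an "undefined columns selected" style message, which is the usual symptom of the missing comma.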
Select rows from a data frame based on values in a vector
Have a look at ?"%in%".
dt[dt$fct %in% vc,]
fct X
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
You could also use ?is.element:
dt[is.element(dt$fct, vc),]
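Since dt and vc are not defined in the answer, here is a small self-contained sketch with made-up values. It only assumes that fct is the column to test and vc holds the values to keep.

```r
# Hypothetical stand-ins for dt and vc
dt <- data.frame(fct = c("a", "b", "c", "b", "c"), X = 1:5,
                 stringsAsFactors = FALSE)
vc <- c("a", "c")

# Both forms build the same logical row index
by_in         <- dt[dt$fct %in% vc, ]
by_is_element <- dt[is.element(dt$fct, vc), ]

# Negate the test to keep the rows whose value is NOT in vc
not_in <- dt[!dt$fct %in% vc, ]

print(by_in)
print(not_in)
```

Rows come back in their original order, and each matching row appears once no matter how often its value occurs in vc.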
Subset dataframe rows based on character vector when %in% and which are not working
(Just adding my comment as an answer since it was posted before the other ones)
The problem is that in vec you have dots, whereas in df$Specimen.Label you have hyphens, so your first commands do not return anything. If you write instead
df[df$Specimen.Label %in% gsub("\\.", "-", vec),]
you obtain
# PCC Participant.ID Specimen.Label
# 3 PNNL 01CO008 8cc7e656-0152-4359-8566-0581c3
# 6 PNNL 05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
# 8 PNNL 11CO051 b3696374-c6c0-49dd-833e-596e26_D2
# 10 PNNL 11CO053 e1cd3d70-132b-452f-ba10-026721_D2
Another base R option is to use the function subset:
subset(df, Specimen.Label %in% gsub("\\.", "-", vec))
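A quick way to confirm this kind of mismatch is to test the raw vector first and then the normalised one. The labels below are shortened, made-up stand-ins for the real ones. Passing fixed = TRUE to gsub is an alternative to escaping the dot as "\\.".

```r
# Hypothetical, shortened labels for illustration
df <- data.frame(Participant.ID = c("P1", "P2", "P3"),
                 Specimen.Label = c("8cc7-0152", "f635-0046_D2", "b369-c6c0_D2"),
                 stringsAsFactors = FALSE)
vec <- c("8cc7.0152", "b369.c6c0_D2")

# Nothing matches while vec still contains dots
raw_hits <- any(df$Specimen.Label %in% vec)

# fixed = TRUE treats "." as a literal character, not a regex wildcard
fixed_vec <- gsub(".", "-", vec, fixed = TRUE)

matched <- df[df$Specimen.Label %in% fixed_vec, ]
print(raw_hits)
print(matched)
```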
Subsetting a dataframe based on a vector of strings
We can use duplicated to find the ID values that occur more than once and use those to subset the data:
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and select the rows whose ID occurs more than once.
This can be done in base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
dplyr
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
and data.table
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
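To see that the two base R variants select the same rows, here is a small sketch on a made-up Genetics data frame (the real one is not shown in the question):

```r
# Hypothetical data: IDs 2 and 4 occur more than once
Genetics <- data.frame(ID = c(1, 2, 2, 3, 4, 4, 4),
                       value = letters[1:7],
                       stringsAsFactors = FALSE)

# duplicated() flags the second and later occurrences,
# so unique() of those values gives every repeated ID once
dup_ids <- unique(Genetics$ID[duplicated(Genetics$ID)])

via_duplicated <- subset(Genetics, ID %in% dup_ids)

# ave() returns the group size for every row
via_ave <- subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)

print(dup_ids)
print(via_duplicated)
```

Note that duplicated() alone would drop the first occurrence of each repeated ID, which is why the answer goes through %in% instead of indexing with duplicated(ID) directly.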
Select rows of data frame based on a vector with duplicated values
Another method of doing the same without a loop:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
row_names <- split(1:nrow(sample_df),sample_df$y)
select_y = c(1,3,3)
row_num <- unlist(row_names[as.character(select_y)])
ans <- sample_df[row_num,]
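For readers who want to see the intermediate steps, here is the same idea with the values from the snippet above spelled out. The only changes are seq_len() instead of 1:nrow() and use.names = FALSE, which are my additions.

```r
sample_df <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))
select_y  <- c(1, 3, 3)

# split() groups the row numbers by the value of y:
# `1` -> rows 1,2   `2` -> rows 3,4   `3` -> rows 5,6
row_names <- split(seq_len(nrow(sample_df)), sample_df$y)

# Indexing the list by name returns a group once per occurrence in select_y,
# so the group for y == 3 is returned twice
row_num <- unlist(row_names[as.character(select_y)], use.names = FALSE)

ans <- sample_df[row_num, ]

# For comparison: %in% ignores how often a value occurs in select_y
no_repeats <- sample_df[sample_df$y %in% select_y, ]

print(ans)
print(nrow(no_repeats))
```

Here ans has six rows (x is 1, 2, 5, 6, 5, 6), while the %in% version has four.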
Simple and efficient way to subset a data frame using values and names in a vector
Personally, I wonder whether it is a good idea to use a named vector to subset a data frame, since it can only express equality (==), while "larger than" and "smaller than" cannot be expressed this way. I would recommend using a quoted expression instead of a named vector (see approach below).
However, I figured out a tidyverse way to write a function with said functionality:
library(tidyverse)
set.seed(123)
n <- 10
ds.df <- data.frame(col1 = round(rnorm(n, 2, 4), digits = 1),
                    col2 = sample.int(2, n, replace = TRUE),
                    col3 = sample.int(n * 10, n),
                    col4 = sample(letters, n, replace = TRUE))
new_filter <- function (data, expr) {
  # build one `column == value` expression per named element of expr
  exprs_ls <- purrr::imap(expr, ~ rlang::exprs(!! rlang::sym(.y) == !!.x))
  filter(data, !!! unname(unlist(exprs_ls)))
}
new_filter(ds.df, c(col1 = -0.2, col4 = "i"))
#> col1 col2 col3 col4
#> 1 -0.2 1 9 i
Created on 2020-06-17 by the reprex package (v0.3.0)
Below is my alternative approach.
In base R you can use quote to quote the subset expression (instead of creating a vector) and then use eval to evaluate it inside subset.
n <- 10
ds.df <- data.frame(col1 = round(rnorm(n, 2, 4), digits = 1),
                    col2 = sample.int(2, n, replace = TRUE),
                    col3 = sample.int(n * 10, n),
                    col4 = sample(letters, n, replace = TRUE))
subset_v = quote(col1 > 2 & col3 > 40)
subset(ds.df, eval(subset_v))
#> col1 col2 col3 col4
#> 1 6.6 1 93 m
#> 2 7.0 2 62 j
#> 4 3.9 1 94 t
#> 7 4.5 1 46 r
#> 8 2.8 2 98 h
#> 10 4.9 1 78 p
Same approach but using dplyr filter
library(dplyr)
n <- 10
ds.df <- data.frame(col1 = round(rnorm(n, 2, 4), digits = 1),
                    col2 = sample.int(2, n, replace = TRUE),
                    col3 = sample.int(n * 10, n),
                    col4 = sample(letters, n, replace = TRUE))
filter_v = expr(col1 > 2 & col3 > 40)
filter(ds.df, !! filter_v)
#> col1 col2 col3 col4
#> 1 3.3 1 70 a
#> 2 2.5 2 82 q
#> 3 3.6 1 51 z
Subset a data frame based on value pairs stored in independent ordered vectors
You could try match with an appropriate nomatch argument (the two nomatch values differ so that rows matching neither list do not compare as equal):
sub <- match(DATA$A, AList, nomatch=-1) == match(DATA$B, BList, nomatch=-2)
sub
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
DATA[sub,]
# A B Value
#1 1 6 9
#3 3 8 2
A paste-based approach would also be possible:
sub <- paste(DATA$A, DATA$B, sep=":") %in% paste(AList, BList, sep=":")
sub
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
DATA[sub,]
# A B Value
#1 1 6 9
#3 3 8 2
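Here is a small sketch showing why the nomatch values matter. DATA, AList and BList below are made up. The pairs to keep are (1, 6) and (3, 8), i.e. the values at the same position in the two lists.

```r
# Hypothetical data; the wanted pairs are (AList[i], BList[i])
DATA  <- data.frame(A = c(1, 1, 3, 4), B = c(6, 8, 8, 9), Value = c(9, 5, 2, 7))
AList <- c(1, 3)
BList <- c(6, 8)

# With the default nomatch = NA, rows missing from either list compare as NA
sub_default <- match(DATA$A, AList) == match(DATA$B, BList)

# Different nomatch values turn those NAs into a clean FALSE
sub_match <- match(DATA$A, AList, nomatch = -1) ==
  match(DATA$B, BList, nomatch = -2)

# The paste() variant compares the pairs as single strings
sub_paste <- paste(DATA$A, DATA$B, sep = ":") %in% paste(AList, BList, sep = ":")

print(sub_default)
print(sub_match)
print(DATA[sub_match, ])
```

Row 2 (A = 1, B = 8) is rejected by both variants because 1 and 8 sit at different positions in the lists. One caveat: match only returns the first position of a value, so if AList or BList contains repeated values, the paste-based variant is probably the safer choice.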
In R, how do you subset rows of a dataframe based on values in a vector
You can use Reduce to construct the OR condition:
subset(df, Reduce("|", lapply(df, `%in%`, L)))
# id1 id2
#2 B V
#10 C B
#11 F A
#14 A F
#19 E S
Or use rowSums to check whether any letter matches in each row:
subset(df, rowSums(sapply(df, `%in%`, L)) != 0)
# id1 id2
#2 B V
#10 C B
#11 F A
#14 A F
#19 E S
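Broken into steps, the two variants look roughly like this. The data frame and the vector L are made-up stand-ins for the ones in the question.

```r
# Hypothetical data, for illustration only
df <- data.frame(id1 = c("A", "B", "C", "D"),
                 id2 = c("X", "V", "B", "Y"),
                 stringsAsFactors = FALSE)
L <- c("B", "F")

# One logical vector per column: is the value in L?
hits <- lapply(df, `%in%`, L)

# Reduce() folds the per-column vectors together with element-wise OR
any_hit <- Reduce(`|`, hits)
via_reduce <- df[any_hit, ]

# sapply() gives a logical matrix here; rowSums() counts the hits per row
hit_matrix <- sapply(df, `%in%`, L)
via_rowsums <- df[rowSums(hit_matrix) != 0, ]

print(any_hit)
print(via_reduce)
```

The Reduce version scales to any number of columns and does not depend on sapply simplifying to a matrix, which it would not do for a one-row data frame.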
R: Subsetting rows based on vector
Try with match in base R:
with(df, B[match(sel, A)])
#[1] 5 1 5 5
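Unlike %in%, match returns positions, so the result follows the order of sel and repeats rows when sel repeats a value. A minimal sketch with made-up df and sel:

```r
# Hypothetical lookup table and selection vector
df  <- data.frame(A = c("x", "y", "z"), B = c(1, 3, 5),
                  stringsAsFactors = FALSE)
sel <- c("z", "x", "z", "z")

# Position of each element of sel within df$A
idx <- match(sel, df$A)

# Look up a single column, or take whole rows, in the order of sel
values <- df$B[idx]
rows   <- df[idx, ]

# Values that are absent from df$A give NA rather than being dropped
missing_idx <- match("q", df$A)

print(idx)
print(values)
```

If sel may contain values that are not in df$A, it is probably worth filtering out the NA positions before indexing.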