Subset Rows in a Data Frame Based on a Vector of Values


This will give you what you want:

eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]

The error in your second attempt occurs because you left out the comma (,).

For a data frame, object[index] with a single index selects columns, because a data frame is a list of columns. To subset rows and keep all columns, use the two-index form object[index_rows, index_cols]; index_cols can be left blank, which selects all columns by default.

You still need the comma, though: it is what tells R that the index refers to rows rather than columns.
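As a minimal sketch of the difference, the following uses a small made-up data frame. The names eg and missing_ids are stand-ins for eg2011 and bg2011missingFromBeg, whose contents are not shown here.

```r
# Hypothetical stand-ins for eg2011 and bg2011missingFromBeg
eg <- data.frame(ID = c(1, 2, 3, 4), score = c(10, 20, 30, 40))
missing_ids <- c(2, 4)

# One logical value per row: TRUE for rows whose ID is not in missing_ids
keep <- !eg$ID %in% missing_ids

# With the comma, the index applies to rows and all columns are kept
rows_kept <- eg[keep, ]

# Without a comma, a single index selects columns instead
second_col <- eg[2]
```

Here rows_kept holds the rows with ID 1 and 3, while second_col is a one-column data frame containing only score.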

Select rows from a data frame based on values in a vector

Have a look at ?"%in%".

dt[dt$fct %in% vc,]
#    fct X
# 1    a 2
# 3    c 3
# 5    c 5
# 7    a 7
# 9    c 9
# 10   a 1
# 12   c 2
# 14   c 4

You could also use ?is.element:

dt[is.element(dt$fct, vc),]
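Since dt and vc are not defined in this answer, here is a small self-contained sketch with made-up values that shows both forms returning the same rows, and how to negate the test.

```r
# Hypothetical data standing in for dt and vc
dt <- data.frame(fct = c("a", "b", "c", "a", "d"), X = 1:5,
                 stringsAsFactors = FALSE)
vc <- c("a", "c")

# Both calls build the same logical row index
by_in <- dt[dt$fct %in% vc, ]
by_is_element <- dt[is.element(dt$fct, vc), ]

# Negate the test to keep the rows whose value is NOT in vc
not_in <- dt[!dt$fct %in% vc, ]
```

With these values, by_in and by_is_element both contain rows 1, 3 and 4, and not_in contains rows 2 and 5.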

Subset dataframe rows based on character vector when %in% and which are not working

(Just adding my comment as an answer since it was posted before the other ones)

The problem is that in vec you have dots, whereas in df$Specimen.Label you have hyphens, so your first commands do not return anything. If you write instead

df[df$Specimen.Label %in% gsub("\\.", "-", vec),]

you obtain

#     PCC Participant.ID                    Specimen.Label
# 3  PNNL        01CO008    8cc7e656-0152-4359-8566-0581c3
# 6  PNNL        05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
# 8  PNNL        11CO051 b3696374-c6c0-49dd-833e-596e26_D2
# 10 PNNL        11CO053 e1cd3d70-132b-452f-ba10-026721_D2

Another base R option is the subset function:

subset(df, Specimen.Label %in% gsub("\\.", "-", vec))
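To check a separator mismatch like this on your own data, a sketch along the following lines may help. The labels are invented for illustration; fixed = TRUE is an alternative to escaping the dot in the regular expression.

```r
# Hypothetical data: labels use hyphens in the data frame, dots in the vector
df <- data.frame(id = 1:3,
                 Specimen.Label = c("aa-11", "bb-22", "cc-33"),
                 stringsAsFactors = FALSE)
vec <- c("aa.11", "cc.33")

# No label matches while the separators differ
n_before <- sum(df$Specimen.Label %in% vec)

# fixed = TRUE treats "." as a literal dot, so no regex escaping is needed
vec_fixed <- gsub(".", "-", vec, fixed = TRUE)

matched <- df[df$Specimen.Label %in% vec_fixed, ]
```

n_before is 0, which confirms the mismatch, and matched contains the rows with id 1 and 3 once the dots have been replaced.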

Subsetting a dataframe based on a vector of strings

We can use duplicated to find the IDs that occur more than once and use them to subset the data:

subset(Genetics, ID %in% unique(ID[duplicated(ID)]))

Another approach is to count the rows per ID and keep the rows whose ID occurs more than once.

This can be done in base R:

subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)

with dplyr:

library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)

and with data.table:

library(data.table)
setDT(Genetics)[, .SD[.N > 1], by = ID]
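The Genetics data is not shown, so the following base R sketch uses a made-up data frame to illustrate that the duplicated-based and count-based filters select the same rows.

```r
# Hypothetical stand-in for the Genetics data frame
Genetics <- data.frame(ID = c("a", "b", "a", "c", "b", "d"),
                       value = 1:6,
                       stringsAsFactors = FALSE)

# IDs that appear at least twice
dup_ids <- unique(Genetics$ID[duplicated(Genetics$ID)])

# Filter 1: keep rows whose ID is among the repeated IDs
via_duplicated <- subset(Genetics, ID %in% dup_ids)

# Filter 2: attach the per-ID row count to every row and keep counts above 1
via_ave <- subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
```

Both results contain rows 1, 2, 3 and 5. Note that duplicated alone would flag only the second and later occurrences, which is why the %in% step is needed to pick up the first occurrence as well.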

Select rows of data frame based on a vector with duplicated values

Another method of doing the same without a loop:

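sample_df <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))

# Row numbers grouped by the value of y
row_names <- split(seq_len(nrow(sample_df)), sample_df$y)

# Values to select; repeated values are allowed
select_y <- c(1, 3, 3)

# Look up the row numbers for each selected value, keeping repeats
row_num <- unlist(row_names[as.character(select_y)])

ans <- sample_df[row_num, ]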
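To see why this handles repeated values, it can help to walk through the intermediate results. The sketch below reuses the same data; rows_by_y is just a renamed copy of the split result, and use.names = FALSE is added only to keep the row index free of names.

```r
sample_df <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))
select_y <- c(1, 3, 3)

# split() returns a list keyed by the values of y:
# "1" -> 1 2, "2" -> 3 4, "3" -> 5 6
rows_by_y <- split(seq_len(nrow(sample_df)), sample_df$y)

# Indexing the list by name repeats the "3" entry, so its rows appear twice
row_num <- unlist(rows_by_y[as.character(select_y)], use.names = FALSE)

ans <- sample_df[row_num, ]
```

row_num is 1 2 5 6 5 6, so ans has six rows, with the rows for y == 3 appearing twice. A plain %in% filter would return each matching row only once.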

Simple and efficient way to subset a data frame using values and names in a vector

Personally, I wonder whether it is a good idea to use a named vector to subset a data frame, since it can only express equality (==); greater-than and less-than comparisons cannot be expressed this way. I would recommend using a quoted expression instead of a named vector (see approach below).

However, I figured out a tidyverse way to write a function with said functionality:

library(tidyverse)

set.seed(123)
n <- 10

ds.df <- data.frame(col1 = round(rnorm(n, 2, 4), digits = 1),
                    col2 = sample.int(2, n, replace = TRUE),
                    col3 = sample.int(n * 10, n),
                    col4 = sample(letters, n, replace = TRUE))

# Build one `column == value` expression per named element, then splice them into filter()
new_filter <- function (data, expr) {
  exprs_ls <- purrr::imap(expr, ~ rlang::exprs(!! rlang::sym(.y) == !!.x))
  filter(data, !!! unname(unlist(exprs_ls)))
}

new_filter(ds.df, c(col1 = -0.2, col4 = "i"))
#> col1 col2 col3 col4
#> 1 -0.2 1 9 i

Created on 2020-06-17 by the reprex package (v0.3.0)
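For comparison, the same equality-only filtering can be sketched in base R. This is not the function from the answer above: filter_equal is a hypothetical helper, and it takes a named list rather than a named vector, because c(col1 = -0.2, col4 = "i") coerces every value to character.

```r
# Hypothetical helper: keep rows where every named column equals the given value
filter_equal <- function(data, conditions) {
  keep <- rep(TRUE, nrow(data))
  for (col in names(conditions)) {
    # Combine the per-column equality tests with a logical AND
    keep <- keep & data[[col]] == conditions[[col]]
  }
  data[keep, , drop = FALSE]
}

# Small made-up data frame for illustration
ds <- data.frame(col1 = c(-0.2, 1.5, -0.2),
                 col4 = c("i", "i", "k"),
                 stringsAsFactors = FALSE)

res <- filter_equal(ds, list(col1 = -0.2, col4 = "i"))
```

Only the first row satisfies both conditions. The sketch shares the limitation described above: it can test for equality only, and it assumes the columns contain no missing values.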



Below is my alternative approach.
In base R you can use quote to quote the subset expression (instead of creating a vector) and then you can use eval to evaluate it inside subset.

n <- 10

ds.df <- data.frame(col1 = round(rnorm(n, 2, 4), digits = 1),
                    col2 = sample.int(2, n, replace = TRUE),
                    col3 = sample.int(n * 10, n),
                    col4 = sample(letters, n, replace = TRUE))


subset_v = quote(col1 > 2 & col3 > 40)

subset(ds.df, eval(subset_v))
#> col1 col2 col3 col4
#> 1 6.6 1 93 m
#> 2 7.0 2 62 j
#> 4 3.9 1 94 t
#> 7 4.5 1 46 r
#> 8 2.8 2 98 h
#> 10 4.9 1 78 p

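Because the data above is random, the rows returned differ from run to run. A deterministic sketch of the same quote/eval idea, using made-up values and passing the data frame to eval explicitly, might look like this:

```r
# Small fixed data frame so the result is reproducible
ds <- data.frame(col1 = c(6.6, 1.0, 3.9), col3 = c(93, 50, 20))

# Store the condition unevaluated
cond <- quote(col1 > 2 & col3 > 40)

# Evaluate the stored condition with the data frame's columns in scope
keep <- eval(cond, envir = ds)

res <- ds[keep, ]

# Unlike a named vector, the stored condition can use any comparison
cond_lt <- quote(col3 < 60)
res_lt <- ds[eval(cond_lt, envir = ds), ]
```

Only the first row passes the first condition, while rows 2 and 3 pass the second.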



The same approach using dplyr's filter:

library(dplyr)

n <- 10

ds.df <- data.frame(col1 = round(rnorm(n, 2, 4), digits = 1),
                    col2 = sample.int(2, n, replace = TRUE),
                    col3 = sample.int(n * 10, n),
                    col4 = sample(letters, n, replace = TRUE))

filter_v = expr(col1 > 2 & col3 > 40)

filter(ds.df, !! filter_v)

#> col1 col2 col3 col4
#> 1 3.3 1 70 a
#> 2 2.5 2 82 q
#> 3 3.6 1 51 z


Subset a data frame based on value pairs stored in independent ordered vectors

You could try match with an appropriate nomatch argument:

sub <- match(DATA$A, AList, nomatch=-1) == match(DATA$B, BList, nomatch=-2)
sub
# [1] TRUE FALSE TRUE FALSE FALSE FALSE

DATA[sub,]
# A B Value
#1 1 6 9
#3 3 8 2

A paste-based approach would also be possible:

sub <- paste(DATA$A, DATA$B, sep=":") %in% paste(AList, BList, sep=":")
sub
# [1] TRUE FALSE TRUE FALSE FALSE FALSE

DATA[sub,]
# A B Value
#1 1 6 9
#3 3 8 2
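The original DATA, AList and BList are not shown, so the sketch below uses invented values to illustrate why the two nomatch values differ: a row where neither column matches yields -1 and -2, which are never equal, so it is not selected.

```r
# Hypothetical data; the pairs of interest are (1, 6) and (3, 8)
DATA <- data.frame(A = c(1, 2, 3, 1), B = c(6, 7, 8, 8), Value = c(9, 4, 2, 5))
AList <- c(1, 3)
BList <- c(6, 8)

# Position of each value in its list; non-matches get distinct sentinels
pos_a <- match(DATA$A, AList, nomatch = -1)   # 1 -1 2 1
pos_b <- match(DATA$B, BList, nomatch = -2)   # 1 -2 2 2

# A row is a valid pair only if both values sit at the same position
sub_match <- pos_a == pos_b

# The paste-based version compares the combined keys directly
sub_paste <- paste(DATA$A, DATA$B, sep = ":") %in% paste(AList, BList, sep = ":")

result <- DATA[sub_match, ]
```

Both approaches select rows 1 and 3 here. Row 4 has values that appear in both lists, but at different positions, so it is rejected. One difference to be aware of: match returns only the first position of a value, so if AList or BList contains repeated values the match version can miss valid pairs that the paste version finds.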

In R, how do you subset rows of a dataframe based on values in a vector

You can use Reduce to construct the OR condition:

subset(df, Reduce("|", lapply(df, `%in%`, L)))

# id1 id2
#2 B V
#10 C B
#11 F A
#14 A F
#19 E S

Or use rowSums to check if there is any letter matching in each row:

subset(df, rowSums(sapply(df, `%in%`, L)) != 0)

# id1 id2
#2 B V
#10 C B
#11 F A
#14 A F
#19 E S
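As a rough illustration with made-up data (the original df and L are not shown), both conditions can be built outside subset and compared directly:

```r
# Hypothetical data standing in for df and L
df <- data.frame(id1 = c("A", "B", "C"),
                 id2 = c("X", "V", "B"),
                 stringsAsFactors = FALSE)
L <- c("B", "F")

# One logical vector per column, combined element-wise with OR
keep_reduce <- Reduce(`|`, lapply(df, `%in%`, L))

# Logical matrix (rows x columns); a row qualifies if any entry is TRUE
keep_rowsums <- rowSums(sapply(df, `%in%`, L)) != 0

result <- df[keep_reduce, ]
```

Row 2 matches through id1 and row 3 through id2, so both conditions evaluate to FALSE TRUE TRUE and result holds rows 2 and 3.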

R: Subsetting rows based on vector

Try with match in base R:

with(df, B[match(sel, A)])

#[1] 5 1 5 5
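The data behind this output is not shown. A small sketch with invented values illustrates how match differs from %in% here: match returns one position per element of sel, so the result follows the order and repetitions of sel.

```r
# Hypothetical data: A is the lookup key, B the value to retrieve
df <- data.frame(A = c("x", "y", "z"), B = c(1, 5, 3),
                 stringsAsFactors = FALSE)
sel <- c("y", "x", "y", "y")

# One lookup per element of sel, in the order given
picked <- with(df, B[match(sel, A)])

# %in% filters the rows of df instead, ignoring the order and repeats in sel
filtered <- df$B[df$A %in% sel]
```

picked is 5 1 5 5, whereas filtered is 1 5. Elements of sel that do not occur in A would produce NA in picked.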

