R Subset with Condition Using %in% or ==: Which One Should Be Used?

R subset with condition using %in% or ==. Which one should be used?

You should use the first one, %in%. The == version only appeared to work because, in the example dataset, the values happened to line up with how c("A", "D") gets recycled. What == is actually comparing against is

rep(c("A", "D"), length.out= nrow(x))
# 1] "A" "D" "A" "D" "A" "D" "A" "D" "A" "D"

x$v==rep(c("A", "D"), length.out= nrow(x))# only because of coincidence
#[1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE

subset(x, v == c("D","A"))
#[1] u v
#<0 rows> (or 0-length row.names)

That empty result comes about because the comparison actually being made is

 x$v==rep(c("D", "A"), length.out= nrow(x))
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

whereas %in% works as expected:

subset(x, v %in% c("D","A"))
# u v
#1 1 A
#4 4 D
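
The x used in these examples is not shown in the question; a hypothetical reconstruction that reproduces the outputs above would be:

# Hypothetical x (assumption, not the original data): u is 1:10 and v runs
# through the letters A-J, so only rows 1 and 4 contain "A" or "D"
x <- data.frame(u = 1:10, v = LETTERS[1:10])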

How to combine multiple conditions to subset a data frame using OR?

my.data.frame <- subset(data , V1 > 2 | V2 < 4)

An alternative solution that mimics the behavior of subset() and is more appropriate for use inside a function body:

new.data <- data[ which( data$V1 > 2 | data$V2 < 4) , ]

Some people criticize the use of which as unnecessary, but it does prevent NA values from producing unwanted rows in the result. The equivalent of the two options demonstrated above without which (i.e. not returning NA rows for any NAs in V1 or V2) would be:

 new.data <- data[ !is.na(data$V1 | data$V2) & ( data$V1 > 2 | data$V2 < 4)  , ]

Note: I want to thank the anonymous contributor who attempted to fix the error in the code immediately above, a fix that got rejected by the moderators. There was actually an additional error that I noticed while correcting the first one. The conditional clause that checks for NA values needs to come first if it is to be handled as I intended, since ...

> NA & 1
[1] NA
> 0 & NA
[1] FALSE

Order of arguments may matter when using '&'.
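
As a small illustration of the NA behaviour, here is a made-up data frame with an NA in V1 (the column names match the ones above, the values are invented):

dat <- data.frame(V1 = c(1, 3, NA, 5), V2 = c(9, 1, 8, 2))

dat[dat$V1 > 2 | dat$V2 < 4, ]           # the NA row comes back as a row of NAs
dat[which(dat$V1 > 2 | dat$V2 < 4), ]    # which() silently drops the NA row
subset(dat, V1 > 2 | V2 < 4)             # subset() drops it as well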

Subsetting in R using OR condition with strings

First of all (as Jonathan noted in his comment), to reference the second column you should use either data[[2]] or data[,2]. But if you are using subset you can use the column name: subset(data, CompanyName == ...).

And for your question I would do one of:

subset(data, data[[2]] %in% c("Company Name 09", "Company Name"), drop = TRUE) 
subset(data, grepl("^Company Name", data[[2]]), drop = TRUE)

In the second one I use grepl (introduced in R version 2.9), which returns a logical vector with TRUE for matches.
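
For instance, with a small made-up data frame whose second column holds the company names (an assumption about the shape of the data):

data <- data.frame(id = 1:4,
                   CompanyName = c("Company Name", "Company Name 09",
                                   "Other Firm", "Company Name 12"))

subset(data, CompanyName %in% c("Company Name 09", "Company Name"))  # exact matches only
subset(data, grepl("^Company Name", CompanyName))                    # also matches "Company Name 12"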

When should I use which for subsetting?


Since this question is specifically about subsetting, I thought I would
illustrate some of the performance benefits of using which() over a
logical subset brought up in the linked question.

When you want to extract the entire subset, there is not much difference in
processing speed, but using which() needs to allocate less memory. However, if
you only want a part of the subset (e.g. to showcase some strange findings),
which() has a significant speed and memory advantage, because it lets you avoid
subsetting the data frame twice: you subset the result of which() instead.

Here are the benchmarks:

df <- ggplot2::diamonds; dim(df)
#> [1] 53940 10
mu <- mean(df$price)

bench::press(
  n = c(sum(df$price > mu), 10),
  {
    i <- seq_len(n)
    bench::mark(
      logical = df[df$price > mu, ][i, ],
      which_1 = df[which(df$price > mu), ][i, ],
      which_2 = df[which(df$price > mu)[i], ]
    )
  }
)
#> Running with:
#>       n
#> 1 19657
#> 2    10
#> # A tibble: 6 x 11
#>   expression     n      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>      <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 logical    19657    1.5ms   1.81ms   1.71ms   3.39ms      553.     5.5MB
#> 2 which_1    19657   1.41ms   1.61ms   1.56ms   2.41ms      620.    2.89MB
#> 3 which_2    19657 826.56us 934.72us 910.88us   1.41ms     1070.    1.76MB
#> 4 logical       10 893.12us   1.06ms   1.02ms   1.93ms      941.    4.21MB
#> 5 which_1       10  814.4us 944.81us 908.16us   1.78ms     1058.    1.69MB
#> 6 which_2       10 230.72us 264.45us 249.28us   1.08ms     3781.  498.34KB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

Created on 2018-08-19 by the reprex package (v0.2.0).

Subset dataframe by multiple logical conditions of rows to remove

The ! should be around the outside of the statement:

data[!(data$v1 %in% c("b", "d", "e")), ]

  v1 v2 v3 v4
1  a  v  d  c
2  a  v  d  d
5  c  k  d  c
6  c  r  p  g
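
If you want to reproduce this, here is a hypothetical data consistent with the rows kept above (the values in the removed rows 3 and 4 are invented):

data <- data.frame(v1 = c("a", "a", "b", "d", "c", "c"),
                   v2 = c("v", "v", "y", "y", "k", "r"),
                   v3 = c("d", "d", "d", "d", "d", "p"),
                   v4 = c("c", "d", "c", "c", "c", "g"))

data[!(data$v1 %in% c("b", "d", "e")), ]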

Subset / filter rows in a data frame based on a condition in a column

Here are the two main approaches. I prefer this one for its readability:

bar <- subset(foo, location == "there")

Note that you can string together many conditionals with & and | to create complex subsets.

The second is the indexing approach. You can index rows in R with either numeric or logical vectors. foo$location == "there" returns a vector of TRUE/FALSE values with the same length as the number of rows of foo, and you can use it to return only the rows where the condition is TRUE.

foo[foo$location == "there", ]
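
As a quick illustration of combining conditions (foo is not shown in the question, so this one is made up):

foo <- data.frame(location = c("here", "there", "there"),
                  value    = c(1, 5, 2))

subset(foo, location == "there" & value > 3)    # subset() style
foo[foo$location == "there" & foo$value > 3, ]  # indexing style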

Subset data based on conditional statement

With dplyr, you can find the first index where the condition is met and, for each group, keep the rows up to (and including) that point.

library(dplyr)

df %>%
  group_by(id) %>%
  filter(row_number() <= which(D1 == 0 & D2 == 0 | D2 == 1)[1])

#      id     A    D1    D2
#   <dbl> <dbl> <dbl> <dbl>
# 1     1     3     0     1
# 2     2     5     1     0
# 3     2     4     1     0
# 4     2     3     0     1
# 5     3     9     0     0

The above works assuming that at least one row in each group satisfies the condition. For the general case, where there might be groups in which no row satisfies the condition and we want to keep all the rows of such groups, we can use:

df %>%
  group_by(id) %>%
  slice({
    inds <- which(D1 == 0 & D2 == 0 | D2 == 1)[1]
    if (!is.na(inds)) -((inds + 1):n()) else seq_len(n())
  })
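
The df from the question is not shown; a hypothetical version that reproduces the output above would be:

# Hypothetical df (assumption), consistent with the output shown above
df <- data.frame(id = c(1, 1, 2, 2, 2, 2, 3),
                 A  = c(3, 7, 5, 4, 3, 8, 9),
                 D1 = c(0, 1, 1, 1, 0, 0, 0),
                 D2 = c(1, 0, 0, 0, 1, 0, 0))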

R: subset dataframe by two conditions based on one column

You'll need to compute a grouped summary to achieve this. That is, you want
to find out for each loc if all of the areas in that location are > 0.
I have always found base R a bit awkward for grouped statistics, but here's
one way to achieve that.
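
The dd data frame is not shown in the question; a hypothetical version consistent with the outputs below would be:

# Hypothetical dd (assumption): locations b and c each contain a non-positive area
dd <- data.frame(loc  = factor(rep(c("a", "b", "c", "d"), each = 2)),
                 type = factor(rep(c("npr", "buff"), times = 4)),
                 area = c(10, 20, 0, 15, -1, 3, 5, 5))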

First, use tapply() to determine for each loc whether it should be
included or not:

(include <- tapply(dd$area, dd$loc, function(x) all(x > 0)))
#>     a     b     c     d 
#>  TRUE FALSE FALSE  TRUE

Then we can use loc values to index that result to get a vector suitable
to subset dd with:

include[dd$loc]
#>     a     a     b     b     c     c     d     d 
#>  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE

dd[include[dd$loc], ]
#>   loc type area
#> 1   a  npr   10
#> 2   a buff   20
#> 7   d  npr    5
#> 8   d buff    5

We can also put these steps together inside a subset() call to avoid
creating extra variables:

subset(dd, tapply(area, loc, function(x) all(x > 0))[loc])
#>   loc type area
#> 1   a  npr   10
#> 2   a buff   20
#> 7   d  npr    5
#> 8   d buff    5

Alternatively, you could use dplyr:

library(dplyr)

dd %>%
  group_by(loc) %>%
  filter(all(area > 0))
#> # A tibble: 4 x 3
#> # Groups:   loc [2]
#>   loc   type   area
#>   <fct> <fct> <dbl>
#> 1 a     npr      10
#> 2 a     buff     20
#> 3 d     npr       5
#> 4 d     buff      5

Created on 2018-07-25 by the reprex package (v0.2.0.9000).

Subset with condition in data table

You can do:

library(data.table)
tmp[, .SD[!(id1 == max(id1) & time > 2)], user_id]

#    user_id id1 time
# 1:       1   1    1
# 2:       1   1    2
# 3:       1   1    3
# 4:       1   1    4
# 5:       1   3    1
# 6:       1   3    2
# 7:       2   2    1
# 8:       2   2    2
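
This keeps, within each user_id, every row except those where id1 equals the group's maximum id1 and time is greater than 2. The tmp data isn't shown in the question; a hypothetical version that would produce the output above is:

# Hypothetical tmp (assumption): the rows (1, 3, 3) and (2, 2, 3) are the ones removed
tmp <- data.table(user_id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                  id1     = c(1, 1, 1, 1, 3, 3, 3, 2, 2, 2),
                  time    = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3))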

