R subset with condition using %in% or ==. Which one should be used?
You should use the first one, %in%. The == version returned the expected rows only because, in the example dataset, the matching values happened to fall in the same positions as the recycled A, D pattern. Here, == is comparing v against:
rep(c("A", "D"), length.out= nrow(x))
#[1] "A" "D" "A" "D" "A" "D" "A" "D" "A" "D"
x$v==rep(c("A", "D"), length.out= nrow(x))# only because of coincidence
#[1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
subset(x, v == c("D","A"))
#[1] u v
#<0 rows> (or 0-length row.names)
This returns no rows because, with the order reversed, the recycled comparison is FALSE everywhere:
x$v==rep(c("D", "A"), length.out= nrow(x))
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
whereas %in% works regardless of the order of the values:
subset(x, v %in% c("D","A"))
# u v
#1 1 A
#4 4 D
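To see the recycling concretely, here is a minimal sketch. The data frame x below is made up for illustration and only mimics the shape of the one in the question (column v holds "A" in row 1 and "D" in row 4).

```r
# Hypothetical data: v holds "A" in row 1 and "D" in row 4.
x <- data.frame(u = 1:10,
                v = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
                stringsAsFactors = FALSE)

# == recycles c("A", "D") to A, D, A, D, ... and compares position by position.
eq_rows <- which(x$v == c("A", "D"))

# Reversing the order shifts the recycled pattern, so nothing lines up.
eq_rev_rows <- which(x$v == c("D", "A"))

# %in% tests set membership, so the order of the lookup values is irrelevant.
in_rows <- which(x$v %in% c("D", "A"))

print(eq_rows)      # rows matched by == with c("A", "D")
print(eq_rev_rows)  # rows matched by == with c("D", "A")
print(in_rows)      # rows matched by %in%
```

With == the match depends on where the values sit in the column; with %in% it depends only on the values themselves.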
How to combine multiple conditions to subset a data-frame using OR ?
my.data.frame <- subset(data , V1 > 2 | V2 < 4)
An alternative solution that mimics the behavior of this function and would be more appropriate for inclusion within a function body:
new.data <- data[ which( data$V1 > 2 | data$V2 < 4) , ]
Some people criticize the use of which as not needed, but it does prevent NA values from throwing back unwanted results. The equivalent of the two options demonstrated above without which (i.e. not returning NA rows for any NAs in V1 or V2) would be:
new.data <- data[ !is.na(data$V1 > 2 | data$V2 < 4) & ( data$V1 > 2 | data$V2 < 4) , ]
Note: I want to thank the anonymous contributor who attempted to fix the error in the code immediately above, a fix that was rejected by the moderators. There was an additional error that I noticed while correcting the first one. The clause that checks for NA values has to evaluate to FALSE for every row where the condition itself is NA, since ...
> NA & 1
[1] NA
> 0 & NA
[1] FALSE
With '&', a FALSE operand gives FALSE even when the other operand is NA, whereas a TRUE operand leaves the NA in place.
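The difference between the approaches shows up only when the condition contains NA. Here is a small sketch with made-up data (the values in V1 and V2 are assumptions chosen to produce NA in the condition):

```r
# Hypothetical data with missing values in both columns.
data <- data.frame(V1 = c(1, 3, NA, NA, 1),
                   V2 = c(5, 5, 5, 1, NA))

# Condition per row: FALSE, TRUE, NA, TRUE, NA
cond <- data$V1 > 2 | data$V2 < 4

# Plain logical indexing returns an all-NA row for every NA in the condition.
with_na <- data[cond, ]

# which() keeps only the positions that are TRUE, so the NA rows are dropped.
via_which <- data[which(cond), ]

# Same result without which(): mask the NA entries explicitly.
via_mask <- data[!is.na(cond) & cond, ]

# subset() also treats NA in the condition as FALSE.
via_subset <- subset(data, V1 > 2 | V2 < 4)

print(nrow(with_na))
print(rownames(via_which))
```

Storing the condition in a variable first avoids writing it twice and makes the NA mask easier to get right.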
Subsetting in R using OR condition with strings
First of all (as Jonathan did in his comment), to reference the second column you should use either data[[2]] or data[,2]. But if you are using subset you can use the column name: subset(data, CompanyName == ...).
For your question, I would do one of:
subset(data, data[[2]] %in% c("Company Name 09", "Company Name"), drop = TRUE)
subset(data, grepl("^Company Name", data[[2]]), drop = TRUE)
In the second one I use grepl (introduced with R version 2.9), which returns a logical vector with TRUE for each match.
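A minimal sketch of the two options side by side, on made-up data. The column name CompanyName and the company names themselves are assumptions for illustration:

```r
# Hypothetical data; the column name CompanyName is an assumption.
data <- data.frame(id = 1:4,
                   CompanyName = c("Company Name 09", "Other Corp",
                                   "Company Name", "Company Name 10"),
                   stringsAsFactors = FALSE)

# Exact matches against a fixed set of names.
exact <- subset(data, CompanyName %in% c("Company Name 09", "Company Name"))

# Pattern match: every name that starts with "Company Name".
prefix <- subset(data, grepl("^Company Name", CompanyName))

print(exact$id)
print(prefix$id)
```

The %in% form keeps only the names listed, while the grepl form also picks up any other name sharing the prefix, so the two are not interchangeable.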
When should I use which for subsetting?
Since this question is specifically about subsetting, I thought I would illustrate some of the performance benefits of using which() over a logical subset, as brought up in the linked question.
When you want to extract the entire subset, there is not much difference in processing speed, but using which() needs to allocate less memory. However, if you only want a part of the subset (e.g. to showcase some strange findings), which() has a significant speed and memory advantage, because you can subset the result of which() instead of subsetting the data frame twice.
Here are the benchmarks:
df <- ggplot2::diamonds; dim(df)
#> [1] 53940 10
mu <- mean(df$price)
bench::press(
  n = c(sum(df$price > mu), 10),
  {
    i <- seq_len(n)
    bench::mark(
      logical = df[df$price > mu, ][i, ],
      which_1 = df[which(df$price > mu), ][i, ],
      which_2 = df[which(df$price > mu)[i], ]
    )
  }
)
#> Running with:
#> n
#> 1 19657
#> 2 10
#> # A tibble: 6 x 11
#> expression n min mean median max `itr/sec` mem_alloc
#> <chr> <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 logical 19657 1.5ms 1.81ms 1.71ms 3.39ms 553. 5.5MB
#> 2 which_1 19657 1.41ms 1.61ms 1.56ms 2.41ms 620. 2.89MB
#> 3 which_2 19657 826.56us 934.72us 910.88us 1.41ms 1070. 1.76MB
#> 4 logical 10 893.12us 1.06ms 1.02ms 1.93ms 941. 4.21MB
#> 5 which_1 10 814.4us 944.81us 908.16us 1.78ms 1058. 1.69MB
#> 6 which_2 10 230.72us 264.45us 249.28us 1.08ms 3781. 498.34KB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>
Created on 2018-08-19 by the reprex package (v0.2.0).
Subset dataframe by multiple logical conditions of rows to remove
The ! should be around the outside of the statement:
data[!(data$v1 %in% c("b", "d", "e")), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g
Subset / filter rows in a data frame based on a condition in a column
Here are the two main approaches. I prefer this one for its readability:
bar <- subset(foo, location == "there")
Note that you can string together many conditionals with & and | to create complex subsets.
The second is the indexing approach. You can index rows in R with either numeric or logical vectors. foo$location == "there" returns a vector of TRUE and FALSE values with one element per row of foo. You can use it to return only the rows where the condition is TRUE.
foo[foo$location == "there", ]
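As a quick check that the approaches agree, here is a sketch on a made-up foo (the id column and the location values are assumptions):

```r
# Hypothetical data frame.
foo <- data.frame(id = 1:4,
                  location = c("here", "there", "here", "there"),
                  stringsAsFactors = FALSE)

# Logical vector with one element per row of foo.
keep <- foo$location == "there"

by_subset  <- subset(foo, location == "there")  # subset() approach
by_logical <- foo[keep, ]                       # logical indexing
by_numeric <- foo[which(keep), ]                # numeric indexing

# Conditions can be combined with & and |.
combined <- subset(foo, location == "there" & id > 2)

print(keep)
print(by_subset$id)
print(combined$id)
```

All three forms select the same rows here because the condition contains no NA values.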
Subset data based on conditional statement
In dplyr, you can find the first index where the condition is met and, for each group, select the rows up to and including that row.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(row_number() <= which(D1 == 0 & D2 == 0 | D2 == 1)[1])
# id A D1 D2
# <dbl> <dbl> <dbl> <dbl>
#1 1 3 0 1
#2 2 5 1 0
#3 2 4 1 0
#4 2 3 0 1
#5 3 9 0 0
The above works assuming that at least one row in each group satisfies the condition. For the general case, where some groups may have no row that satisfies the condition and we want to keep all rows of such a group, we can use:
df %>%
  group_by(id) %>%
  slice({
    inds <- which(D1 == 0 & D2 == 0 | D2 == 1)[1]
    # keep rows up to the first match, or the whole group if there is none
    if (!is.na(inds)) seq_len(inds) else seq_len(n())
  })
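For readers without dplyr, the same idea can be sketched in base R. This is an illustrative alternative, not the approach from the answer; the df below is made up (its extra rows and the group with id 4 are assumptions), and the sketch assumes D1 and D2 contain no NA values.

```r
# Hypothetical data; group 4 has no row that meets the condition.
df <- data.frame(id = c(1, 1, 2, 2, 2, 2, 3, 3, 4, 4),
                 A  = c(3, 7, 5, 4, 3, 8, 9, 2, 6, 1),
                 D1 = c(0, 1, 1, 1, 0, 1, 0, 1, 1, 1),
                 D2 = c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0))

cond <- with(df, D1 == 0 & D2 == 0 | D2 == 1)

# Per group, count how many matches occur strictly before each row.
prior_hits <- ave(as.integer(cond), df$id, FUN = function(z) cumsum(z) - z)

# Keep rows with no earlier match: everything up to and including the
# first match, or the whole group when nothing matches.
res <- df[prior_hits == 0, ]
print(res)
```

Counting earlier matches handles the "no match in this group" case without a special branch, since the count stays at zero for every row of such a group.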
R: subset dataframe by two conditions based on one column
You'll need to compute a grouped summary to achieve this. That is, you want to find out for each loc if all of the areas in that location are > 0.
I have always found base R a bit awkward for grouped statistics, but here's one way to achieve that.
First, use tapply() to determine for each loc whether it should be included or not:
(include <- tapply(dd$area, dd$loc, function(x) all(x > 0)))
#> a b c d
#> TRUE FALSE FALSE TRUE
Then we can use the loc values to index that result to get a vector suitable for subsetting dd:
include[dd$loc]
#> a a b b c c d d
#> TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
dd[include[dd$loc], ]
#> loc type area
#> 1 a npr 10
#> 2 a buff 20
#> 7 d npr 5
#> 8 d buff 5
We can also put these steps together inside a subset() call to avoid creating extra variables:
subset(dd, tapply(area, loc, function(x) all(x > 0))[loc])
#> loc type area
#> 1 a npr 10
#> 2 a buff 20
#> 7 d npr 5
#> 8 d buff 5
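Another base R option, shown here only as a sketch, is ave(), which returns the group-level result already expanded to one value per row. The dd below is made up: the areas for loc b and c are assumptions, chosen so that each of those groups contains a non-positive value.

```r
# Hypothetical data; rows for "a" and "d" match the output shown above.
dd <- data.frame(loc  = rep(c("a", "b", "c", "d"), each = 2),
                 type = rep(c("npr", "buff"), times = 4),
                 area = c(10, 20, 0, 5, 5, 0, 5, 5),
                 stringsAsFactors = FALSE)

# ave() applies the function per group and recycles the result to every
# row of that group; the logical result is stored as 1/0 in a numeric vector.
keep <- as.logical(ave(dd$area, dd$loc, FUN = function(x) all(x > 0)))
res_ave <- dd[keep, ]

# tapply() version, indexing the result by name rather than by factor code.
include <- tapply(dd$area, dd$loc, function(x) all(x > 0))
res_tapply <- dd[include[as.character(dd$loc)], ]

print(res_ave)
```

Indexing include with as.character(dd$loc) looks the groups up by name, which works whether loc is stored as a character vector or as a factor.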
Alternatively, you could use dplyr:
library(dplyr)
dd %>%
  group_by(loc) %>%
  filter(all(area > 0))
#> # A tibble: 4 x 3
#> # Groups: loc [2]
#> loc type area
#> <fct> <fct> <dbl>
#> 1 a npr 10
#> 2 a buff 20
#> 3 d npr 5
#> 4 d buff 5
Created on 2018-07-25 by the reprex package (v0.2.0.9000).
Subset with condition in data table
You can do:
library(data.table)
tmp[, .SD[!(id1 == max(id1) & time > 2)], user_id]
# user_id id1 time
#1: 1 1 1
#2: 1 1 2
#3: 1 1 3
#4: 1 1 4
#5: 1 3 1
#6: 1 3 2
#7: 2 2 1
#8: 2 2 2