Difference Between Subset and Filter from Dplyr


They do, indeed, produce the same result, and they are very similar in concept.

The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (6 times faster in your example, but that's measured in microseconds).

As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 records, filter outpaces subset by about 300 microseconds. And at 153,000 records, filter is three times faster (measured in milliseconds).

So in terms of human time, I don't think there's much difference between the two.

The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
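To illustrate the out-of-memory point, here is a minimal sketch of filter running against a database table, assuming the dbplyr and RSQLite packages are installed (the table name and connection are illustrative):

```r
library(dplyr)
library(dbplyr)

# An in-memory SQLite database stands in for a real remote database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "airquality", airquality)

air_db <- tbl(con, "airquality")   # a lazy reference, not an in-memory data frame

result <- air_db %>%
  filter(Temp > 80, Month > 5)     # translated to SQL and executed in the database

show_query(result)                 # inspect the generated SQL
collect(result)                    # only now are the filtered rows pulled into R
DBI::dbDisconnect(con)
```

subset has no equivalent of this: it needs an ordinary data frame already in memory.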

Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

library(dplyr)
library(microbenchmark)

# Original example
microbenchmark(
  df1 <- subset(airquality, Temp > 80 & Month > 5),
  df2 <- filter(airquality, Temp > 80 & Month > 5)
)

Unit: microseconds
expr min lq mean median uq max neval cld
subset 95.598 107.7670 118.5236 119.9370 125.949 167.443 100 a
filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997 100 b

# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows

microbenchmark(
  df1 <- subset(air, Temp > 80 & Month > 5),
  df2 <- filter(air, Temp > 80 & Month > 5)
)

Unit: microseconds
expr min lq mean median uq max neval cld
subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392 100 b
filter 968.586 985.4475 1056.686 1023.862 1036.765 2489.644 100 a

# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows

microbenchmark(
  df1 <- subset(air, Temp > 80 & Month > 5),
  df2 <- filter(air, Temp > 80 & Month > 5)
)

Unit: milliseconds
expr min lq mean median uq max neval cld
subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659 100 b
filter 5.046148 5.169164 10.27829 5.387484 6.738167 65.38937 100 a

What is difference between subset function and filter function in R?

I'm not sure if this is the whole story, but it seems that inside filter you can't combine $ and [] in the same expression, as in interval == impute[1,]$interval. Instead you could try:

library(dplyr)

x <- which(colnames(impute) == "interval")
impute[1, ]$steps <- filter(steps_per_interval,
                            interval == impute[1, x])[, 2]

Subsetting in R vs filter(from dplyr) giving different results

If there are NAs, make sure to account for the NA elements with is.na; otherwise filter will drop those rows by default:

library(dplyr)
filter(house2, (datetime >= "2007-02-01 00:00:00" &
datetime <= "2007-02-03 00:00:00")|
is.na(datetime))

According to ?filter

The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.
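A small example makes the difference in the quoted documentation concrete (column names here are illustrative):

```r
library(dplyr)

df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

df[df$x > 1, ]         # base `[`: the NA condition yields a row of NAs
subset(df, x > 1)      # subset() silently drops the NA row
filter(df, x > 1)      # filter() also drops the NA row

filter(df, x > 1 | is.na(x))   # keep the NA rows explicitly
```

So subset and filter agree on NA handling; it is bare `[` subsetting that behaves differently, which is what the datetime example above is guarding against.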

How to use select() inside between() inside filter() to subset data dplyr r

Combine multiple conditions using & -

library(dplyr)

data %>%
  filter(SiteID == "A" & between(Seconds, 2, 8) |
         SiteID == "B" & between(Seconds, 3, 6) |
         SiteID == "C" & between(Seconds, 8, 10) |
         SiteID == "D" & between(Seconds, 1, 6) |
         SiteID == "E" & between(Seconds, 2, 9))

In R: subset or dplyr::filter with variable from vector

You can use df[,"a"] or df[,1]:

df <- data.frame(a = LETTERS[1:4], b = rnorm(4))
vals <- c("B","D")

dplyr::filter(df, df[,1] %in% vals)
# a b
# 2 B 0.4481627
# 4 D 0.2916513

subset(df, df[,1] %in% vals)
# a b
# 2 B 0.4481627
# 4 D 0.2916513

dplyr::filter(df, df[,"a"] %in% vals)
# a b
# 2 B 0.4481627
# 4 D 0.2916513

subset(df, df[,"a"] %in% vals)
# a b
# 2 B 0.4481627
# 4 D 0.2916513

Working with dplyr::tbl_df(df)

Some magic with lazyeval::interp helps us!

df <- dplyr::tbl_df(df)
expr <- lazyeval::interp(quote(x %in% y), x = as.name(names(df)[1]), y = vals)

df %>% filter_(expr)
# Source: local data frame [2 x 2]
#
# a b
# 1 B 0.4481627
# 2 D 0.2916513
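Note that filter_() and the lazyeval package are deprecated in current dplyr. The same filter-by-column-position can be written with the .data pronoun or rlang::sym(); a sketch, assuming dplyr >= 1.0:

```r
library(dplyr)

df <- tibble(a = LETTERS[1:4], b = rnorm(4))
vals <- c("B", "D")

col <- names(df)[1]             # select the column by position

# The .data pronoun looks the name up inside the data frame
df %>% filter(.data[[col]] %in% vals)

# Equivalently, unquote an explicit symbol
df %>% filter(!!rlang::sym(col) %in% vals)
```

Both forms avoid the `df[,1]` workaround and stay within dplyr's tidy-evaluation framework.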

How to use or/and in dplyr to subset a data.frame

dplyr solution:

load library:

library(dplyr)

filter with condition as above:

df %>% filter(A == 1 & B == 3 | A == 3 & B == 2)

Why is `[` better than `subset`?

This question was answered well in the comments by @James, pointing to an excellent explanation by Hadley Wickham of the dangers of subset (and functions like it) [here]. Go read it!

It's a somewhat long read, so it may be helpful to record here the example that Hadley uses that most directly addresses the question of "what can go wrong?":

Hadley suggests the following example: suppose we want to subset and then reorder a data frame using the following functions:

scramble <- function(x) x[sample(nrow(x)), ]

subscramble <- function(x, condition) {
  scramble(subset(x, condition))
}

subscramble(mtcars, cyl == 4)

This returns the error:

Error in eval(expr, envir, enclos) : object 'cyl' not found

because R no longer "knows" where to find the object called 'cyl'. He also points out the truly bizarre stuff that can happen if by chance there is an object called 'cyl' in the global environment:

cyl <- 4
subscramble(mtcars, cyl == 4)

cyl <- sample(10, 100, rep = T)
subscramble(mtcars, cyl == 4)

(Run them and see for yourself, it's pretty crazy.)
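For completeness, one modern fix is to pass the condition through explicitly with rlang's curly-curly operator, which dplyr's filter understands; a sketch, assuming dplyr >= 1.0:

```r
library(dplyr)

scramble <- function(x) x[sample(nrow(x)), ]

# {{ condition }} forwards the unevaluated expression into filter(),
# so it is evaluated in the context of `x`, not the calling environment
subscramble <- function(x, condition) {
  scramble(filter(x, {{ condition }}))
}

subscramble(mtcars, cyl == 4)   # works regardless of what `cyl` means globally
```

This is exactly the scoping machinery that subset lacks, which is why Hadley recommends `[` (or dplyr verbs) over subset inside functions.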
