Why Is Subsetting on a "Logical" Type Slower Than Subsetting on a "Numeric" Type?

Subsetting does not work on numeric values

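The issue results from round-off error: decimal values such as -100.7 are stored only approximately as floating-point numbers, so an exact == comparison can fail. Instead, compare Lon and Lat against the target values within a tolerance. In base R you can use abs(x - y) < 1e-5: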

t2m.all |>
  subset(abs(Lon - -100.7) < 1e-5 & abs(Lat - 59.6) < 1e-5)

#      Lon  Lat X1958.01.01.00.00.00
# 1 -100.7 59.6                -32.9

The dplyr equivalent is near():

library(dplyr)

t2m.all %>%
  filter(near(Lon, -100.7) & near(Lat, 59.6))

#      Lon  Lat X1958.01.01.00.00.00
# 1 -100.7 59.6                -32.9
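
To see why the exact comparison fails, here is a minimal base R sketch. The data frame df and its values are made up for illustration; only the tolerance idiom comes from the answer above.

```r
# Hypothetical data: 0.1 + 0.2 is not stored as exactly 0.3 in double precision
df <- data.frame(Lon = c(0.1 + 0.2, 0.5), Lat = c(59.6, 60.0))

# Exact comparison misses the first row
exact <- df[df$Lon == 0.3, ]

# Comparison within a tolerance finds it
tol <- df[abs(df$Lon - 0.3) < 1e-5, ]

nrow(exact)
nrow(tol)

# all.equal() is another base R option for a single pair of values
isTRUE(all.equal(0.1 + 0.2, 0.3))
```

The tolerance 1e-5 is a judgment call; pick one that is smaller than the spacing of your coordinates.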

Logical condition while subsetting not giving correct values

From ?base::Logic (also reachable via help('&'), help('|'), etc.):

See Syntax for the precedence of these operators: unlike many other languages (including S) the AND and OR operators do not have the same precedence (the AND operators have higher precedence than the OR operators).

which explains why

TRUE | TRUE & FALSE
# [1] TRUE

which is essentially

TRUE | (TRUE & FALSE)

which is also true, and a simplification of what you are doing here:

(project$DC31==1&project$D14==2) |
(project$DC31==2&project$D14==1) &
!is.na(project$DC31) &
!is.na(project$D14) &
project$ROLL.NO. == 3131

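I assume you expect the result to contain only rows where project$ROLL.NO. == 3131. Because & binds more tightly than |, the is.na() and ROLL.NO. tests are attached only to the second parenthesized condition. Any row that satisfies the first condition (DC31 == 1 and D14 == 2) is therefore returned regardless of its ROLL.NO. value. Wrap the two OR-ed conditions in one outer pair of parentheses so that the remaining tests apply to both.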

Also note that ! has higher precedence than & and |.
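
As a rough illustration, the sketch below compares the original condition with a parenthesized version. The project data frame here is invented (four rows with hypothetical values); only the column names come from the question.

```r
# Hypothetical data with the same column names as in the question
project <- data.frame(
  ROLL.NO. = c(3131, 3131, 4000, 4000),
  DC31     = c(1, 2, 1, NA),
  D14      = c(2, 1, 2, 1)
)

# Original form: the trailing & tests bind only to the second condition
loose <- (project$DC31 == 1 & project$D14 == 2) |
  (project$DC31 == 2 & project$D14 == 1) &
  !is.na(project$DC31) &
  !is.na(project$D14) &
  project$ROLL.NO. == 3131

# Parenthesized form: the trailing tests apply to both OR-ed conditions
strict <- ((project$DC31 == 1 & project$D14 == 2) |
             (project$DC31 == 2 & project$D14 == 1)) &
  !is.na(project$DC31) &
  !is.na(project$D14) &
  project$ROLL.NO. == 3131

which(loose)   # includes row 3, whose ROLL.NO. is 4000
which(strict)  # only rows with ROLL.NO. 3131
```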

Most efficient way of subsetting vectors

Benchmarking on a vector of length 10M indicates that (on my machine) the latter approach, subsetting once and reusing the result, is faster:

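# f1 subsets inline, so x[y] is evaluated twice
f1 = function(x, y) {
  sub.mean <- mean(x[y])
  sub.var <- var(x[y])
}

# f2 subsets once and reuses the stored result
f2 = function(x, y) {
  sub <- x[y]
  sub.mean <- mean(sub)
  sub.var <- var(sub)
  sub <- NULL
}

x = rnorm(10000000)
# y holds 0/1 integers, so x[y] indexes by position (zeros are dropped and
# x[1] is repeated); use as.logical(y) to treat y as a TRUE/FALSE mask
y = rbinom(10000000, 1, .5)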

print(system.time(f1(x, y)))
# user system elapsed
# 0.403 0.037 0.440
print(system.time(f2(x, y)))
# user system elapsed
# 0.233 0.002 0.235

This isn't surprising: mean(x[y]) has to create a new object for mean() to act on, even though that object is never bound to a name in the function's environment. Thus, f1 is slower because it performs the subsetting twice (as you surmised).
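
One detail worth checking when reproducing a benchmark like this: rbinom() returns 0/1 integers, and R treats an integer index as positions rather than as a mask. The small sketch below (toy vectors, not the benchmark data) shows the difference between the two index types.

```r
x <- c(10, 20, 30, 40)
y <- c(1L, 0L, 1L, 0L)

# Integer index: zeros are dropped and each 1 selects x[1]
by_position <- x[y]

# Logical index: keeps the elements where the mask is TRUE
by_mask <- x[as.logical(y)]

# Equivalent positional form of the mask
by_which <- x[which(y == 1L)]

by_position
by_mask
by_which
```

Whichever index type you use, the point of the answer stands: store the subset once if you need it more than once.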

Subsetting a named numeric for top N values in R

Does this solve your problem?

library(tidyverse)
#install.packages("isotree")
library(isotree)

set.seed(1)

m <- 100

n <- 2

X <- matrix(rnorm(m * n), nrow = m)

# ADD CLEAR OUTLIER TO THE DATA
X <- rbind(X, c(3, 3))

# TRAIN AN ISOLATION FOREST MODEL
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)

# MAKE A PREDICTION TO SCORE EACH ROW
pred <- predict(iso, X)

X[which.max(pred), ]
#> [1] 3 3

# Perhaps this?
data.frame(X, "pred" = pred) %>%
  slice_max(order_by = pred, n = 3)
#>          X1         X2      pred
#> 1  3.000000  3.0000000 0.7306871
#> 2 -1.523567 -1.4672500 0.6496666
#> 3 -2.214700 -0.6506964 0.5982211

# Or maybe this?
data.frame(X, "pred" = pred) %>%
  slice_max(order_by = X1, n = 3)
#>         X1        X2      pred
#> 1 3.000000 3.0000000 0.7306871
#> 2 2.401618 0.4251004 0.5014570
#> 3 2.172612 0.2075383 0.4811756

Created on 2022-04-06 by the reprex package (v2.0.1)
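
If the scores are held in a plain named numeric vector, a base R alternative is to sort and take the head. This is only a sketch; the vector scores and its names are hypothetical stand-ins for the predictions above.

```r
# Hypothetical named numeric vector of scores
scores <- c(a = 0.2, b = 0.9, c = 0.5, d = 0.7)

# Sort in decreasing order and keep the first N elements (names are preserved)
top2 <- head(sort(scores, decreasing = TRUE), 2)

top2
names(top2)
```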

How do I subset my data.frame by field type (e.g., numeric, character)?

[Since it worked, I'm posting my comment as an answer to this:]

Try lapply(DATA, class) to see the class of each column.
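
Once the column classes are known, one possible way to do the actual subsetting is to build a logical vector over the columns with sapply(). The DATA frame below is a made-up example, not the asker's data.

```r
# Hypothetical data frame with mixed column types
DATA <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Keep only numeric columns (integer columns count as numeric)
num_cols <- DATA[, sapply(DATA, is.numeric), drop = FALSE]

# Keep only character columns
chr_cols <- DATA[, sapply(DATA, is.character), drop = FALSE]

names(num_cols)
names(chr_cols)
```

drop = FALSE keeps the result a data.frame even when only one column matches.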

Subsetting a table in R

Subset the data before running table(), for example:

ftable(table(mtcars[, c("cyl", "gear", "vs")]))
#          vs  0  1
# cyl gear
# 4   3        0  1
#     4        0  8
#     5        1  1
# 6   3        0  2
#     4        2  2
#     5        1  0
# 8   3       12  0
#     4        0  0
#     5        2  0

# subset then run table
ftable(table(mtcars[mtcars$gear == 4, c("cyl", "gear", "vs")]))
#          vs 0 1
# cyl gear
# 4   4       0 8
# 6   4       2 2
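
An alternative, sketched here as an option rather than a recommendation, is to build the full table first and then index it by dimension name. Note that this keeps levels with zero counts (cyl 8 here), which subsetting the data first does not.

```r
# Full three-way table of counts
tab <- table(mtcars[, c("cyl", "gear", "vs")])

# Index the gear dimension by name; the result is a cyl-by-vs table of counts
gear4 <- tab[, "4", ]

gear4
```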

How to subset specific values from whole data.frame without defining every column?

Here is a faster method than apply using max.col, matrix subsetting, and logical subsetting.
First, construct a sample dataset.

set.seed(1234)
dat <- data.frame(a=sample(1:3, 5, replace=TRUE),
b=sample(1:4, 5, replace=TRUE),
c=sample(1:6, 5, replace=TRUE))

It looks like this.

dat
  a b c
1 1 3 5
2 2 1 4
3 2 1 2
4 2 3 6
5 3 3 2

Notice that only the third column has values greater than 4, and that only two elements of that column pass the test. Now, we do

dat[dat[cbind(seq_along(dat[[1]]), max.col(dat))] > 4, ]
  a b c
1 1 3 5
4 2 3 6

Here, max.col(dat) returns, for each row, the index of the column that holds the maximum value. seq_along(dat[[1]]) runs through the row numbers. cbind combines the two into a two-column matrix of (row, column) pairs, which we use to pull out the maximum value of each row via matrix subsetting. Comparing these maxima with > 4 returns a logical vector whose length is the number of rows of the data.frame, and this vector is used to subset the data.frame by row.
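
The one-liner can be unpacked into named steps. The sketch below hard-codes the sample values shown above instead of calling sample(), and passes ties.method = "first" to max.col() so the chosen column is deterministic; both choices are assumptions made for illustration.

```r
# Same values as the printed sample data, entered directly
dat <- data.frame(a = c(1, 2, 2, 2, 3),
                  b = c(3, 1, 1, 3, 3),
                  c = c(5, 4, 2, 6, 2))

m <- as.matrix(dat)

# (row, column) pairs pointing at each row's maximum
idx <- cbind(seq_len(nrow(m)), max.col(m, ties.method = "first"))

# Matrix subsetting pulls out one value per row: the row maximum
row_max <- m[idx]

# Logical vector over rows, then row subsetting
keep <- row_max > 4
result <- dat[keep, ]

result
```

Ties only affect which column index is picked, not the maximum value itself, so the default random tie-breaking gives the same rows.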


