Subsetting does not work in numeric values
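The issue results from round-off errors: floating-point values that print as -100.7 or 59.6 are not necessarily stored as exactly those numbers, so == can fail. You could set a tolerance when comparing Lon and Lat against a target value. In base R you could use abs(x - y) < 1e-5 to achieve it: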
t2m.all |>
subset(abs(Lon - -100.7) < 1e-5 & abs(Lat - 59.6) < 1e-5)
# Lon Lat X1958.01.01.00.00.00
# 1 -100.7 59.6 -32.9
The dplyr equivalent is near():
library(dplyr)
t2m.all %>%
filter(near(Lon, -100.7) & near(Lat, 59.6))
# Lon Lat X1958.01.01.00.00.00
# 1 -100.7 59.6 -32.9
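To see why exact comparison fails, here is a minimal base R sketch. The helper name is_close and the tolerance 1e-5 are illustrative assumptions, not part of the original answer; pick a tolerance that suits the precision of your coordinates.

```r
# Classic round-off example: the sum is not stored as exactly 0.3
x <- 0.1 + 0.2
exact_match <- x == 0.3

# Hypothetical helper: compare within an absolute tolerance
is_close <- function(a, b, tol = 1e-5) abs(a - b) < tol
tolerant_match <- is_close(x, 0.3)

# all.equal() is the base R tool for approximate equality;
# wrap it in isTRUE() because it returns a message on mismatch
all_equal_match <- isTRUE(all.equal(x, 0.3))

print(c(exact = exact_match, tolerant = tolerant_match,
        all_equal = all_equal_match))
```

Note that all.equal() compares whole objects, so for row-wise filtering the vectorised abs(x - y) < tol form (or dplyr::near()) is usually the better fit.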
Logical condition while subsetting not giving correct values
From ?base::Logic, help('&'), help('|'), etc.:
"See Syntax for the precedence of these operators: unlike many other languages (including S) the AND and OR operators do not have the same precedence (the AND operators have higher precedence than the OR operators)."
which explains why
TRUE | TRUE & FALSE
# [1] TRUE
which is essentially
TRUE | (TRUE & FALSE)
which is also true, and a simplification of what you are doing here:
(project$DC31==1&project$D14==2) |
(project$DC31==2&project$D14==1) &
!is.na(project$DC31) &
!is.na(project$D14) &
project$ROLL.NO. == 3131
I assume you expect the result to contain only rows where project$ROLL.NO. == 3131. Because & binds more tightly than |, the !is.na() checks and the ROLL.NO. test attach only to the second alternative. Any row that satisfies the first alternative is therefore returned even when its ROLL.NO. is not 3131.
Also note that ! has a higher precedence than & and |.
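A minimal sketch of the fix, assuming the intent is "either combination, and no missing values, and ROLL.NO. 3131": wrap the two OR alternatives in one extra pair of parentheses. The toy values below are hypothetical; only the column names come from the question.

```r
# Toy data (hypothetical values; column names follow the question)
project <- data.frame(
  DC31     = c(1, 2, 1, NA),
  D14      = c(2, 1, 2, 1),
  ROLL.NO. = c(3131, 3131, 9999, 3131)
)

# Original form: the trailing conditions bind only to the second alternative
ungrouped <- (project$DC31 == 1 & project$D14 == 2) |
  (project$DC31 == 2 & project$D14 == 1) &
  !is.na(project$DC31) &
  !is.na(project$D14) &
  project$ROLL.NO. == 3131

# Grouped form: the OR is evaluated first, then every AND applies to it
grouped <- ((project$DC31 == 1 & project$D14 == 2) |
  (project$DC31 == 2 & project$D14 == 1)) &
  !is.na(project$DC31) &
  !is.na(project$D14) &
  project$ROLL.NO. == 3131

print(project[ungrouped, ])  # includes the row with ROLL.NO. 9999
print(project[grouped, ])    # only rows with ROLL.NO. 3131
```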
Most efficient way of subsetting vectors
Benchmarking on a vector of length 10M indicates that (on my machine) the latter approach, subsetting once and reusing the result (f2 below), is faster than subsetting inside each call (f1):
f1 = function(x, y) {
sub.mean <- mean(x[y])
sub.var <- var(x[y])
}
f2 = function(x, y) {
sub <- x[y]
sub.mean <- mean(sub)
sub.var <- var(sub)
sub <- NULL
}
x = rnorm(10000000)
y = rbinom(10000000, 1, .5)
print(system.time(f1(x, y)))
# user system elapsed
# 0.403 0.037 0.440
print(system.time(f2(x, y)))
# user system elapsed
# 0.233 0.002 0.235
This isn't surprising: mean(x[y]) does have to create a new object for the mean function to act on, even if it doesn't add it to the local namespace. Thus, f1 is slower for having to do the subsetting twice (as you surmised).
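The "subset once, reuse" pattern can be sketched as a small function that returns both statistics. The name subset_stats is a hypothetical helper, and this sketch assumes a logical mask: a 0/1 numeric vector like the y in the benchmark is treated by R as positions (zeros are dropped, ones select the first element), not as a mask, so it is converted with == 1 here.

```r
# Hypothetical helper: subset once, then reuse the result for both statistics
subset_stats <- function(x, keep) {
  sub <- x[keep]                       # single subsetting step
  c(mean = mean(sub), var = var(sub))  # both computed from the same copy
}

set.seed(42)
x <- rnorm(1000)
keep <- rbinom(1000, 1, 0.5) == 1      # logical mask rather than 0/1 numbers

stats <- subset_stats(x, keep)
print(stats)
```

Timings will differ by machine and R version, so it is worth re-running system.time() on your own data before relying on the difference.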
Subsetting a named numeric for top N values in R
Does this solve your problem?
library(tidyverse)
#install.packages("isotree")
library(isotree)
set.seed(1)
m <- 100
n <- 2
X <- matrix(rnorm(m * n), nrow = m)
# ADD CLEAR OUTLIER TO THE DATA
X <- rbind(X, c(3, 3))
# TRAIN AN ISOLATION FOREST MODEL
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
# MAKE A PREDICTION TO SCORE EACH ROW
pred <- predict(iso, X)
X[which.max(pred), ]
#> [1] 3 3
# Perhaps this?
data.frame(X, "pred" = pred) %>%
slice_max(order_by = pred, n = 3)
#> X1 X2 pred
#> 1 3.000000 3.0000000 0.7306871
#> 2 -1.523567 -1.4672500 0.6496666
#> 3 -2.214700 -0.6506964 0.5982211
# Or maybe this?
data.frame(X, "pred" = pred) %>%
slice_max(order_by = X1, n = 3)
#> X1 X2 pred
#> 1 3.000000 3.0000000 0.7306871
#> 2 2.401618 0.4251004 0.5014570
#> 3 2.172612 0.2075383 0.4811756
Created on 2022-04-06 by the reprex package (v2.0.1)
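If the scores are held in a plain named numeric vector rather than a data frame, a base R sketch of "top N" could look like this. The vector scores and its values are made up for illustration; sort() keeps the names attached to the values.

```r
# Hypothetical named numeric vector of scores
scores <- c(a = 0.42, b = 0.91, c = 0.17, d = 0.73, e = 0.58)

# Sort in decreasing order (names travel with the values), keep the first 3
top3 <- head(sort(scores, decreasing = TRUE), 3)
print(top3)

# The names of the top entries can then be used to index other objects
print(names(top3))
```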
How do I subset my data.frame by field type (e.g., numeric, character)?
[Since it worked, I'm posting my comment as an answer to this:]
Try lapply(DATA,class)
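Once the classes are known, the columns themselves can be selected by type. A minimal sketch, assuming a small made-up data frame named DATA as in the answer (the columns id, name, and score are hypothetical):

```r
# Hypothetical data frame with mixed column types
DATA <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Inspect the class of every column
print(lapply(DATA, class))

# sapply() returns one TRUE/FALSE per column, used here to pick columns
num_cols <- DATA[sapply(DATA, is.numeric)]
chr_cols <- DATA[sapply(DATA, is.character)]

print(names(num_cols))
print(names(chr_cols))
```

Integer columns count as numeric for is.numeric(), so id is kept alongside score.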
Subsetting a table in R
Subset the data before running table, for example:
ftable(table(mtcars[, c("cyl", "gear", "vs")]))
#          vs  0  1
# cyl gear
# 4   3        0  1
#     4        0  8
#     5        1  1
# 6   3        0  2
#     4        2  2
#     5        1  0
# 8   3       12  0
#     4        0  0
#     5        2  0
# subset then run table
ftable(table(mtcars[ mtcars$gear == 4, c("cyl", "gear", "vs")]))
#          vs 0 1
# cyl gear
# 4   4       0 8
# 6   4       2 2
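As an alternative sketch (not from the original answer), an existing table can also be indexed by its dimension names instead of being rebuilt from subsetted data. The variable names tab and gear4 are illustrative. Unlike subsetting the data first, this keeps rows for combinations with zero counts.

```r
# Build the full three-way table once
tab <- table(mtcars[, c("cyl", "gear", "vs")])

# Dimnames are character strings, so index the gear dimension with "4";
# the result is a cyl-by-vs slice for gear == 4
gear4 <- tab[, "4", ]
print(gear4)

# Individual cells can be looked up the same way
print(tab["4", "4", "1"])  # cars with cyl 4, gear 4, vs 1
```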
How to subset specific values from whole data.frame without defining every column?
Here is a faster method than apply, using max.col, matrix subsetting, and logical subsetting.
First, construct a sample dataset.
set.seed(1234)
dat <- data.frame(a=sample(1:3, 5, replace=TRUE),
b=sample(1:4, 5, replace=TRUE),
c=sample(1:6, 5, replace=TRUE))
It looks like this.
dat
a b c
1 1 3 5
2 2 1 4
3 2 1 2
4 2 3 6
5 3 3 2
Notice that only the third column has values greater than 4 and that only 2 such elements in the column pass the test. Now, we do
dat[dat[cbind(seq_along(dat[[1]]), max.col(dat))] > 4, ]
a b c
1 1 3 5
4 2 3 6
Here, max.col(dat) returns, for each row, the index of the column holding the maximum value. seq_along(dat[[1]]) runs through the row numbers. cbind combines the two into a two-column matrix of (row, column) pairs, which we use to pull out the maximum value of each row via matrix subsetting. Comparing these values with > 4 returns a logical vector whose length is the number of rows of the data.frame. This is used to subset the data.frame by row.
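The same steps can be sketched one at a time on a hard-coded copy of the data shown above, so the result does not depend on the random seed. The intermediate names idx, row_max, and keep are illustrative. By default max.col breaks ties at random; ties.method = "first" is used here only to make the column indices reproducible, since the maximum value itself is the same either way.

```r
# Hard-coded copy of the sample data printed above
dat <- data.frame(a = c(1, 2, 2, 2, 3),
                  b = c(3, 1, 1, 3, 3),
                  c = c(5, 4, 2, 6, 2))

# (row, column) pairs pointing at each row's maximum
idx <- cbind(seq_len(nrow(dat)), max.col(dat, ties.method = "first"))

# Matrix subsetting pulls out one value per row
row_max <- dat[idx]

# Logical vector: does the row contain any value above 4?
keep <- row_max > 4
res <- dat[keep, ]
print(res)

# Cross-check against the slower apply() version
print(all(row_max == apply(dat, 1, max)))
```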