Subset by Multiple Ranges

Subset by multiple ranges

Using data.table's non-equi join:

values[range, on = .(value >= start, value <= end), .(results = x.value)]

which gives:

    results
1: 6
2: 7
3: 8
4: 9
5: 10
6: 29
7: 30
8: 31
9: 32
10: 33
11: 34
12: 35
13: 87
14: 88
15: 89
16: 90
17: 91
18: 92
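
The answer does not show its inputs, so here is a minimal, self-contained sketch of the same non-equi join. The contents of values and range are assumptions chosen to be consistent with the printed output (the integers 1 to 100 and three inclusive ranges); they are not the original poster's data.

```r
library(data.table)

# Assumed inputs (not shown in the original answer)
values <- data.table(value = 1:100)
range  <- data.table(start = c(6L, 29L, 87L), end = c(10L, 35L, 92L))

# For each row of range, keep the rows of values with start <= value <= end.
# x.value refers to the value column of values (the "x" table of the join).
res <- values[range, on = .(value >= start, value <= end), .(results = x.value)]
res$results
```

Naming a table range masks nothing in practice, since R still finds the base function range() when it is called, but a more distinctive name may be easier to read in longer scripts.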

Or, as suggested by @Henrik: values[value %inrange% range]. This also works well on data.tables with multiple columns:

# create new data
set.seed(26042017)
values2 <- data.table(value = c(1:100), let = sample(letters, 100, TRUE), num = sample(100))

> values2[value %inrange% range]
value let num
1: 6 v 70
2: 7 f 77
3: 8 u 21
4: 9 x 66
5: 10 g 58
6: 29 f 7
7: 30 w 48
8: 31 c 50
9: 32 e 5
10: 33 c 8
11: 34 y 19
12: 35 s 97
13: 87 j 80
14: 88 o 4
15: 89 h 65
16: 90 c 94
17: 91 k 22
18: 92 g 46
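
To make explicit what %inrange% tests, here is a base R sketch of the same check: a value is kept when it lies inside at least one [start, end] interval, bounds included. The range values are assumed (consistent with the output above), and the helper name in_any_range is hypothetical.

```r
# Assumed ranges, consistent with the output shown above
range <- data.frame(start = c(6, 29, 87), end = c(10, 35, 92))

# Hypothetical helper: TRUE where x lies in at least one closed interval
in_any_range <- function(x, start, end) {
  # One logical vector per range, then combine them with a vectorised OR
  hits <- Map(function(lo, hi) x >= lo & x <= hi, start, end)
  Reduce(`|`, hits)
}

x <- 1:100
kept <- x[in_any_range(x, range$start, range$end)]
kept
```

This is only meant as a readable cross-check; for large inputs the data.table operator is likely the better choice.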

Subset the data frame based on multiple ranges and save each range as an element in a list

You can split the data frame according to levels obtained by cutting df$x by range$start. You don't even need a loop for this:

nlist <- split(df, cut(df$x, breaks = c(-Inf, range$start, Inf)))

Or, if you want it in the same format (an unnamed list in reverse order), you can do:

nlist <- setNames(rev(split(df, cut(df$x, breaks=c(-Inf, range$start, Inf)))),NULL)

Summing the row counts with Reduce also gives the expected total:

Reduce('+', lapply(nlist, nrow))
#> [1] 34
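
A small sketch of the split/cut approach on made-up data may help show what ends up in each list element. The data frame and range starts below are hypothetical; only the total of 34 rows mirrors the answer.

```r
# Hypothetical data: 34 rows and three range starts
df    <- data.frame(x = 1:34, y = 34:1)
range <- data.frame(start = c(10, 20, 30))

# One list element per interval: (-Inf,10], (10,20], (20,30], (30,Inf]
nlist <- split(df, cut(df$x, breaks = c(-Inf, range$start, Inf)))

# Rows per interval and their total
counts <- unname(sapply(nlist, nrow))
counts
sum(counts)
```

Note that cut() builds right-closed intervals by default, so a value equal to a break falls into the lower bin; pass right = FALSE if each start value should open its own interval instead.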

How to create subsets of multiple date ranges in R

You can loop over the indices of the date ranges and filter the data for each one:

library(dplyr)

for (i in seq_along(date_ranges$start_dates)) {
  print(
    df %>%
      filter(between(df_date, date_ranges$start_dates[i], date_ranges$end_dates[i]))
  )
}
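
If the subsets are needed later rather than just printed, one option is to collect them in a list. The following is a base R sketch; the data are invented, and only the column names (df_date, start_dates, end_dates) follow the answer.

```r
# Hypothetical data: 60 consecutive days and two date ranges
df <- data.frame(df_date = as.Date("2020-01-01") + 0:59, value = 1:60)
date_ranges <- data.frame(
  start_dates = as.Date(c("2020-01-05", "2020-02-01")),
  end_dates   = as.Date(c("2020-01-10", "2020-02-15"))
)

# One data frame per range, both bounds inclusive
subsets <- lapply(seq_len(nrow(date_ranges)), function(i) {
  in_range <- df$df_date >= date_ranges$start_dates[i] &
              df$df_date <= date_ranges$end_dates[i]
  df[in_range, ]
})

sapply(subsets, nrow)
```

Each element of subsets can then be processed or written out separately.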

Subset data frame in R based on matching multiple ranges for multiple variables

You can use outer to calculate all pairwise differences between df and realdata, and then check whether both the x and y differences are below the tolerance:

tolerance <- .10

# x
xx <- abs(outer(df$x, realdata$x, "-")) < tolerance
# y
yy <- abs(outer(df$y, realdata$y, "-")) < tolerance

# if both are within the tolerance the sum of xx and yy will be 2
(mat <- xx + yy > 1)
# [,1] [,2] [,3]
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE FALSE
#[4,] FALSE FALSE TRUE
#[5,] FALSE FALSE FALSE
#[6,] FALSE FALSE FALSE

Each column of mat corresponds to one row of realdata and shows which rows of df are within the tolerance of it (for the first column, that is the first row of df).

A rather inelegant way to return the matching rows of df, in the order of the rows of realdata:

lapply(1:ncol(mat), function(i) df[mat[,i], ])

# return all matched data
df[row(mat)[mat], ]
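
For a runnable version, here is a sketch with invented df and realdata that reproduces the same TRUE/FALSE pattern as the matrix above. It combines the two tests with & (equivalent to xx + yy > 1) and uses which(..., arr.ind = TRUE) to list the matching index pairs.

```r
# Hypothetical data; rows 1, 2 and 4 of df lie close to rows 1, 2 and 3 of realdata
df <- data.frame(x = c(1.05, 2.02, 3, 4.01, 5, 6),
                 y = c(10.03, 20.05, 30, 40.02, 50, 60))
realdata <- data.frame(x = c(1, 2, 4), y = c(10, 20, 40))
tolerance <- 0.10

# TRUE where both the x and the y difference are below the tolerance
mat <- abs(outer(df$x, realdata$x, "-")) < tolerance &
       abs(outer(df$y, realdata$y, "-")) < tolerance

# One row per match: "row" indexes df, "col" indexes realdata
matches <- which(mat, arr.ind = TRUE)
matches
df[matches[, "row"], ]
```

Because outer builds a full nrow(df) by nrow(realdata) matrix, this approach is best suited to data of moderate size.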

R subset by multiple columns

If I understand your explanation and the expected output correctly, you are looking for something like this:

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(ifelse(Sex == 'M' & between(Age, 6, 11),
                between(Score, 34, 100), TRUE)) %>%
  ungroup()

# ID Sex Age Score
# <int> <chr> <dbl> <int>
#1 1 M 4.2 19
#2 1 M 4.8 21
#3 2 F 6.1 23
#4 2 F 6.7 45
#5 3 F 9.4 39
#6 5 M 10 56

between(Score, 34, 100) is only checked when Sex is 'M' and Age is between 6 and 11; all other rows are kept unconditionally.
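
The same rule can be written in base R as a logical implication: keep a row unless it belongs to the restricted group and fails the score check. This is a sketch on invented data; the row with ID 4 is an assumed example of a row that gets dropped and does not come from the original question.

```r
# Hypothetical data; the sixth row (ID 4) is male, aged 6 to 11, with a low score
df <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 4, 5),
  Sex   = c("M", "M", "F", "F", "F", "M", "M"),
  Age   = c(4.2, 4.8, 6.1, 6.7, 9.4, 7.5, 10),
  Score = c(19, 21, 23, 45, 39, 20, 56)
)

# Rows the score rule applies to
restricted <- df$Sex == "M" & df$Age >= 6 & df$Age <= 11

# Keep rows outside the restricted group, or inside it with a score in [34, 100]
keep <- !restricted | (df$Score >= 34 & df$Score <= 100)
df[keep, ]
```

Since the condition is evaluated row by row, it gives the same rows with or without grouping by ID.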

Subsetting in Python using multiple row ranges

You can use numpy.r_ to select multiple ranges at once:

import numpy as np
import matplotlib.pyplot as plt

rows = np.r_[0:34, 80:101]  # rows 0-33 and 80-100
plt.scatter(df.iloc[rows, 1], df.iloc[rows, 0])

Subset multiple columns in R with multiple matches

You can use rowSums:

df[rowSums(df[-1] == criteria) >= 2, ]

# x Col1 Col2 Col3
#1 1 A A A
#4 4 B A A

If criteria has length greater than 1, you cannot use == directly; in that case, use sapply with %in%:

df[rowSums(sapply(df[-1], `%in%`, criteria)) >= 2, ]

In dplyr you can use filter with rowwise:

library(dplyr)

df %>%
  rowwise() %>%
  filter(sum(c_across(starts_with('col')) %in% criteria) >= 2)
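
Here is a self-contained base R sketch of both rowSums variants. The data frame is invented so that the single-value case reproduces the two rows shown above; criteria2 is a hypothetical two-value example.

```r
# Hypothetical data
df <- data.frame(x    = 1:4,
                 Col1 = c("A", "B", "C", "B"),
                 Col2 = c("A", "B", "A", "A"),
                 Col3 = c("A", "C", "B", "A"))

# Single value: rows where at least two of Col1..Col3 equal "A"
criteria <- "A"
single <- df[rowSums(df[-1] == criteria) >= 2, ]
single

# Several values: rows where at least two columns hold any of them
criteria2 <- c("A", "C")
multi <- df[rowSums(sapply(df[-1], `%in%`, criteria2)) >= 2, ]
multi
```

With a single value, == compares every cell against it; with several values, %in% is applied column by column and the resulting logical matrix is summed per row in the same way.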

