Subsetting a Data.Frame Given Some Criteria

Subset / filter rows in a data frame based on a condition in a column

Here are the two main approaches. I prefer the first, subset(), for its readability:

bar <- subset(foo, location == "there")

Note that you can string together many conditionals with & and | to create complex subsets.
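As a small sketch of combining conditions, assume a made-up data frame foo with a hypothetical numeric column value next to location (neither the data nor the value column comes from the original question):

```r
# Hypothetical example data; the column names are assumptions for illustration.
foo <- data.frame(
  location = c("here", "there", "there", "elsewhere"),
  value    = c(1, 5, 12, 20)
)

# AND: both conditions must hold for a row to be kept.
both <- subset(foo, location == "there" & value > 10)

# OR: a row is kept if either condition holds.
either <- subset(foo, location == "there" | value > 10)
```

Here both keeps only the row with location "there" and value 12, while either keeps every row except the first.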

The second is the indexing approach. You can index rows in R with either numeric or logical vectors. foo$location == "there" returns a logical vector of TRUE and FALSE values with one element per row of foo. You can use it as the row index to return only the rows where the condition is TRUE.

foo[foo$location == "there", ]
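To make the two kinds of index concrete, here is a minimal sketch on made-up data (the value column and the variable names keep, idx, by_logical and by_numeric are assumptions, not part of the original answer):

```r
# Hypothetical example data.
foo <- data.frame(
  location = c("here", "there", "there"),
  value    = c(1, 5, 12)
)

keep <- foo$location == "there"  # logical vector, one element per row
idx  <- which(keep)              # numeric positions of the TRUE elements

by_logical <- foo[keep, ]        # index rows with the logical vector
by_numeric <- foo[idx, ]         # index rows with the numeric positions
```

Both forms select the same two rows here; the trailing comma inside the brackets means "all columns".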

Subsetting a data.frame given some criteria

Try this one:

dat <- data.frame(Age=c(1,1,1,1,4,4,4),Height=c(0.5,0.6,0.7,0.6,2.0,2.3,2.3))

dat[dat$Age==4,2]

How to combine multiple conditions to subset a data frame using OR?

my.data.frame <- subset(data , V1 > 2 | V2 < 4)

An alternative solution that mimics the behavior of this function and would be more appropriate for inclusion within a function body:

new.data <- data[ which( data$V1 > 2 | data$V2 < 4) , ]

Some people criticize the use of which() as unnecessary, but it does prevent NA values from returning unwanted results. The equivalent of the two options demonstrated above without which() (i.e., not returning NA rows for any NAs in V1 or V2) would be:

new.data <- data[ !is.na(data$V1 > 2 | data$V2 < 4) & ( data$V1 > 2 | data$V2 < 4) , ]

Note: I want to thank the anonymous contributor that attempted to fix the error in the code immediately above, a fix that got rejected by the moderators. There was actually an additional error that I noticed when I was correcting the first one. The conditional clause that checks for NA values needs to be first if it is to be handled as I intended, since ...

> NA & 1
[1] NA
> 0 & NA
[1] FALSE

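Note that '&' returns FALSE whenever either operand is FALSE, even if the other is NA, but returns NA when one operand is NA and the other is TRUE. The order of the operands does not change the result.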
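The difference between the approaches shows up as soon as the condition evaluates to NA for some row. The following is a small sketch on made-up data (the values, and the names cond, with_which, with_subset and bare, are assumptions for illustration):

```r
# Hypothetical data with missing values in V1.
data <- data.frame(V1 = c(3, NA, 1, NA), V2 = c(5, 5, 3, 3))

# The condition is NA for row 2: NA > 2 is NA, and NA | FALSE is NA.
cond <- data$V1 > 2 | data$V2 < 4

with_which  <- data[which(cond), ]             # which() drops the NA position
with_subset <- subset(data, V1 > 2 | V2 < 4)   # subset() also drops it
bare        <- data[cond, ]                    # NA index yields an all-NA row
```

On this data, with_which and with_subset have three rows each, while bare has four, one of which consists entirely of NA values.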

How can I subset from a data frame a value in a column that matches criteria from multiple identical entries?

You could for example use:

aggregate(CX1h$netphorest, list(CX1h$uniprot,ddd$site), max)

(EDIT: as suggested in the comments)

or use a combination of with(), which(), ave() and max() to subset the rows with maximum netphorest values.
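One way the with()/which()/ave()/max() combination could look is sketched below. The data are invented, and building a single grouping key with paste() is my own choice for the sketch, not something the original answer specifies:

```r
# Hypothetical data: several netphorest scores per uniprot/site pair.
CX1h <- data.frame(
  uniprot    = c("P1", "P1", "P1", "P2", "P2"),
  site       = c(10, 10, 20, 5, 5),
  netphorest = c(0.2, 0.8, 0.5, 0.3, 0.1)
)

# One key per uniprot/site combination (an assumption for this sketch).
key <- paste(CX1h$uniprot, CX1h$site)

# ave() returns, for each row, the maximum score within that row's group;
# keeping rows where the score equals the group maximum selects the best rows.
best <- CX1h[with(CX1h, which(netphorest == ave(netphorest, key, FUN = max))), ]
```

Unlike aggregate(), this keeps all original columns of the selected rows, and it keeps ties if several rows share the group maximum.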

How to subset dataframe based on multiple conditions?

You should do:

reshape2::recast(df, Country + variable ~ Indicator)

R: subset a data frame based on conditions from another data frame

Not efficient, but it does the job:

subset(merge(observations, sampletimes), time > time1 & time < time2)
    id time measurement location time1 time2
11   1    3    3.180321        a     2     4
47   1    8    6.040612        e     7     9
83   1   13   -5.999317        i    12    14
99   1   18    2.689414        m    17    19
125  1   23   12.514722        q    22    24
137  2    8    4.420679        f     7     9
141  2    3   11.492446        b     2     4
218  2   13    6.672506        j    12    14
234  2   18   12.290339        n    17    19
250  2   23   12.610828        r    22    24
251  3    3    8.570984        c     2     4
267  3    8   -7.112291        g     7     9
283  3   13    6.287598        k    12    14
360  3   23   11.941846        s    22    24
364  3   18   -4.199001        o    17    19
376  4    3    7.133370        d     2     4
402  4    8   13.477790        h     7     9
418  4   13    3.967293        l    12    14
454  4   18   12.845535        p    17    19
490  4   23   -1.016839        t    22    24
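The question's data are not reproduced here, so the following is only a sketch of the same merge-then-filter idea on tiny invented data frames (all values, and the names merged and inside, are assumptions):

```r
# Hypothetical observations: one measurement per id and time.
observations <- data.frame(
  id          = c(1, 1, 2, 2),
  time        = c(3, 10, 3, 8),
  measurement = c(0.5, 1.5, 2.5, 3.5)
)

# Hypothetical sampling windows: id 1 has two windows, id 2 has one.
sampletimes <- data.frame(
  id    = c(1, 1, 2),
  time1 = c(2, 9, 7),
  time2 = c(4, 11, 9)
)

# merge() joins on the shared column id, pairing every observation
# with every window of the same id.
merged <- merge(observations, sampletimes)

# Keep only the pairs where the observation falls strictly inside the window.
inside <- subset(merged, time > time1 & time < time2)
```

The intermediate merged table has one row per observation/window pair, which is why this approach becomes expensive on large inputs.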

EDIT

Since you have more than 5 million rows, you should try a data.table solution:

library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]

subsetting a data frame based on a condition of one column

We can use %in%

a1 <- a[a$x %in% x,]

For subsetting only the column 'x'

a1 <- a[a$x %in% x, "x", drop=FALSE]

If we need to subset the column 'x' to create a vector based on the x vector

v1 <- a$x[a$x %in% x]
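A quick sketch of the three variants on invented data (the contents of a and x, and the name a2, are assumptions for illustration):

```r
# Hypothetical data frame and lookup vector.
a <- data.frame(x = c(1, 2, 3, 4, 5), y = c("a", "b", "c", "d", "e"))
x <- c(2, 4, 6)

a1 <- a[a$x %in% x, ]                   # full rows whose x appears in the vector
a2 <- a[a$x %in% x, "x", drop = FALSE]  # same rows, column x only, still a data frame
v1 <- a$x[a$x %in% x]                   # the matching values as a plain vector
```

Values in the lookup vector that never occur in a$x (6 here) are simply ignored, and %in% never returns NA, so no extra NA handling is needed.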

Subset dataframe in a list by a dataframe column criteria

Following the answers and comments of @David Arenburg, @akrun and @shadow, here are three possible solutions to the problem I posted:

Option 1)

library(data.table)
rbindlist(l)[abs(y - a) == min(abs(y - a))]

Option 2) (needs an R version > 3.1.2)

library(dplyr)
bind_rows(l) %>% filter(abs(y - a) == min(abs(y - a)))

Option 3) (also works, but is computationally slower than the first two options if used within a big loop or an iterative process)

l[[which.min(sapply(l, function(df) sum(abs(df$y - a))))]]
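For readers without data.table or dplyr, the row-wise idea behind Options 1 and 2 can be sketched in base R. The list l, the target a and all values below are invented, and do.call(rbind, ...) is my substitution for rbindlist() and bind_rows():

```r
# Hypothetical list of data frames and target value.
l <- list(
  data.frame(y = c(1.0, 4.0)),
  data.frame(y = c(2.5, 9.0))
)
a <- 3

# Stack all data frames, then keep the row(s) whose y is closest to a.
all_rows <- do.call(rbind, l)
dev      <- abs(all_rows$y - a)
closest  <- all_rows[dev == min(dev), , drop = FALSE]

# Option 3 instead picks the whole data frame with the smallest total deviation.
best_index <- which.min(sapply(l, function(df) sum(abs(df$y - a))))
```

On this data the closest single row (y = 2.5) sits in the second data frame, while Option 3 selects the first data frame, so the two approaches answer slightly different questions.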

