Subsetting a Data.Frame Given Some Criteria

Subset / filter rows in a data frame based on a condition in a column

Here are the two main approaches. I prefer the first, subset(), for its readability:

bar <- subset(foo, location == "there")

Note that you can string together many conditionals with & and | to create complex subsets.
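As a minimal sketch of combining conditions, assuming a small made-up data frame (the `value` column and all values are hypothetical, chosen only for illustration):

```r
# Hypothetical example data; the 'value' column is an assumption
foo <- data.frame(
  location = c("here", "there", "there", "elsewhere"),
  value    = c(10, 5, 20, 30)
)

# AND: both conditions must hold for a row to be kept
both <- subset(foo, location == "there" & value > 8)

# OR: at least one condition must hold
either <- subset(foo, location == "there" | value > 25)

print(both)
print(either)
```

Use the vectorised operators & and | here; the scalar forms && and || are meant for if() conditions, not for row-wise filtering.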

The second is the indexing approach. You can index rows in R with either a numeric or a logical vector. foo$location == "there" returns a vector of TRUE and FALSE values with one element per row of foo. Use it as the row index to return only the rows where the condition is TRUE.

foo[foo$location == "there", ]
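To make the two kinds of index concrete, here is a small sketch on made-up data (the `id` column is hypothetical). It selects the same rows once with a logical vector and once with the integer positions that which() derives from it:

```r
# Hypothetical example data
foo <- data.frame(
  location = c("here", "there", "there"),
  id       = 1:3
)

keep <- foo$location == "there"  # logical vector, one element per row
rows <- which(keep)              # integer positions of the TRUE elements

by_logical <- foo[keep, ]        # index rows with the logical vector
by_numeric <- foo[rows, ]        # index rows with the positions

print(keep)
print(rows)
print(by_logical)
```

The trailing comma in foo[keep, ] matters: it leaves the column index empty, so all columns are returned.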

Subsetting a data.frame given some criteria

Try this one:

dat <- data.frame(Age=c(1,1,1,1,4,4,4),Height=c(0.5,0.6,0.7,0.6,2.0,2.3,2.3))

dat[dat$Age==4,2]

How to combine multiple conditions to subset a data frame using OR?

my.data.frame <- subset(data, V1 > 2 | V2 < 4)

An alternative solution that mimics the behavior of this function and would be more appropriate for inclusion within a function body:

new.data <- data[ which( data$V1 > 2 | data$V2 < 4) , ]

Some people criticize the use of which as unnecessary, but it does prevent NA values from returning unwanted results. The equivalent of the two options demonstrated above (i.e., not returning NA rows for any NAs in V1 or V2) without the which would be:

 new.data <- data[ !is.na(data$V1 > 2 | data$V2 < 4) & ( data$V1 > 2 | data$V2 < 4) , ]

Note: I want to thank the anonymous contributor who attempted to fix the error in the code immediately above, a fix that was rejected by the moderators. There was actually an additional error that I noticed while correcting the first one. The clause that checks for NA values has to evaluate to FALSE for the rows to be dropped, because FALSE & NA is FALSE whereas NA & TRUE stays NA:

> NA & 1
[1] NA
> 0 & NA
[1] FALSE

The result of '&' with an NA depends on the value of the other operand, not on the order of the arguments.
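The difference between these forms only shows up when the condition itself evaluates to NA. A minimal sketch, assuming made-up values with missing entries in both columns:

```r
# Hypothetical data with missing values
data <- data.frame(V1 = c(3, NA, 1, NA), V2 = c(5, 5, 2, NA))

# Row-wise condition; it is NA where neither side can decide the OR
cond <- data$V1 > 2 | data$V2 < 4

plain      <- data[cond, ]                   # NA in the index yields all-NA rows
with_which <- data[which(cond), ]            # which() keeps only TRUE positions
with_isna  <- data[!is.na(cond) & cond, ]    # explicit NA guard on the condition
with_sub   <- subset(data, V1 > 2 | V2 < 4)  # subset() treats NA as FALSE

print(cond)
print(plain)
print(with_which)
```

Here plain keeps one all-NA row for every NA in cond, while the other three forms return the same two rows.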

How can I subset from a data frame a value in a column that matches criteria from multiple identical entries?

You could for example use:

aggregate(CX1h$netphorest, list(CX1h$uniprot,ddd$site), max)

(EDIT: as suggested in the comments)

or use a combination of with(), which(), ave() and max() to subset the rows with maximum netphorest values.
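One way that combination could look is sketched below. This is not necessarily what the answer had in mind; the column names are taken from the aggregate() call above, the values are made up, and `group_max` and `best` are hypothetical helper names:

```r
# Hypothetical data; column names follow the answer, values are invented
CX1h <- data.frame(
  uniprot    = c("P1", "P1", "P1", "P2", "P2"),
  site       = c(10, 10, 20, 5, 5),
  netphorest = c(0.2, 0.7, 0.4, 0.9, 0.3)
)

# For every row, ave() returns the maximum within its (uniprot, site) group.
# interaction(..., drop = TRUE) builds one grouping factor that contains
# only the combinations actually present in the data.
group_max <- with(
  CX1h,
  ave(netphorest, interaction(uniprot, site, drop = TRUE), FUN = max)
)

# Keep the rows whose value equals the maximum of their group
best <- CX1h[which(CX1h$netphorest == group_max), ]
print(best)
```

Unlike aggregate(), this keeps whole rows, so any other columns of the data frame come along with the maximum.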

How to subset dataframe based on multiple conditions?

You should do:

reshape2::recast(df, Country + variable ~ Indicator)

R: subset a data frame based on conditions from another data frame

Not efficient, but it does the job:

 subset(merge(observations,sampletimes), time > time1 & time < time2)
        id time measurement location time1 time2
    11   1    3    3.180321        a     2     4
    47   1    8    6.040612        e     7     9
    83   1   13   -5.999317        i    12    14
    99   1   18    2.689414        m    17    19
    125  1   23   12.514722        q    22    24
    137  2    8    4.420679        f     7     9
    141  2    3   11.492446        b     2     4
    218  2   13    6.672506        j    12    14
    234  2   18   12.290339        n    17    19
    250  2   23   12.610828        r    22    24
    251  3    3    8.570984        c     2     4
    267  3    8   -7.112291        g     7     9
    283  3   13    6.287598        k    12    14
    360  3   23   11.941846        s    22    24
    364  3   18   -4.199001        o    17    19
    376  4    3    7.133370        d     2     4
    402  4    8   13.477790        h     7     9
    418  4   13    3.967293        l    12    14
    454  4   18   12.845535        p    17    19
    490  4   23   -1.016839        t    22    24
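To see why this works (and why it is not efficient), here is a self-contained sketch on hypothetical miniature versions of the two tables. The values are made up; only the column names follow the output above:

```r
# Hypothetical miniature tables
observations <- data.frame(
  id          = c(1, 1, 1, 2, 2),
  time        = c(1, 3, 8, 3, 10),
  measurement = c(0.5, 3.2, 6.0, 11.5, 4.4)
)
sampletimes <- data.frame(
  id       = c(1, 1, 2),
  location = c("a", "e", "b"),
  time1    = c(2, 7, 2),
  time2    = c(4, 9, 4)
)

# merge() joins on the shared column 'id', pairing every observation
# with every sampling window that has the same id
joined <- merge(observations, sampletimes)

# Keep only the observations that fall strictly inside a window
inside <- subset(joined, time > time1 & time < time2)

print(nrow(joined))
print(inside)
```

The intermediate joined table holds every observation/window pair per id before any filtering happens, which is what makes the approach costly on large inputs.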

EDIT

Since you have more than 5 million rows, you should try a data.table solution:

library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]

subsetting a data frame based on a condition of one column

We can use %in%

a1 <- a[a$x %in% x,]

To keep only the column 'x' (still as a data frame):

a1 <- a[a$x %in% x, "x", drop=FALSE]

If we need the matching values of column 'x' as a plain vector rather than a data frame:

v1 <- a$x[a$x %in% x]
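A small sketch of the three variants side by side, assuming a made-up data frame `a` and lookup vector `x` (the `y` column and `a2` are hypothetical names added for illustration):

```r
# Hypothetical example data
a <- data.frame(x = c(1, 2, 3, 4, 5), y = letters[1:5])
x <- c(2, 4, 6)  # values to keep; 6 does not occur in a$x

a1 <- a[a$x %in% x, ]                   # matching rows, all columns
a2 <- a[a$x %in% x, "x", drop = FALSE]  # matching rows, one-column data frame
v1 <- a$x[a$x %in% x]                   # matching values as a plain vector

print(a1)
print(a2)
print(v1)
```

Without drop = FALSE, selecting a single column from a data.frame returns a vector, which is why the second variant sets it explicitly.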

Subset dataframe in a list by a dataframe column criteria

Following the answers and comments of @David Arenburg, @akrun and @shadow, here are three possible solutions to the problem I posted:

Option 1)

library(data.table)
rbindlist(l)[abs(y - a) == min(abs(y - a))]

Option 2) (needs an R version > 3.1.2)

library(dplyr)
bind_rows(l) %>% filter(abs(y - a) == min(abs(y - a)))

Option 3) (also works perfectly, but is computationally slower than the first two options if used within a big loop or an iterative process)

l[[which.min(sapply(l, function(df) sum(abs(df$y - a))))]]
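Note that Option 1 selects individual rows from the combined data, while Option 3 selects one whole data frame from the list. A base R sketch of both behaviours, assuming a hypothetical list `l` of data frames with a column `y` and a scalar target `a` (all values made up):

```r
# Hypothetical input: a list of data frames and a scalar target
l <- list(
  data.frame(y = c(1, 5, 9)),
  data.frame(y = c(4, 4.2, 4.4)),
  data.frame(y = c(3.9, 20, 30))
)
a <- 4

# Row-level selection: stack all data frames, keep the closest row(s)
all_rows <- do.call(rbind, l)
d <- abs(all_rows$y - a)
closest_rows <- all_rows[d == min(d), , drop = FALSE]

# Data-frame-level selection: the element with the smallest total distance
total_dist <- sapply(l, function(df) sum(abs(df$y - a)))
closest_df <- l[[which.min(total_dist)]]

print(closest_rows)
print(closest_df)
```

Which of the two is wanted depends on whether the goal is the single best observation or the best-matching data frame as a whole.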