R Keep Rows with at Least One Column Greater Than Value

You can use rowSums to construct the condition in base R:

df[rowSums(df > 10) >= 1, ]
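
One thing to watch with this condition is missing values. The following is a minimal sketch on a made-up data frame (the column names and values are assumptions, not from the question) showing how na.rm = TRUE changes the result:

```r
# Hypothetical toy data; the NA in row 3 is there to show the na.rm behaviour.
df <- data.frame(a = c(1, 12, NA, 3),
                 b = c(5, 2, 11, 4),
                 c = c(9, 8, 7, 6))

# Without na.rm, the count for row 3 is NA, so the condition is NA as well
# (indexing with an NA condition produces a row filled with NA).
keep_naive <- rowSums(df > 10) >= 1

# With na.rm = TRUE, comparisons involving NA are ignored when counting.
keep <- rowSums(df > 10, na.rm = TRUE) >= 1
result <- df[keep, ]
```

Here row 3 is kept because b is 11, even though a is missing.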

With dplyr (0.7.0 or later), you can use filter_all like this:

library(dplyr)
filter_all(df, any_vars(. > 10))
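
In more recent dplyr releases the scoped verbs (filter_all, filter_at) are superseded, and the same condition can be written with if_any. A minimal sketch, assuming dplyr 1.0.4 or later is installed and using a made-up data frame:

```r
library(dplyr)

# Hypothetical toy data (not from the original question).
df <- data.frame(a = c(1, 12, 3),
                 b = c(5, 2, 4),
                 c = c(9, 8, 11))

# Keep rows where any column is greater than 10.
res_any <- df %>% filter(if_any(everything(), ~ .x > 10))

# The same idea restricted to a subset of columns.
res_sub <- df %>% filter(if_any(c(a, b), ~ .x > 10))
```

res_any keeps the rows with a = 12 and c = 11, while res_sub keeps only the row with a = 12 because column c is not checked.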

Subsetting Rows with a Column Value Greater than a Threshold

We can use rowSums

data[rowSums(data[5:70] > 7) > 0, ]

Or with subset

subset(data, rowSums(data[5:70] > 7) > 0)

We can also use filter_at from dplyr with any_vars

library(dplyr)
data %>% filter_at(vars(5:70), any_vars(. > 7))

Using reproducible data from mtcars (stealing idea from @Maurits Evers)

mtcars[rowSums(mtcars[3:11] > 300) > 0, ]

#                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Hornet Sportabout   18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
#Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
#Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
#Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
#Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
#AMC Javelin         15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
#Pontiac Firebird    19.2   8  400 175 3.08 3.845 17.05  0  0    3    2
#Ford Pantera L      15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
#Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8

Using filter_at also gives the same output

mtcars %>% filter_at(vars(3:11), any_vars(. > 300))

How to filter the rows that have at least one column greater than a threshold?

We can use rowSums directly

df[rowSums(df[2:4] >= 0.5) > 0, ]

#  Name  Clust1  Clust2   Clust3
#2   BB 0.76946 0.03242 0.029358
#3   CC 0.10990 0.52171 0.283859

Or dplyr version with filter_at and any_vars

library(dplyr)
df %>%
  filter_at(vars(starts_with("Clust")), any_vars(. >= 0.5))

As for fixing your code: as @thelatemail mentioned, you are including column 1 (the Name column) in rowSums, so you need to restrict it to columns 2:4. We can also filter directly instead of creating a new variable with mutate, so the following should work.

df %>% filter(rowSums(.[,c(2:4)] >= 0.5) > 0)

We can also use an apply-based version, although it will be slower on larger datasets

df[apply(df[2:4] >= 0.5, 1, any), ]
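
The positional index 2:4 can also be replaced by a name-based selection in base R, similar in spirit to starts_with. A small sketch follows; the data frame is hypothetical (row AA is invented, rows BB and CC mirror the output shown above):

```r
# Hypothetical data; only rows BB and CC mirror the output shown above.
df <- data.frame(Name   = c("AA", "BB", "CC"),
                 Clust1 = c(0.10000, 0.76946, 0.10990),
                 Clust2 = c(0.20000, 0.03242, 0.52171),
                 Clust3 = c(0.30000, 0.029358, 0.283859))

# Logical vector marking the columns whose names start with "Clust".
clust_cols <- startsWith(names(df), "Clust")

# Keep rows where at least one of those columns is >= 0.5.
result <- df[rowSums(df[clust_cols] >= 0.5) > 0, ]
```

Selecting by name keeps the code working if columns are later added or reordered.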

Keep only rows if number is greater than... in specific column

Not entirely sure I understood your problem statement correctly, but perhaps something like this

library(dplyr)
library(stringr)
exp_data %>% filter(str_detect(Change, "\\d{3}"))
#       Seq          Change     Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg      2

Or the same in base R

exp_data[grep("\\d{3}", exp_data$Change), ]
#       Seq          Change     Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg      2

The idea is to use a regular expression to keep only those rows where Change contains at least one run of three consecutive digits, i.e. a number with three or more digits.
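
If the cutoff is not a round number of digits, the digit-count regex no longer fits. One possible alternative, sketched here on made-up data (the extra rows and the cutoff variable are assumptions), is to extract the numbers and compare them numerically:

```r
# Hypothetical data shaped like the output shown above.
exp_data <- data.frame(
  Seq    = c("AAAARVDS", "BBBBRVDS", "CCCCRVDS"),
  Change = c("M[+486], W[+12]", "M[+16]", "W[+12], K[+42]"),
  stringsAsFactors = FALSE
)

# Pull every run of digits out of each Change string (one character vector per row).
nums <- regmatches(exp_data$Change, gregexpr("[0-9]+", exp_data$Change))

# Keep the row if any extracted number reaches the cutoff.
cutoff <- 100
keep <- vapply(nums, function(x) any(as.numeric(x) >= cutoff), logical(1))
result <- exp_data[keep, ]
```

Rows without any digits produce an empty vector, for which any() returns FALSE, so they are dropped.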

filter rows that have X number of columns with value greater than Y

A simple and quick way is base R's rowSums; here we keep the rows that have a value greater than 5.5 in more than one column.

df[rowSums(df > 5.5) > 1, ]

#                    tumor     tumor     tumor     tumor     tumor
#A_33_P3390097    5.698576  5.797294  5.671845  5.961686  5.751165
#GE_BrightCorner 15.596546 15.833930 16.165919 15.274273 16.018045
#NM_001166137     6.432376  6.062449  6.674411  6.475263  6.856038
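
The same pattern can be wrapped in a small function so that both the threshold and the required number of columns are parameters. This is only a sketch: keep_rows_above and its arguments are hypothetical names, and the data is made up.

```r
# Hypothetical helper: keep rows with at least `min_cols` numeric values above `threshold`.
keep_rows_above <- function(df, threshold, min_cols = 1) {
  num  <- df[vapply(df, is.numeric, logical(1))]    # ignore non-numeric columns
  hits <- rowSums(num > threshold, na.rm = TRUE)    # per-row count of values above threshold
  df[hits >= min_cols, , drop = FALSE]
}

# Made-up expression-like data.
expr <- data.frame(s1 = c(5.7, 2.1, 6.4, 5.9),
                   s2 = c(5.8, 5.6, 6.1, 1.0),
                   s3 = c(5.6, 3.3, 6.7, 2.2),
                   row.names = c("g1", "g2", "g3", "g4"))

# "More than one column" above 5.5 is the same as "at least two".
result <- keep_rows_above(expr, threshold = 5.5, min_cols = 2)
```

Rows g2 and g4 each have only one value above 5.5, so only g1 and g3 remain.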

How to select column values based on a greater than condition in row values

We can create a logical matrix by comparing the data frame with 3, count the TRUE values in each column with colSums, and select only those columns that have at least one value greater than 3.

mtcars[colSums(mtcars > 3) > 0]

#                mpg cyl  disp  hp drat    wt  qsec gear carb
#Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46    4    4
#Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02    4    4
#Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61    4    1
#Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44    3    1
#....

Variation using sapply

mtcars[sapply(mtcars, function(x) any(x > 3))]
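
Both versions assume that every column is numeric, as in mtcars. With mixed column types, comparing a factor column with a number gives a warning and NA, and a character column is compared as text. A hedged sketch using the built-in iris data, where the comparison is guarded with is.numeric:

```r
# Only test numeric columns; non-numeric columns (here Species) are dropped.
has_large <- vapply(iris, function(x) is.numeric(x) && any(x > 3), logical(1))
result <- iris[has_large]
```

Petal.Width never exceeds 3 and Species is a factor, so the result contains Sepal.Length, Sepal.Width and Petal.Length.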

subset rows with (1) ALL and (2) ANY columns larger than a specific value

See the functions all() and any() for the first and second parts of your question, respectively. The apply() function can be used to run functions over rows or columns (MARGIN = 1 is rows, MARGIN = 2 is columns). Note that I use apply() on df[, -1] to ignore the id variable when doing the comparisons.

Part 1:

> df <- data.frame(id=c(1:5), v1=c(0,15,9,12,7), v2=c(9,32,6,17,11))
> df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ]
  id v1 v2
2  2 15 32
4  4 12 17

Part 2:

> df[apply(df[, -1], MARGIN = 1, function(x) any(x > 10)), ]
  id v1 v2
2  2 15 32
4  4 12 17
5  5  7 11

To see what is going on: x > 10 returns a logical vector for each row (via apply()), indicating whether each element is greater than 10. all() returns TRUE if all elements of the input vector are TRUE, and FALSE otherwise. any() returns TRUE if any element of the input is TRUE, and FALSE if all are FALSE.

I then use the logical vector resulting from the apply() call

> apply(df[, -1], MARGIN = 1, function(x) all(x > 10))
[1] FALSE TRUE FALSE TRUE FALSE
> apply(df[, -1], MARGIN = 1, function(x) any(x > 10))
[1] FALSE TRUE FALSE TRUE TRUE

to subset df (as shown above).
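
The same two conditions can also be written without apply(), by counting the TRUE values per row with rowSums. A short sketch using the df defined above (cond, all_rows and any_rows are illustrative names):

```r
df <- data.frame(id = 1:5, v1 = c(0, 15, 9, 12, 7), v2 = c(9, 32, 6, 17, 11))

# Logical matrix of comparisons, ignoring the id column.
cond <- df[, -1] > 10

# ALL columns > 10: every comparison in the row is TRUE.
all_rows <- rowSums(cond) == ncol(cond)

# ANY column > 10: at least one comparison in the row is TRUE.
any_rows <- rowSums(cond) > 0

part1 <- df[all_rows, ]
part2 <- df[any_rows, ]
```

This gives the same rows as the apply() calls (ids 2 and 4 for part 1; ids 2, 4 and 5 for part 2) and tends to be faster on large data frames.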

R select entire columns where at least one value meets a condition

Try this:

> set.seed(007) # for the example being reproducible
> X <- matrix(rnorm(100), 20) # generating some data
> X <- cbind(X, runif(20, max=.48)) # generating a column with all values < 0.5
> colnames(X) <- paste('col', 1:ncol(X), sep='') # some column names
> X # this is what the matrix looks like
col1 col2 col3 col4 col5 col6
[1,] 2.287247161 0.83975036 1.218550535 0.07637147 0.342585350 0.335107187
[2,] -1.196771682 0.70534183 -0.699317079 0.15915528 0.004248236 0.419502015
[3,] -0.694292510 1.30596472 -0.285432752 0.54367418 0.029219842 0.346358090
[4,] -0.412292951 -1.38799622 -1.311552673 0.70480735 -0.393423429 0.212185020
[5,] -0.970673341 1.27291686 -0.391012431 0.31896914 -0.792704563 0.224824248
[6,] -0.947279945 0.18419277 -0.401526613 1.10924979 -0.311701865 0.415837389
[7,] 0.748139340 0.75227990 1.350517581 0.76915419 -0.346068592 0.057660111
[8,] -0.116955226 0.59174505 0.591190027 1.15347367 -0.304607588 0.007812921
[9,] 0.152657626 -0.98305260 0.100525456 1.26068350 -1.785893487 0.298192099
[10,] 2.189978107 -0.27606396 0.931071996 0.70062351 0.587274672 0.216225091
[11,] 0.356986230 -0.87085102 -0.262742349 0.43262716 1.635794434 0.026097800
[12,] 2.716751783 0.71871055 -0.007668105 -0.92260172 -0.645423474 0.190567072
[13,] 2.281451926 0.11065288 0.367153007 -0.61558421 0.618992169 0.402829397
[14,] 0.324020540 -0.07846677 1.707162545 -0.86665969 0.236393598 0.248196976
[15,] 1.896067067 -0.42049046 0.723740263 -1.63951709 0.846500899 0.406511129
[16,] 0.467680511 -0.56212588 0.481036049 -1.32583924 -0.573645739 0.162457572
[17,] -0.893800723 0.99751344 -1.567868244 -0.88903673 1.117993204 0.383801555
[18,] -0.307328300 -1.10513006 0.318250283 -0.55760233 -1.540001132 0.347037954
[19,] -0.004822422 -0.14228783 0.165991451 -0.06240231 -0.438123899 0.262938992
[20,] 0.988164149 0.31499490 -0.899907630 2.42269298 -0.150672971 0.139233120
>
> # defining an index for selecting if the condition is met
> ind <- apply(X, 2, function(X) any(abs(X)>0.5))
> X[,ind] # since col6 only has values less than 0.5 it is not taken
col1 col2 col3 col4 col5
[1,] 2.287247161 0.83975036 1.218550535 0.07637147 0.342585350
[2,] -1.196771682 0.70534183 -0.699317079 0.15915528 0.004248236
[3,] -0.694292510 1.30596472 -0.285432752 0.54367418 0.029219842
[4,] -0.412292951 -1.38799622 -1.311552673 0.70480735 -0.393423429
[5,] -0.970673341 1.27291686 -0.391012431 0.31896914 -0.792704563
[6,] -0.947279945 0.18419277 -0.401526613 1.10924979 -0.311701865
[7,] 0.748139340 0.75227990 1.350517581 0.76915419 -0.346068592
[8,] -0.116955226 0.59174505 0.591190027 1.15347367 -0.304607588
[9,] 0.152657626 -0.98305260 0.100525456 1.26068350 -1.785893487
[10,] 2.189978107 -0.27606396 0.931071996 0.70062351 0.587274672
[11,] 0.356986230 -0.87085102 -0.262742349 0.43262716 1.635794434
[12,] 2.716751783 0.71871055 -0.007668105 -0.92260172 -0.645423474
[13,] 2.281451926 0.11065288 0.367153007 -0.61558421 0.618992169
[14,] 0.324020540 -0.07846677 1.707162545 -0.86665969 0.236393598
[15,] 1.896067067 -0.42049046 0.723740263 -1.63951709 0.846500899
[16,] 0.467680511 -0.56212588 0.481036049 -1.32583924 -0.573645739
[17,] -0.893800723 0.99751344 -1.567868244 -0.88903673 1.117993204
[18,] -0.307328300 -1.10513006 0.318250283 -0.55760233 -1.540001132
[19,] -0.004822422 -0.14228783 0.165991451 -0.06240231 -0.438123899
[20,] 0.988164149 0.31499490 -0.899907630 2.42269298 -0.150672971

# It could be done just in one step avoiding 'ind'
X[, apply(X, 2, function(X) any(abs(X)>0.5))]
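
A vectorised variant of the same selection uses colSums. It is also worth adding drop = FALSE, because when only one column meets the condition, plain indexing turns the result into a vector. A minimal sketch on a small made-up matrix (values chosen so that only col1 qualifies):

```r
# Hypothetical matrix: only col1 contains a value with absolute value > 0.5.
X <- cbind(col1 = c(0.1, -0.9, 0.2),
           col2 = c(0.3, 0.2, -0.1),
           col3 = c(0.4, 0.1, 0.0))

# TRUE for columns with at least one |value| > 0.5.
ind <- colSums(abs(X) > 0.5) > 0

dropped <- X[, ind]                # plain numeric vector, matrix structure is lost
kept    <- X[, ind, drop = FALSE]  # still a 3 x 1 matrix with its column name
```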

Filter data but keep at least one row for each ID

You can use tidyr::complete():

library(dplyr)

df %>%
  filter(col1 == 1 | col2 == 1) %>%
  tidyr::complete(id = df$id, fill = list(col3 = "-"))

# # A tibble: 4 × 4
#   id     col1  col2 col3
#   <chr> <dbl> <dbl> <chr>
# 1 a         1     0 A
# 2 a         1     1 B
# 3 b        NA    NA -
# 4 c         0     1 E
