R Keep Rows with at Least One Column Greater Than Value

You can use rowSums to construct the condition in base R:

df[rowSums(df > 10) >= 1, ]
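One thing to watch with the rowSums approach is missing values: a comparison against NA yields NA, so rowSums returns NA for that row and the subset then contains an all-NA row. A minimal sketch, assuming a small made-up data frame (the column names x, y, z are hypothetical), that passes na.rm = TRUE so NAs simply count as "not greater":

```r
# Hypothetical data with some missing values
df <- data.frame(x = c(1, 11, NA), y = c(2, 3, 12), z = c(NA, 4, 5))

# Count, per row, how many columns exceed 10; NAs are ignored
hits <- rowSums(df > 10, na.rm = TRUE)

# Keep rows with at least one column greater than 10
res <- df[hits >= 1, ]
res
```

Here rows 2 and 3 are kept; row 1 is dropped because none of its non-missing values exceeds 10.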

With dplyr 0.7.0 or later, you can use filter_all like this:

library(dplyr)
filter_all(df, any_vars(. > 10))
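In more recent dplyr releases (1.0.4 and later, as far as I know) the scoped verbs such as filter_all and filter_at are superseded, and the same condition can be written with if_any() inside filter(). A small sketch on a made-up all-numeric data frame (the columns a, b, c are hypothetical):

```r
library(dplyr)

# Hypothetical all-numeric data frame
df <- data.frame(a = c(1, 12, 3), b = c(4, 5, 20), c = c(7, 8, 9))

# Keep rows where any column is greater than 10
res <- filter(df, if_any(everything(), ~ .x > 10))
res
```

To restrict the test to some columns, replace everything() with a selection such as 5:70 or starts_with("Clust").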

Subsetting Rows with a Column Value Greater than a Threshold

We can use rowSums

data[rowSums(data[5:70] > 7) > 0, ]

Or with subset

subset(data, rowSums(data[5:70] > 7) > 0)

We can also use filter_at from dplyr with any_vars

library(dplyr)
data %>% filter_at(vars(5:70), any_vars(. > 7))

Using reproducible data from mtcars (stealing idea from @Maurits Evers)

mtcars[rowSums(mtcars[3:11] > 300) > 0, ]

#                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Hornet Sportabout   18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
#Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
#Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
#Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
#Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
#AMC Javelin         15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
#Pontiac Firebird    19.2   8  400 175 3.08 3.845 17.05  0  0    3    2
#Ford Pantera L      15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
#Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8

Using filter_at also gives the same output

mtcars %>% filter_at(vars(3:11), any_vars(. > 300))

How to filter the rows that have at least one column greater than a threshold?

We can use rowSums directly

df[rowSums(df[2:4] >= 0.5) > 0, ]

#  Name  Clust1  Clust2   Clust3
#2   BB 0.76946 0.03242 0.029358
#3   CC 0.10990 0.52171 0.283859

Or dplyr version with filter_at and any_vars

library(dplyr)
df %>%
  filter_at(vars(starts_with("Clust")), any_vars(. >= 0.5))

As for fixing your code: as @thelatemail mentioned, you are including column 1 (the Name column) in rowSums, so you want to restrict the comparison to columns 2:4. We can also filter directly instead of creating a new variable with mutate, so the following should work.

df %>% filter(rowSums(.[,c(2:4)] >= 0.5) > 0)

We can also use an apply version, although it would be slower for larger datasets

df[apply(df[2:4] >= 0.5, 1, any), ]

Keep only rows if number is greater than... in specific column

Not entirely sure I understood your problem statement correctly, but perhaps something like this

library(dplyr)
library(stringr)
exp_data %>% filter(str_detect(Change, "\\d{3}"))
#       Seq          Change     Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg      2

Or the same in base R

exp_data[grep("\\d{3}", exp_data$Change), ]
#       Seq          Change     Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg      2

The idea is to use a regular expression to keep only those rows where Change contains at least one sequence of three consecutive digits.
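If the threshold does not line up with a digit count, an alternative is to pull the numbers out of the string and compare them numerically. This is only a sketch: the first row below is taken from the output above, the second row is made up for illustration, and the cutoff of 100 is an assumption about what "greater than" means here.

```r
# First row from the example output; second row is hypothetical
exp_data <- data.frame(
  Seq = c("AAAARVDS", "BBBBRVDS"),
  Change = c("M[+486], W[+12]", "M[+16], W[+12]"),
  stringsAsFactors = FALSE
)

# Extract every run of digits from each Change string
nums <- regmatches(exp_data$Change, gregexpr("[0-9]+", exp_data$Change))

# TRUE for rows where at least one extracted number is 100 or more
keep <- vapply(nums, function(x) any(as.numeric(x) >= 100), logical(1))

exp_data[keep, ]
```

Only the first row is kept, since 486 is the only number at or above the cutoff.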

filter rows that have X number of columns with value greater than Y

A simple and quick way is base R rowSums, where we keep the rows that have a value greater than 5.5 in more than one column.

df[rowSums(df > 5.5) > 1, ]

#                    tumor     tumor     tumor     tumor     tumor
#A_33_P3390097    5.698576  5.797294  5.671845  5.961686  5.751165
#GE_BrightCorner 15.596546 15.833930 16.165919 15.274273 16.018045
#NM_001166137     6.432376  6.062449  6.674411  6.475263  6.856038
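The same idea can be wrapped in a small function so that both the value (Y) and the number of columns (X) are parameters. The helper name keep_rows and the sample data are hypothetical, and the sketch assumes all columns are numeric:

```r
# Hypothetical helper: keep rows where at least `min_cols` columns exceed `threshold`
keep_rows <- function(df, threshold, min_cols = 1) {
  df[rowSums(df > threshold, na.rm = TRUE) >= min_cols, , drop = FALSE]
}

# Made-up expression-like data
df <- data.frame(
  s1 = c(5.7, 2.1, 6.4),
  s2 = c(5.8, 5.9, 1.0),
  s3 = c(5.6, 3.0, 6.7),
  row.names = c("g1", "g2", "g3")
)

# Rows with at least two columns greater than 5.5
res <- keep_rows(df, 5.5, min_cols = 2)
res
```

Rows g1 and g3 are kept; g2 has only one value above 5.5. Note that min_cols = 2 corresponds to the "> 1" condition used above.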

How to select column values based on a greater than condition in row values

We can create a logical matrix by comparing the data frame with 3, count the TRUE values in each column with colSums, and select only those columns that have at least one value greater than 3.

mtcars[colSums(mtcars > 3) > 0]

#                     mpg cyl  disp  hp drat    wt  qsec gear carb
#Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46    4    4
#Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02    4    4
#Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61    4    1
#Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44    3    1
#....

Variation using sapply

mtcars[sapply(mtcars, function(x) any(x > 3))]
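Both versions assume every column is numeric, as in mtcars. With a character column the comparison is done on strings, which is rarely what is wanted. A sketch for mixed data, assuming we want to keep non-numeric columns as they are and test only the numeric ones (the data frame and its column names are made up):

```r
# Hypothetical data frame with one character column
df <- data.frame(id = c("a", "b"), lo = c(1, 2), hi = c(2, 9),
                 stringsAsFactors = FALSE)

# Which columns are numeric
is_num <- vapply(df, is.numeric, logical(1))

# Keep non-numeric columns, plus numeric columns with any value above 3
keep <- !is_num | vapply(df, function(x) is.numeric(x) && any(x > 3), logical(1))

res <- df[keep]
res
```

This keeps id and hi and drops lo.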

subset rows with (1) ALL and (2) ANY columns larger than a specific value

See functions all() and any() for the first and second parts of your questions respectively. The apply() function can be used to run functions over rows or columns. (MARGIN = 1 is rows, MARGIN = 2 is columns, etc). Note I use apply() on df[, -1] to ignore the id variable when doing the comparisons.

Part 1:

> df <- data.frame(id=c(1:5), v1=c(0,15,9,12,7), v2=c(9,32,6,17,11))
> df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ]
  id v1 v2
2  2 15 32
4  4 12 17

Part 2:

> df[apply(df[, -1], MARGIN = 1, function(x) any(x > 10)), ]
  id v1 v2
2  2 15 32
4  4 12 17
5  5  7 11

To see what is going on, x > 10 returns a logical vector for each row (via apply()) indicating whether each element is greater than 10. all() returns TRUE if all elements of the input vector are TRUE and FALSE otherwise. any() returns TRUE if any of the elements in the input is TRUE and FALSE if all are FALSE.

I then use the logical vector resulting from the apply() call

> apply(df[, -1], MARGIN = 1, function(x) all(x > 10))
[1] FALSE  TRUE FALSE  TRUE FALSE
> apply(df[, -1], MARGIN = 1, function(x) any(x > 10))
[1] FALSE  TRUE FALSE  TRUE  TRUE

to subset df (as shown above).
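For larger data, the same two logical vectors can be built without apply() by counting TRUE values per row with rowSums. A sketch using the same df as above (the names gt, all_rows and any_rows are just for illustration):

```r
df <- data.frame(id = c(1:5), v1 = c(0, 15, 9, 12, 7), v2 = c(9, 32, 6, 17, 11))

# Logical matrix: is each value (ignoring id) greater than 10?
gt <- df[, -1] > 10

# ALL columns greater than 10: every comparison in the row is TRUE
all_rows <- rowSums(gt) == ncol(gt)

# ANY column greater than 10: at least one comparison in the row is TRUE
any_rows <- rowSums(gt) > 0

df[all_rows, ]
df[any_rows, ]
```

These give the same rows as Part 1 and Part 2 respectively.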

R select entire columns where at least one value meets a condition

Try this:

> set.seed(007) # for the example being reproducible
> X <- matrix(rnorm(100), 20) # generating some data
> X <- cbind(X, runif(20, max=.48)) # generating a column with all values < 0.5
> colnames(X) <- paste('col', 1:ncol(X), sep='') # some column names
> X # this is what the matrix looks like
              col1        col2         col3        col4         col5        col6
 [1,]  2.287247161  0.83975036  1.218550535  0.07637147  0.342585350 0.335107187
 [2,] -1.196771682  0.70534183 -0.699317079  0.15915528  0.004248236 0.419502015
 [3,] -0.694292510  1.30596472 -0.285432752  0.54367418  0.029219842 0.346358090
 [4,] -0.412292951 -1.38799622 -1.311552673  0.70480735 -0.393423429 0.212185020
 [5,] -0.970673341  1.27291686 -0.391012431  0.31896914 -0.792704563 0.224824248
 [6,] -0.947279945  0.18419277 -0.401526613  1.10924979 -0.311701865 0.415837389
 [7,]  0.748139340  0.75227990  1.350517581  0.76915419 -0.346068592 0.057660111
 [8,] -0.116955226  0.59174505  0.591190027  1.15347367 -0.304607588 0.007812921
 [9,]  0.152657626 -0.98305260  0.100525456  1.26068350 -1.785893487 0.298192099
[10,]  2.189978107 -0.27606396  0.931071996  0.70062351  0.587274672 0.216225091
[11,]  0.356986230 -0.87085102 -0.262742349  0.43262716  1.635794434 0.026097800
[12,]  2.716751783  0.71871055 -0.007668105 -0.92260172 -0.645423474 0.190567072
[13,]  2.281451926  0.11065288  0.367153007 -0.61558421  0.618992169 0.402829397
[14,]  0.324020540 -0.07846677  1.707162545 -0.86665969  0.236393598 0.248196976
[15,]  1.896067067 -0.42049046  0.723740263 -1.63951709  0.846500899 0.406511129
[16,]  0.467680511 -0.56212588  0.481036049 -1.32583924 -0.573645739 0.162457572
[17,] -0.893800723  0.99751344 -1.567868244 -0.88903673  1.117993204 0.383801555
[18,] -0.307328300 -1.10513006  0.318250283 -0.55760233 -1.540001132 0.347037954
[19,] -0.004822422 -0.14228783  0.165991451 -0.06240231 -0.438123899 0.262938992
[20,]  0.988164149  0.31499490 -0.899907630  2.42269298 -0.150672971 0.139233120
> 
> # defining an index for selecting the columns where the condition is met
> ind <- apply(X, 2, function(x) any(abs(x) > 0.5))
> X[, ind] # col6 only has values below 0.5 in absolute value, so it is not taken
              col1        col2         col3        col4         col5
 [1,]  2.287247161  0.83975036  1.218550535  0.07637147  0.342585350
 [2,] -1.196771682  0.70534183 -0.699317079  0.15915528  0.004248236
 [3,] -0.694292510  1.30596472 -0.285432752  0.54367418  0.029219842
 [4,] -0.412292951 -1.38799622 -1.311552673  0.70480735 -0.393423429
 [5,] -0.970673341  1.27291686 -0.391012431  0.31896914 -0.792704563
 [6,] -0.947279945  0.18419277 -0.401526613  1.10924979 -0.311701865
 [7,]  0.748139340  0.75227990  1.350517581  0.76915419 -0.346068592
 [8,] -0.116955226  0.59174505  0.591190027  1.15347367 -0.304607588
 [9,]  0.152657626 -0.98305260  0.100525456  1.26068350 -1.785893487
[10,]  2.189978107 -0.27606396  0.931071996  0.70062351  0.587274672
[11,]  0.356986230 -0.87085102 -0.262742349  0.43262716  1.635794434
[12,]  2.716751783  0.71871055 -0.007668105 -0.92260172 -0.645423474
[13,]  2.281451926  0.11065288  0.367153007 -0.61558421  0.618992169
[14,]  0.324020540 -0.07846677  1.707162545 -0.86665969  0.236393598
[15,]  1.896067067 -0.42049046  0.723740263 -1.63951709  0.846500899
[16,]  0.467680511 -0.56212588  0.481036049 -1.32583924 -0.573645739
[17,] -0.893800723  0.99751344 -1.567868244 -0.88903673  1.117993204
[18,] -0.307328300 -1.10513006  0.318250283 -0.55760233 -1.540001132
[19,] -0.004822422 -0.14228783  0.165991451 -0.06240231 -0.438123899
[20,]  0.988164149  0.31499490 -0.899907630  2.42269298 -0.150672971

# It could be done just in one step avoiding 'ind'
X[, apply(X, 2, function(X) any(abs(X)>0.5))]
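A vectorised variant of the same selection uses colSums on the logical matrix instead of apply(). This is a sketch that rebuilds the example matrix; drop = FALSE is added as a precaution so the result stays a matrix even if only one column qualifies:

```r
set.seed(7)
X <- matrix(rnorm(100), 20)
X <- cbind(X, runif(20, max = 0.48))   # last column has all values < 0.5
colnames(X) <- paste0("col", seq_len(ncol(X)))

# apply() version from above, for comparison
ind_apply <- apply(X, 2, function(x) any(abs(x) > 0.5))

# colSums version: count values exceeding 0.5 in absolute value per column
ind_sums <- colSums(abs(X) > 0.5) > 0

# Keep the qualifying columns
X_sel <- X[, ind_sums, drop = FALSE]
```

Both index vectors agree, and col6 is excluded because none of its values can exceed 0.48.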

Filter data but keep at least one row for each ID

You can use tidyr::complete():

library(dplyr)
df %>%
  filter(col1 == 1 | col2 == 1) %>%
  tidyr::complete(id = df$id, fill = list(col3 = "-"))

# # A tibble: 4 × 4
#   id     col1  col2 col3 
#   <chr> <dbl> <dbl> <chr>
# 1 a         1     0 A    
# 2 a         1     1 B    
# 3 b        NA    NA -    
# 4 c         0     1 E
