R keep rows with at least one column greater than value
You can use rowSums to construct the condition in base R:
df[rowSums(df > 10) >= 1, ]
With dplyr (0.7.0 or later), you can use filter_all like this:
library(dplyr)
filter_all(df, any_vars(. > 10))
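One thing the base R condition does not cover is missing values. The sketch below uses a small hypothetical data frame (the column names a, b and c are made up for illustration) to show that a row containing NA produces an NA row sum, and that na.rm = TRUE is one way to ignore those comparisons.

```r
# Hypothetical all-numeric data frame with one missing value
df <- data.frame(
  a = c(1, 12, 3, NA),
  b = c(5, 2, 8, 20),
  c = c(9, 4, 11, 1)
)

# Without na.rm, the row that contains NA gets an NA count
counts_raw <- rowSums(df > 10)

# With na.rm = TRUE, missing comparisons are ignored
keep <- rowSums(df > 10, na.rm = TRUE) >= 1
df[keep, ]
```

Whether a row with missing values should be kept is a judgment call; na.rm = TRUE keeps it whenever any non-missing column passes the test.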
Subsetting Rows with a Column Value Greater than a Threshold
We can use rowSums
data[rowSums(data[5:70] > 7) > 0, ]
Or with subset
subset(data, rowSums(data[5:70] > 7) > 0)
We can also use filter_at from dplyr with any_vars
library(dplyr)
data %>% filter_at(vars(5:70), any_vars(. > 7))
Using reproducible data from mtcars (borrowing the idea from @Maurits Evers)
mtcars[rowSums(mtcars[3:11] > 300) > 0, ]
#                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Hornet Sportabout   18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
#Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
#Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
#Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
#Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
#AMC Javelin         15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
#Pontiac Firebird    19.2   8  400 175 3.08 3.845 17.05  0  0    3    2
#Ford Pantera L      15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
#Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
Using filter_at also gives the same output
mtcars %>% filter_at(vars(3:11), any_vars(. > 300))
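The scoped verbs such as filter_at and filter_all are superseded in recent dplyr releases. As a sketch, assuming dplyr 1.0.4 or later is installed, the same filter can be written with if_any(); the names res_new and res_base are only for illustration.

```r
library(dplyr)

# if_any() needs dplyr 1.0.4 or later (assumption about the installed version)
res_new <- mtcars %>% filter(if_any(3:11, ~ .x > 300))

# Base R version from above, for comparison
res_base <- mtcars[rowSums(mtcars[3:11] > 300) > 0, ]

# Both approaches select the same number of rows
nrow(res_new) == nrow(res_base)
```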
How to filter the rows that have at least one column greater than a threshold?
We can use rowSums directly
df[rowSums(df[2:4] >= 0.5) > 0, ]
#  Name  Clust1  Clust2   Clust3
#2   BB 0.76946 0.03242 0.029358
#3   CC 0.10990 0.52171 0.283859
Or a dplyr version with filter_at and any_vars
library(dplyr)
df %>%
  filter_at(vars(starts_with("Clust")), any_vars(. >= 0.5))
As far as fixing your code is concerned, as mentioned by @thelatemail, you are including column 1 (the Name column) in rowSums, so you want to subset on columns 2:4 instead. We can also filter directly instead of creating a new variable with mutate, so the following should work.
df %>% filter(rowSums(.[,c(2:4)] >= 0.5) > 0)
We can also use an apply version, which would be slow for larger datasets
df[apply(df[2:4] >= 0.5, 1, any), ]
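To avoid hard-coding 2:4, the numeric columns can be detected programmatically. This is a minimal sketch on hypothetical data shaped like the example above (the values for row AA are invented, and num_cols, by_rowsums and by_apply are illustrative names); it also checks that the rowSums and apply versions flag the same rows.

```r
# Hypothetical data in the same shape as the example above
df <- data.frame(
  Name   = c("AA", "BB", "CC"),
  Clust1 = c(0.10, 0.77, 0.11),
  Clust2 = c(0.20, 0.03, 0.52),
  Clust3 = c(0.30, 0.03, 0.28),
  stringsAsFactors = FALSE
)

# Flag numeric columns so the Name column never enters the comparison
num_cols <- vapply(df, is.numeric, logical(1))

by_rowsums <- rowSums(df[num_cols] >= 0.5) > 0
by_apply   <- apply(df[num_cols] >= 0.5, 1, any)

# Both approaches flag the same rows
identical(unname(by_rowsums), unname(by_apply))
df[by_rowsums, ]
```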
Keep only rows if number is greater than... in specific column
Not entirely sure I understood your problem statement correctly, but perhaps something like this
library(dplyr)
library(stringr)
exp_data %>% filter(str_detect(Change, "\\d{3}"))
#       Seq          Change     Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg      2
Or the same in base R
exp_data[grep("\\d{3}", exp_data$Change), ]
#       Seq          Change     Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg      2
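The idea is to use a regular expression to keep only those rows where Change contains at least one run of three consecutive digits. Note that this is a text match, not a numeric comparison, so it also matches numbers with more than three digits.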
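If the intent is a true numeric threshold rather than a digit count, one option is to extract the numbers and compare them. The sketch below is base R only and runs on hypothetical data modelled on the example (the second and third rows and the cutoff of 100 are assumptions for illustration).

```r
# Hypothetical data modelled on the example above
exp_data <- data.frame(
  Seq    = c("AAAARVDS", "BBBB", "CCCC"),
  Change = c("M[+486], W[+12]", "M[+12]", "W[+99], M[+7]"),
  stringsAsFactors = FALSE
)

# Extract every run of digits from each Change string (one list element per row)
nums <- regmatches(exp_data$Change, gregexpr("[0-9]+", exp_data$Change))

# Keep a row if any extracted number is at least 100
keep <- vapply(nums, function(x) any(as.numeric(x) >= 100), logical(1))
exp_data[keep, ]
```

Unlike the regular expression alone, this lets the cutoff be any value, for example 250, without rewriting the pattern.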
filter rows that have X number of columns with value greater than Y
A simple and quick way is to use base R rowSums, keeping those rows that have a value greater than 5.5 in more than one column.
df[rowSums(df > 5.5) > 1, ]
#                    tumor     tumor     tumor     tumor     tumor
#A_33_P3390097    5.698576  5.797294  5.671845  5.961686  5.751165
#GE_BrightCorner 15.596546 15.833930 16.165919 15.274273 16.018045
#NM_001166137     6.432376  6.062449  6.674411  6.475263  6.856038
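The same idea generalises to "at least X columns above Y". Here is a small sketch that wraps it in a function; keep_rows_min_cols is a hypothetical helper name, and the data frame is made up for illustration.

```r
# Hypothetical helper: keep rows where at least `min_cols` columns exceed `threshold`
keep_rows_min_cols <- function(data, threshold, min_cols) {
  hits <- rowSums(data > threshold, na.rm = TRUE)  # columns above threshold, per row
  data[hits >= min_cols, , drop = FALSE]
}

# Hypothetical all-numeric data with row names
df <- data.frame(
  s1 = c(5.7, 4.0, 6.4),
  s2 = c(5.8, 5.9, 6.0),
  s3 = c(5.6, 3.2, 6.6),
  row.names = c("g1", "g2", "g3")
)

res <- keep_rows_min_cols(df, threshold = 5.5, min_cols = 2)
res
```

The condition `> 1` used above corresponds to min_cols = 2 in this sketch.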
How to select column values based on a greater than condition in row values
We can create a logical matrix by comparing the data frame with 3, then take the sum of each column using colSums and select only those columns which have at least one value greater than 3.
mtcars[colSums(mtcars > 3) > 0]
#                mpg cyl  disp  hp drat    wt  qsec gear carb
#Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46    4    4
#Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02    4    4
#Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61    4    1
#Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44    3    1
#....
Variation using sapply
mtcars[sapply(mtcars, function(x) any(x > 3))]
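To make the intermediate steps visible, the colSums version can be unpacked as in the sketch below, which also checks that it agrees with the sapply variation. The variable names above, counts, by_colsums and by_sapply are only illustrative.

```r
above  <- mtcars > 3       # logical matrix, same shape as mtcars
counts <- colSums(above)   # number of values above 3 in each column

# vs and am only hold 0/1, so their counts are 0 and they are dropped
counts[c("vs", "am")]

by_colsums <- mtcars[counts > 0]
by_sapply  <- mtcars[sapply(mtcars, function(x) any(x > 3))]

# Both approaches keep the same columns
identical(names(by_colsums), names(by_sapply))
```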
subset rows with (1) ALL and (2) ANY columns larger than a specific value
See the functions all() and any() for the first and second parts of your question, respectively. The apply() function can be used to run functions over rows or columns (MARGIN = 1 is rows, MARGIN = 2 is columns, etc.). Note that I use apply() on df[, -1] to ignore the id variable when doing the comparisons.
Part 1:
> df <- data.frame(id=c(1:5), v1=c(0,15,9,12,7), v2=c(9,32,6,17,11))
> df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ]
  id v1 v2
2  2 15 32
4  4 12 17
Part 2:
> df[apply(df[, -1], MARGIN = 1, function(x) any(x > 10)), ]
  id v1 v2
2  2 15 32
4  4 12 17
5  5  7 11
To see what is going on, x > 10 returns a logical vector for each row (via apply()) indicating whether each element is greater than 10. all() returns TRUE if all elements of the input vector are TRUE and FALSE otherwise. any() returns TRUE if any of the elements in the input is TRUE and FALSE if all are FALSE.
I then use the logical vector resulting from the apply() call
> apply(df[, -1], MARGIN = 1, function(x) all(x > 10))
[1] FALSE  TRUE FALSE  TRUE FALSE
> apply(df[, -1], MARGIN = 1, function(x) any(x > 10))
[1] FALSE  TRUE FALSE  TRUE  TRUE
to subset df (as shown above).
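For larger data frames, a vectorised alternative to apply() is to count the hits per row with rowSums and compare that count to the number of tested columns. This is a sketch using the same df as above; hits, all_rows and any_rows are illustrative names, and it assumes there are no missing values.

```r
df <- data.frame(id = 1:5, v1 = c(0, 15, 9, 12, 7), v2 = c(9, 32, 6, 17, 11))

# Number of non-id columns greater than 10, per row
hits <- rowSums(df[, -1] > 10)

# ALL columns larger than 10: the count equals the number of tested columns
all_rows <- df[hits == ncol(df) - 1, ]

# ANY column larger than 10: the count is at least one
any_rows <- df[hits > 0, ]
```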
R select entire columns where at least one value meets a condition
Try this:
> set.seed(007) # make the example reproducible
> X <- matrix(rnorm(100), 20) # generate some data
> X <- cbind(X, runif(20, max=.48)) # add a column with all values < 0.5
> colnames(X) <- paste('col', 1:ncol(X), sep='') # some column names
> X # this is what the matrix looks like
col1 col2 col3 col4 col5 col6
[1,] 2.287247161 0.83975036 1.218550535 0.07637147 0.342585350 0.335107187
[2,] -1.196771682 0.70534183 -0.699317079 0.15915528 0.004248236 0.419502015
[3,] -0.694292510 1.30596472 -0.285432752 0.54367418 0.029219842 0.346358090
[4,] -0.412292951 -1.38799622 -1.311552673 0.70480735 -0.393423429 0.212185020
[5,] -0.970673341 1.27291686 -0.391012431 0.31896914 -0.792704563 0.224824248
[6,] -0.947279945 0.18419277 -0.401526613 1.10924979 -0.311701865 0.415837389
[7,] 0.748139340 0.75227990 1.350517581 0.76915419 -0.346068592 0.057660111
[8,] -0.116955226 0.59174505 0.591190027 1.15347367 -0.304607588 0.007812921
[9,] 0.152657626 -0.98305260 0.100525456 1.26068350 -1.785893487 0.298192099
[10,] 2.189978107 -0.27606396 0.931071996 0.70062351 0.587274672 0.216225091
[11,] 0.356986230 -0.87085102 -0.262742349 0.43262716 1.635794434 0.026097800
[12,] 2.716751783 0.71871055 -0.007668105 -0.92260172 -0.645423474 0.190567072
[13,] 2.281451926 0.11065288 0.367153007 -0.61558421 0.618992169 0.402829397
[14,] 0.324020540 -0.07846677 1.707162545 -0.86665969 0.236393598 0.248196976
[15,] 1.896067067 -0.42049046 0.723740263 -1.63951709 0.846500899 0.406511129
[16,] 0.467680511 -0.56212588 0.481036049 -1.32583924 -0.573645739 0.162457572
[17,] -0.893800723 0.99751344 -1.567868244 -0.88903673 1.117993204 0.383801555
[18,] -0.307328300 -1.10513006 0.318250283 -0.55760233 -1.540001132 0.347037954
[19,] -0.004822422 -0.14228783 0.165991451 -0.06240231 -0.438123899 0.262938992
[20,] 0.988164149 0.31499490 -0.899907630 2.42269298 -0.150672971 0.139233120
>
> # define an index that flags the columns where the condition is met
> ind <- apply(X, 2, function(x) any(abs(x) > 0.5))
> X[, ind] # col6 only has values less than 0.5, so it is not selected
col1 col2 col3 col4 col5
[1,] 2.287247161 0.83975036 1.218550535 0.07637147 0.342585350
[2,] -1.196771682 0.70534183 -0.699317079 0.15915528 0.004248236
[3,] -0.694292510 1.30596472 -0.285432752 0.54367418 0.029219842
[4,] -0.412292951 -1.38799622 -1.311552673 0.70480735 -0.393423429
[5,] -0.970673341 1.27291686 -0.391012431 0.31896914 -0.792704563
[6,] -0.947279945 0.18419277 -0.401526613 1.10924979 -0.311701865
[7,] 0.748139340 0.75227990 1.350517581 0.76915419 -0.346068592
[8,] -0.116955226 0.59174505 0.591190027 1.15347367 -0.304607588
[9,] 0.152657626 -0.98305260 0.100525456 1.26068350 -1.785893487
[10,] 2.189978107 -0.27606396 0.931071996 0.70062351 0.587274672
[11,] 0.356986230 -0.87085102 -0.262742349 0.43262716 1.635794434
[12,] 2.716751783 0.71871055 -0.007668105 -0.92260172 -0.645423474
[13,] 2.281451926 0.11065288 0.367153007 -0.61558421 0.618992169
[14,] 0.324020540 -0.07846677 1.707162545 -0.86665969 0.236393598
[15,] 1.896067067 -0.42049046 0.723740263 -1.63951709 0.846500899
[16,] 0.467680511 -0.56212588 0.481036049 -1.32583924 -0.573645739
[17,] -0.893800723 0.99751344 -1.567868244 -0.88903673 1.117993204
[18,] -0.307328300 -1.10513006 0.318250283 -0.55760233 -1.540001132
[19,] -0.004822422 -0.14228783 0.165991451 -0.06240231 -0.438123899
[20,] 0.988164149 0.31499490 -0.899907630 2.42269298 -0.150672971
# It could be done just in one step avoiding 'ind'
X[, apply(X, 2, function(X) any(abs(X)>0.5))]
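One caveat with this kind of matrix subsetting: if only a single column meets the condition, the result is simplified to a plain vector unless drop = FALSE is given. The sketch below shows this on a small hypothetical matrix and uses colSums as a vectorised stand-in for the apply() call.

```r
# Hypothetical 3 x 3 matrix where only col1 has an absolute value above 0.5
X <- cbind(
  col1 = c(0.1, -0.9, 0.2),
  col2 = c(0.3, 0.1, -0.2),
  col3 = c(0.4, 0.2, 0.1)
)

# TRUE for columns with at least one |value| > 0.5
ind <- colSums(abs(X) > 0.5) > 0

as_vector <- X[, ind]                # simplified to a numeric vector
as_matrix <- X[, ind, drop = FALSE]  # stays a 3 x 1 matrix
```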
Filter data but keep at least one row for each ID
You can use tidyr::complete():
df %>%
  filter(col1 == 1 | col2 == 1) %>%
  tidyr::complete(id = df$id, fill = list(col3 = "-"))
# # A tibble: 4 × 4
#   id     col1  col2 col3
#   <chr> <dbl> <dbl> <chr>
# 1 a         1     0 A
# 2 a         1     1 B
# 3 b        NA    NA -
# 4 c         0     1 E
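For readers without tidyr, roughly the same result can be built in base R. This is a sketch of a different technique (filter, then append one placeholder row per missing ID), not the complete() approach itself; the input df is hypothetical, chosen so that the result matches the output shown above.

```r
# Hypothetical input chosen to reproduce the output shown above
df <- data.frame(
  id   = c("a", "a", "b", "b", "c"),
  col1 = c(1, 1, 0, 0, 0),
  col2 = c(0, 1, 0, 0, 1),
  col3 = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

kept <- df[df$col1 == 1 | df$col2 == 1, ]

# IDs that lost every row in the filter
missing_ids <- setdiff(unique(df$id), kept$id)
n <- length(missing_ids)

# One placeholder row per missing ID
filler <- data.frame(
  id   = missing_ids,
  col1 = rep(NA_real_, n),
  col2 = rep(NA_real_, n),
  col3 = rep("-", n),
  stringsAsFactors = FALSE
)

out <- rbind(kept, filler)
out <- out[order(out$id), ]  # order() keeps ties in their original order
rownames(out) <- NULL
out
```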