How to Preserve Base Data Frame Rownames Upon Filtering in Dplyr Chain

You can convert the row names to a column and convert them back after filtering:

library(dplyr)
library(tibble) # for `rownames_to_column` and `column_to_rownames`

df %>%
  rownames_to_column('gene') %>%
  filter_if(is.numeric, all_vars(. >= 8)) %>%
  column_to_rownames('gene')

#        BoneMarrow Pulmonary
# ATP1B1         30      3380
# PRR11        2703        27
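
The scoped verb `filter_if` is superseded in current dplyr. The following is a minimal sketch of the same round trip with `if_all`, assuming dplyr 1.0.4 or later. The data frame is illustrative: the first two rows reuse the values printed above, and the row `GENE_X` is made up so that one row fails the filter.

```r
library(dplyr)
library(tibble)

# Illustrative data: "GENE_X" is a hypothetical row that should be filtered out
df <- data.frame(
  BoneMarrow = c(30, 2703, 5),
  Pulmonary  = c(3380, 27, 900),
  row.names  = c("ATP1B1", "PRR11", "GENE_X")
)

res <- df %>%
  rownames_to_column("gene") %>%                    # row names -> column "gene"
  filter(if_all(where(is.numeric), ~ .x >= 8)) %>%  # every numeric column >= 8
  column_to_rownames("gene")                        # column "gene" -> row names

rownames(res)
```

The temporary column name `gene` is arbitrary; any name not already used in the data frame should work.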

How to mutate columns but keep rownames in R pipe?

That is because mutate, and dplyr in general, renumbers the row names starting from 1 after an operation, so the original row names are not kept.

If you need them for further manipulation, store them as a column.

library(dplyr)

iris %>%
  .[which(as.numeric(rownames(.)) %% 3 != 0), ] %>%
  mutate(row = rownames(.),
         Sepal.Length = Sepal.Length + 1) %>%
  pull(row)

# [1] "1" "2" "4" "5" "7" "8" "10" "11" "13" "14" "16" "17" "19" "20" "22" "23" "25" "26"
# [19] "28" "29" "31" "32" "34" "35" "37" "38" "40" "41" "43" "44" "46" "47" "49" "50" "52" "53"
# [37] "55" "56" "58" "59" "61" "62" "64" "65" "67" "68" "70" "71" "73" "74" "76" "77" "79" "80"
# [55] "82" "83" "85" "86" "88" "89" "91" "92" "94" "95" "97" "98" "100" "101" "103" "104" "106" "107"
# [73] "109" "110" "112" "113" "115" "116" "118" "119" "121" "122" "124" "125" "127" "128" "130" "131" "133" "134"
# [91] "136" "137" "139" "140" "142" "143" "145" "146" "148" "149"

how can I avoid rowSums() dropping rownames?

dplyr (and the tidyverse in general) does not support row names.

One way to preserve row names is to add them as a new column, perform the data manipulation you want, and then move them back to row names.

library(dplyr)
library(tibble)

x %>%
  rownames_to_column() %>%
  mutate(Total = rowSums(.[-1])) %>%
  column_to_rownames()

#   x1 x2 Total
# a  1  2     3
# b  0  4     4
# c  2  5     7
# d  3  0     3
# e  4  9    13
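
If the pipe is not essential, a base R alternative may be simpler, because plain column assignment leaves row names alone. This is only a sketch: `x` is rebuilt here from the printed output above, and the column names `x1` and `x2` are taken from that table.

```r
# x is reconstructed from the output shown above
x <- data.frame(
  x1 = c(1, 0, 2, 3, 4),
  x2 = c(2, 4, 5, 0, 9),
  row.names = c("a", "b", "c", "d", "e")
)

# Assigning a new column does not touch the row names
x$Total <- rowSums(x[c("x1", "x2")])

rownames(x)
```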

Filter rows in dplyr chain if a set of rows doesn't contain a specific word

We create a grouping column on the assumption that every four rows form a block (using gl), keep only the groups where the first element of 'name' does not contain _number or _slider, then ungroup and remove the temporary 'grp' column.

library(dplyr)
library(stringr) # for `str_detect`
df %>%
  group_by(grp = as.integer(gl(n(), 4, n()))) %>%
  filter(!str_detect(first(name), "_(number|slider)")) %>%
  ungroup() %>%
  select(-grp)

Update

Based on the OP's comments, blocks are determined by their common prefix. In that case, extract the first word, use it as the grouping variable, and apply the same filter as before:

library(stringr)
df %>%
  group_by(grp = word(name, 1, sep = "_")) %>%
  filter(!str_detect(first(name), "_(number|slider)"))

The ungroup step remains the same as before.

If a prefix can repeat in non-adjacent positions and each run needs to be treated as a separate block, use rleid from data.table to create the grouping variable:

df %>%
  group_by(grp = data.table::rleid(word(name, 1, sep = "_"))) %>%
  filter(!str_detect(first(name), "_(number|slider)"))
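
To make the run-based grouping concrete, here is a small self-contained sketch. The data frame is hypothetical, and `run_id` is a helper written for this example as a base R stand-in for `rleid`, so that data.table is not required.

```r
library(dplyr)
library(stringr)

# Hypothetical data: the prefix "a" appears in two non-adjacent blocks
df <- data.frame(
  name = c("a_number", "a_x", "b_text", "b_y", "a_text", "a_z")
)

# Hypothetical helper: one id per run of consecutive equal values
run_id <- function(x) {
  r <- rle(as.character(x))
  rep(seq_along(r$lengths), times = r$lengths)
}

res <- df %>%
  group_by(grp = run_id(word(name, 1, sep = "_"))) %>%
  filter(!str_detect(first(name), "_(number|slider)")) %>%
  ungroup() %>%
  select(-grp)

res$name
```

Only the first "a" block starts with a `_number` entry, so only that block is dropped; the later "a" block is kept because it forms its own run.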

filtering data.frame based on row_number()

Actually dplyr's slice function is made for this kind of subsetting:

df %>% slice(2:7)

(I'm a little late to the party but thought I'd add this for future readers)

filter for complete cases in data.frame using dplyr (case-wise deletion)

Try this:

df %>% na.omit

or this:

df %>% filter(complete.cases(.))

or this:

library(tidyr)
df %>% drop_na

If you want to filter based on one variable's missingness, use a conditional:

df %>% filter(!is.na(x1))

or

df %>% drop_na(x1)

Other answers indicate that, of the solutions above, na.omit is much slower. That has to be balanced against the fact that it returns the row indices of the omitted rows in the na.action attribute, whereas the other solutions do not.

str(df %>% na.omit)
## 'data.frame': 2 obs. of 2 variables:
## $ x1: num 1 2
## $ x2: num 1 2
## - attr(*, "na.action")= 'omit' Named int 3 4
## ..- attr(*, "names")= chr "3" "4"
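
As a sketch of how that attribute can be used, the omitted rows can be recovered from the original data frame. The input `df` below is an assumption chosen to be consistent with the `str` output above (rows 3 and 4 contain an `NA`); the actual data is not shown in the answer.

```r
# Assumed input: rows 3 and 4 each contain a missing value
df <- data.frame(x1 = c(1, 2, NA, 4), x2 = c(1, 2, 3, NA))

clean <- na.omit(df)

# Positions of the rows that na.omit removed
dropped <- as.integer(attr(clean, "na.action"))

# Look the removed rows up in the original data
df[dropped, ]
```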

ADDED Have updated to reflect latest version of dplyr and comments.

ADDED Have updated to reflect latest version of tidyr and comments.

How to correctly write class methods in R6 and chain them

If you want to chain member functions, you need those member functions to return self. This means that the R6 object has to modify the data it contains. Since the benefit of R6 is to reduce copies, I would probably keep a full copy of the data, and have select_func and filter_func update some row and column indices:

library(R6)

dataFrame <- R6Class("dataFrame",
  public = list(
    data = data.frame(),
    rows = 0,
    columns = 0,
    initialize = function(data) {
      self$data <- data
      self$rows <- seq(nrow(data))
      self$columns <- seq_along(data)
    },
    get_data = function() {self$data[self$columns][self$rows,]},
    select_func = function(cols) {
      if(is.character(cols)) cols <- match(cols, names(self$data))
      self$columns <- cols
      self
    },
    filter_func = function(r) {
      if(is.logical(r)) r <- which(r)
      self$rows <- r
      self
    })
)

This allows us to chain the filter and select methods:

dataFrame$new(iris)$filter_func(1:5)$select_func(1:2)$get_data()
#>   Sepal.Length Sepal.Width
#> 1          5.1         3.5
#> 2          4.9         3.0
#> 3          4.7         3.2
#> 4          4.6         3.1
#> 5          5.0         3.6

and our select method can take names too:

dataFrame$new(mtcars)$select_func(c("mpg", "wt"))$get_data()
#>                      mpg    wt
#> Mazda RX4           21.0 2.620
#> Mazda RX4 Wag       21.0 2.875
#> Datsun 710          22.8 2.320
#> Hornet 4 Drive      21.4 3.215
#> Hornet Sportabout   18.7 3.440
#> Valiant             18.1 3.460
#> Duster 360          14.3 3.570
#> Merc 240D           24.4 3.190
#> Merc 230            22.8 3.150
#> Merc 280            19.2 3.440
#> Merc 280C           17.8 3.440
#> Merc 450SE          16.4 4.070
#> Merc 450SL          17.3 3.730
#> Merc 450SLC         15.2 3.780
#> Cadillac Fleetwood  10.4 5.250
#> Lincoln Continental 10.4 5.424
#> Chrysler Imperial   14.7 5.345
#> Fiat 128            32.4 2.200
#> Honda Civic         30.4 1.615
#> Toyota Corolla      33.9 1.835
#> Toyota Corona       21.5 2.465
#> Dodge Challenger    15.5 3.520
#> AMC Javelin         15.2 3.435
#> Camaro Z28          13.3 3.840
#> Pontiac Firebird    19.2 3.845
#> Fiat X1-9           27.3 1.935
#> Porsche 914-2       26.0 2.140
#> Lotus Europa        30.4 1.513
#> Ford Pantera L      15.8 3.170
#> Ferrari Dino        19.7 2.770
#> Maserati Bora       15.0 3.570
#> Volvo 142E          21.4 2.780

For completeness, you need some type safety, and I would also add a reset method to remove all filtering. This effectively gives you a data frame where the filtering and selecting are non-destructive, which could actually be very useful.

Created on 2022-05-01 by the reprex package (v2.0.1)
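
The type safety and reset method mentioned above are not shown in the answer. One possible sketch follows; the class name `lazyFrame`, the `reset` method, and the specific checks are assumptions made for illustration, not part of the original class.

```r
library(R6)

# Hypothetical variant of the class above with input checks and a reset method
lazyFrame <- R6Class("lazyFrame",
  public = list(
    data = NULL,
    rows = NULL,
    columns = NULL,
    initialize = function(data) {
      stopifnot(is.data.frame(data))
      self$data <- data
      self$reset()
    },
    # Restore the full set of rows and columns
    reset = function() {
      self$rows <- seq_len(nrow(self$data))
      self$columns <- seq_along(self$data)
      invisible(self)
    },
    select_func = function(cols) {
      if (is.character(cols)) cols <- match(cols, names(self$data))
      # Unknown names become NA in match() and are rejected here
      stopifnot(is.numeric(cols), !anyNA(cols),
                all(cols %in% seq_along(self$data)))
      self$columns <- cols
      invisible(self)
    },
    filter_func = function(r) {
      if (is.logical(r)) r <- which(r)
      stopifnot(is.numeric(r), all(r %in% seq_len(nrow(self$data))))
      self$rows <- r
      invisible(self)
    },
    get_data = function() {
      # drop = FALSE keeps a data frame even when one column is selected
      self$data[self$rows, self$columns, drop = FALSE]
    }
  )
)

obj   <- lazyFrame$new(iris)
small <- obj$filter_func(1:5)$select_func("Species")$get_data()
full  <- obj$reset()$get_data()
```

Returning `invisible(self)` still allows chaining; it only suppresses printing the object when a chain ends on a modifying method.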

R: Select rows by value and always include previous row

Use which to create a position index of the rows where 'time' is 13, subtract 1 from that index to get the preceding rows, and concatenate both to subset:

i1 <- which(df1$time == 13) 
ind <- sort(unique(i1 - rep(c(1, 0), each = length(i1))))
ind <- ind[ind >0]
df1[ind,]

-output

  ID speed dist time
2  B     7   10    8
3  C     7   18   13
4  C     8    4    5
5  A     5    6   13
6  D     6    2   13

data

df1 <- structure(list(ID = c("A", "B", "C", "C", "A", "D", "E"), speed = c(4L, 
7L, 7L, 8L, 5L, 6L, 7L), dist = c(12L, 10L, 18L, 4L, 6L, 2L,
2L), time = c(4L, 8L, 13L, 5L, 13L, 13L, 9L)),
class = "data.frame", row.names = c(NA,
-7L))
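
An alternative sketch of the same idea uses a shifted logical vector instead of position arithmetic. The data is the same as above, written with `data.frame` for readability; the names `hit`, `prev`, and `res` are only illustrative.

```r
# Same data as above
df1 <- data.frame(
  ID    = c("A", "B", "C", "C", "A", "D", "E"),
  speed = c(4L, 7L, 7L, 8L, 5L, 6L, 7L),
  dist  = c(12L, 10L, 18L, 4L, 6L, 2L, 2L),
  time  = c(4L, 8L, 13L, 5L, 13L, 13L, 9L)
)

hit  <- df1$time == 13      # rows where time is 13
prev <- c(hit[-1], FALSE)   # TRUE where the following row is a hit
res  <- df1[hit | prev, ]   # keep each hit and the row before it

res
```

Because the shift pads with `FALSE` at the end, a match in the first row needs no special handling, unlike the index version, which has to drop position 0.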

