How Does One Stop Using Rowwise in Dplyr

How does one stop using rowwise in dplyr?

As found in the comments and the other answer, the correct way of doing this is to use ungroup().

The operation rowwise(df) sets one of the classes of df to be rowwise_df. We can see the methods on this class by examining the code here, which gives the following ungroup method:

#' @export
ungroup.rowwise_df <- function(x) {
  class(x) <- c("tbl_df", "data.frame")
  x
}

So we see that ungroup() is not strictly removing a grouped structure; instead, it just removes the rowwise_df class added by the rowwise() function.
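As a quick sketch (using a made-up toy tibble, not data from the question), you can confirm this by inspecting the class before and after ungroup(); the exact class vector depends on your dplyr version:

library(dplyr)

toy <- tibble(x = 1:3, y = 4:6)

class(rowwise(toy))
# e.g. "rowwise_df" "tbl_df" "tbl" "data.frame"

class(ungroup(rowwise(toy)))
# e.g. "tbl_df" "tbl" "data.frame"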

switch between row wise and normal column wise in dplyr

Return a named output, which avoids the warnings. One way to do that is with setNames.

library(dplyr)

df %>%
  rowwise() %>%
  mutate(output = list(setNames(isoreg(c_across(year1:year4))$yf,
                                paste0('col', 1:4)))) %>%
  tidyr::unnest_wider(output) %>%
  select(-starts_with('year'))


# A tibble: 6 x 5
#      id  col1  col2  col3  col4
#   <int> <dbl> <dbl> <dbl> <dbl>
# 1     1    14    14    30    40
# 2     2    13    13    31    41
# 3     3    12    12    32    42
# 4     4    11    11    33    43
# 5     5    10    10    34    44
# 6     6     9     9    35    45

How do I process an entire row when using rowwise()?

Use the c_across function to help. For example

frm %>%
  rowwise() %>%
  summarize(all_values_missing = all(is.na(c_across())))

If you only need a subset of columns, c_across() accepts tidy selectors as well.
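For instance, a short sketch restricted to a subset of columns (the names x1:x3 are assumed here, not taken from the question):

frm %>%
  rowwise() %>%
  summarize(x_values_missing = all(is.na(c_across(x1:x3))))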

Rowwise operations across columns

For a general solution, add rowwise():

library(dplyr)

data.frame(a = c(1:5, 6:10),
           b = c(6:10, 1:5)) %>%
  rowwise() %>%
  mutate(MAX_COLUMN = max(c_across(a:b)))

#        a     b MAX_COLUMN
#    <int> <int>      <int>
#  1     1     6          6
#  2     2     7          7
#  3     3     8          8
#  4     4     9          9
#  5     5    10         10
#  6     6     1          6
#  7     7     2          7
#  8     8     3          8
#  9     9     4          9
# 10    10     5         10

If you want to take the max, a faster option would be pmax with do.call:

data.frame(a = c(1:5, 6:10),
           b = c(6:10, 1:5)) %>%
  mutate(MAX_COLUMN = do.call(pmax, .))

R group rows conditional by rowwise comparisons in a scalable way

Here are two solutions with dplyr and data.table respectively. Each package vectorizes its operations, so these solutions should be far faster than your loop; and the data.table solution should be the fastest of them all.

Let me know how each solution works for you!

Note

To identify the group to which each row belongs, we use the earliest row that it "matches", where "matching" rows are defined as those that

  1. share the same value in either start<>end, end<>start, start<>start, or end<>end; and
  2. have a matching value (> 0) in the related start_sep and end_sep columns.

For a smaller dataset, it would be simple enough to perform a CROSS JOIN and then filter by your criteria. However, for a dataset with over 1 million rows, its CROSS JOIN would easily max out the available memory at over 1 trillion rows, so I had to find a different technique.

To wit, I use paste0() to generate "artificial" keys. Here start and start_sep are combined into start_label, while end and end_sep are combined into end_label. Now we can directly match() on a single column like start_label; rather than sifting every possible match across a set of columns like {start, start_sep}.

This approach assumes that in those * and *_sep columns:

  1. every distinct value can be represented as a distinct string;
  2. the separator "|" is absent from that string.
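
As a small illustration of the labeling idea (toy label vectors, not the question's data), match() returns the position of the first occurrence of each label, which is exactly how the "earliest matching row" is identified:

start_label <- c("A | 1", "B | 0", "D | 1", "D | 1")
end_label   <- c("F | 1", "G | 0", "H | 0", "A | 1")

match(start_label, start_label)
# [1] 1 2 3 3
match(start_label, end_label)
# [1] 4 NA NA NA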

Solution 1: dplyr

Once you load dplyr

library(dplyr)


# ...
# Code to generate 'df'.
# ...

this workflow should do the trick. Note that group IDs must be calculated before the JOIN, since cur_group_id() would otherwise "misidentify" the NAs as a group unto themselves.

df %>%
  mutate(
    # Create an artificial key for matching.
    start_label = paste0(start, " | ", start_sep),
    end_label   = paste0(end,   " | ", end_sep  ),

    # Identify the earliest row where each match is found.
    start_to_start = match(start_label, start_label),
    start_to_end   = match(start_label, end_label  ),
    end_to_start   = match(end_label,   start_label),
    end_to_end     = match(end_label,   end_label  )
  ) %>%

  # Include only rows meeting the criteria: remove any...
  filter(
    # ...without a match...
    #                 |-------------------------------------------|
    (start_sep > 0 & !(is.na(start_to_start) & is.na(start_to_end))) |
    (end_sep   > 0 & !(is.na(end_to_start  ) & is.na(end_to_end  )))
    # |-----------|
    # ...that corresponds to a positive '*_sep'.
  ) %>%

  # For each row, identify the earliest of ALL its matches.
  mutate(
    match_id = pmin(
      start_to_start, start_to_end, end_to_start, end_to_end,
      na.rm = TRUE
    )
  ) %>%

  # Keep only the 'id' of each row, along with a 'group_id' for its earliest match.
  group_by(match_id) %>%
  transmute(
    id,
    group_id = cur_group_id()
  ) %>%
  ungroup() %>%

  # Map the original rows to their 'group_id's; with blanks (NAs) for no match.
  right_join(df, by = "id") %>%

  # Format into final form.
  select(id, start, start_sep, end, end_sep, group_id) %>%
  arrange(id)

Results

Please note that your sample data is inconsistent, so I have reconstructed my own df:

df <- structure(
  list(
    id        = 1:9,
    start     = c("A", "B", "D", "D", "E", "F", "A", "O", "A"),
    start_sep = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L),
    end       = c("F", "G", "H", "J", "K", "L", "O", "P", "P"),
    end_sep   = c(1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)
  ),
  class = "data.frame",
  row.names = c(NA, -9L)
)

Given said df, the workflow should yield the following tibble:

# A tibble: 9 x 6
     id start start_sep end   end_sep group_id
  <int> <chr>     <int> <chr>   <int>    <int>
1     1 A             1 F           1        1
2     2 B             0 G           0       NA
3     3 D             1 H           0        2
4     4 D             1 J           0        2
5     5 E             0 K           0       NA
6     6 F             1 L           0        1
7     7 A             0 O           1        3
8     8 O             1 P           0        3
9     9 A             1 P           0        1

Solution 2: data.table

Here is essentially the same logic, but implemented in data.table.

library(data.table)


# ...
# Code to generate 'df'.
# ...


# Convert 'df' to a data.table.
df <- as.data.table(df)

Again, note that group IDs must be calculated before the JOIN, since .GRP would otherwise "misidentify" the NAs as a group unto themselves.

# Use 'id' as the key for efficient JOINs.
setkey(df, id

# Calculate the label and matching columns as before.
)[, c("start_label", "end_label") := .(
  paste0(start, " | ", start_sep),
  paste0(end,   " | ", end_sep  )
)][, c("start_to_start", "start_to_end", "end_to_start", "end_to_end") := .(
  match(start_label, start_label),
  match(start_label, end_label  ),
  match(end_label,   start_label),
  match(end_label,   end_label  )

# Filter by criteria as before.
)][
  (start_sep > 0 & !(is.na(start_to_start) & is.na(start_to_end))) |
  (end_sep   > 0 & !(is.na(end_to_start  ) & is.na(end_to_end  )))

# Generate the 'group_id' as before.
,][, .(id, match_id = pmin(
  start_to_start, start_to_end, end_to_start, end_to_end,
  na.rm = TRUE
))][,
  ("group_id") := .GRP,
  by = .(match_id)

# Perform the mapping (RIGHT JOIN) as before...
][
  df,
  # ...and select the desired columns.
  .(id, start, start_sep, end, end_sep, group_id)
]

Results

With df as before, this solution should yield the following data.table:

   id start start_sep end end_sep group_id
1:  1     A         1   F       1        1
2:  2     B         0   G       0       NA
3:  3     D         1   H       0        2
4:  4     D         1   J       0        2
5:  5     E         0   K       0       NA
6:  6     F         1   L       0        1
7:  7     A         0   O       1        3
8:  8     O         1   P       0        3
9:  9     A         1   P       0        1

Performance

At scale, the data.table solution should be proportionately faster than the dplyr solution; but both should be quite fast.

On the massive dataset big_df, a data.frame with over 1 million rows

# Find every combination of variables...
big_df <- expand.grid(
  start     = LETTERS,
  start_sep = 0:1,
  end       = LETTERS,
  end_sep   = 0:1
)

# ...and repeat until there are (at least) 1 million...
n_comb <- nrow(big_df)
n_rep  <- ceiling(1000000 / n_comb)

# ...with unique IDs.
big_df <- data.frame(
  id        = 1:(n_comb * n_rep),
  start     = rep(big_df$start,     n_rep),
  start_sep = rep(big_df$start_sep, n_rep),
  end       = rep(big_df$end,       n_rep),
  end_sep   = rep(big_df$end_sep,   n_rep)
)

we can measure the relative performances of each solution at scale

library(microbenchmark)

performances <- microbenchmark(
  # Repeat test 50 times, for reliability.
  times = 50,

  # Solution 1: "dplyr".
  solution_1 = {
    big_df %>%
      mutate(
        start_label    = paste0(start, " | ", start_sep),
        end_label      = paste0(end,   " | ", end_sep  ),
        start_to_start = match(start_label, start_label),
        start_to_end   = match(start_label, end_label  ),
        end_to_start   = match(end_label,   start_label),
        end_to_end     = match(end_label,   end_label  )
      ) %>%
      filter(
        (start_sep > 0 & !(is.na(start_to_start) & is.na(start_to_end))) |
        (end_sep   > 0 & !(is.na(end_to_start  ) & is.na(end_to_end  )))
      ) %>%
      mutate(match_id = pmin(
        start_to_start, start_to_end, end_to_start, end_to_end,
        na.rm = TRUE
      )) %>%
      group_by(match_id) %>%
      transmute(id, group_id = cur_group_id()) %>%
      ungroup() %>%
      right_join(big_df, by = "id") %>%
      select(id, start, start_sep, end, end_sep, group_id) %>%
      arrange(id)
  },

  # Solution 2: "data.table".
  solution_2 = {
    big_dt <- as.data.table(big_df)

    setkey(big_dt, id)[, c("start_label", "end_label") := .(
      paste0(start, " | ", start_sep),
      paste0(end,   " | ", end_sep  )
    )][, c("start_to_start", "start_to_end", "end_to_start", "end_to_end") := .(
      match(start_label, start_label),
      match(start_label, end_label  ),
      match(end_label,   start_label),
      match(end_label,   end_label  )
    )][
      (start_sep > 0 & !(is.na(start_to_start) & is.na(start_to_end))) |
      (end_sep   > 0 & !(is.na(end_to_start  ) & is.na(end_to_end  )))
    ,][, .(id, match_id = pmin(
      start_to_start, start_to_end, end_to_start, end_to_end,
      na.rm = TRUE
    ))][, ("group_id") := .GRP, by = .(match_id)
    ][big_dt, .(id, start, start_sep, end, end_sep, group_id)]
  }
)

which I have tabulated here:

#> performances

Unit: milliseconds
       expr      min       lq      mean    median        uq       max neval
 solution_1 880.1443 972.9289 1013.2868  997.5746 1059.9192 1186.8743    50
 solution_2 581.2570 606.7222  649.9858  650.2422  679.4404  734.3966    50

By converting from time to speed

library(formattable)

performances %>%
  as_tibble() %>%
  group_by(expr) %>%
  summarize(t_mean = mean(time)) %>%
  transmute(
    solution = expr,
    # Invert time to get speed; and normalize % by longest time.
    advantage = percent(max(t_mean) / t_mean - 1)
  )

we estimate that the data.table solution is (on average) about 50% faster than the dplyr solution.

# A tibble: 2 x 2
  solution   advantage
  <fct>      <formttbl>
1 solution_1 0.00%
2 solution_2 55.89%

Apply `dplyr::rowwise` in all variables

This can be done using purrr::pmap, which passes a list of arguments to a function that accepts "dots". Since most functions like mean, sd, etc. work on vectors rather than dots, you need to pair the call with a domain lifter such as purrr::lift_vd:

library(dplyr)
library(purrr)

df_1 %>% select(-y) %>% mutate(var = pmap(., lift_vd(mean)))
#        x.1      x.2      x.3      x.4      var
# 1 70.12072 62.99024 54.00672 86.81358 68.48282
# 2 49.40462 47.00752 21.99248 78.87789 49.32063

df_1 %>% select(-y) %>% mutate(var = pmap(., lift_vd(sd)))
#        x.1      x.2      x.3      x.4      var
# 1 70.12072 62.99024 54.00672 86.81358 13.88555
# 2 49.40462 47.00752 21.99248 78.87789 23.27958

The function sum accepts dots directly, so you don't need to lift its domain:

df_1 %>% select(-y) %>% mutate(var = pmap(., sum))
#        x.1      x.2      x.3      x.4      var
# 1 70.12072 62.99024 54.00672 86.81358 273.9313
# 2 49.40462 47.00752 21.99248 78.87789 197.2825

Everything conforms to standard dplyr data processing, so all three can be combined as separate arguments to mutate:

df_1 %>% select(-y) %>%
  mutate(v1 = pmap(., lift_vd(mean)),
         v2 = pmap(., lift_vd(sd)),
         v3 = pmap(., sum))
#        x.1      x.2      x.3      x.4       v1       v2       v3
# 1 70.12072 62.99024 54.00672 86.81358 68.48282 13.88555 273.9313
# 2 49.40462 47.00752 21.99248 78.87789 49.32063 23.27958 197.2825

sample with dplyr and rowwise


The very first row shows that col_1 and col_2 are different, while I
expect them to be the same.

set.seed(7) makes sure that every time you run your script, it will create the same my_df. It does not mean that every single call to sample will return the same number, so col_1 and col_2 do not need to be the same. However, if you run your code twice, both runs will give you the same col_1.
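A minimal sketch of that distinction (made-up draws; the exact values depend on your R version's RNG):

set.seed(7)
a <- sample(10, 1)   # first draw after seeding
b <- sample(10, 1)   # a second draw; usually different from 'a'

set.seed(7)
c(sample(10, 1), sample(10, 1))   # reproduces the same pair of draws, in order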

I expect col_1 and col_2 be sampled from set_diff column.

From the documentation of sample: If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Therefore, if set_diff equals 3, a sample is drawn from c(1,2,3).
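A short sketch of that pitfall, plus the resample() workaround shown in the examples of ?sample (toy values here, not the question's data):

sample(3, 1)          # draws from 1:3, not from the single value 3
sample(c(5, 7), 1)    # draws from the two values 5 and 7, as expected

# Safer helper from the examples in ?sample, for vectors that may have length 1:
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(3, 1)        # always returns 3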

Dplyr rowwise access entire column

We can use vapply

foo$nextIndex <- vapply(foo$B, function(x) which(foo$A - x > 0)[1], 1)
foo
#   A B nextIndex
# 1 1 2         3
# 2 2 2         3
# 3 3 3         4
# 4 4 4         5
# 5 5 4         5

Or another option if the values are in order

findInterval(foo$B, foo$A)+1L
#[1] 3 3 4 5 5

Using it in the dplyr chain

foo %>%
  mutate(rowIndex = findInterval(B, A) + 1L)

Dplyr rowwise not working on unnamed position identifiers

The reason the column notation .[[1]] returns all values even during the grouping is that . is not actually grouped. Basically, . is the same thing as the dataset you started with. So, when you call .[[1]], you are essentially accessing all the values in the first column.
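A quick way to see this (a made-up three-row tibble, not the question's data):

library(dplyr)

tibble(a = 1:3, b = 4:6) %>%
  rowwise() %>%
  mutate(n_seen = length(.[[1]]))
# 'n_seen' is 3 on every row: .[[1]] is the entire first column of the
# data piped in, not the current row's value.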

You may have to mutate the data and add a row_number column. This allows you to index the columns you are mutating at their corresponding row numbers. The following should do:

data %>%
  mutate(rn = row_number()) %>%
  rowwise() %>%
  mutate(min_time = min(.[[1]][rn], .[[5]][rn])) %>%
  select(-rn)

Should yield:

#    Sch1  Sch2  Sch3  Sch4  Sch5 Student.ID min_time
#   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>         <dbl>
# 1    99   292   252   859   360 Ben              99
# 2  1903   248   267   146    36 Bob              36
# 3   367   446   465   360   243 Ali             243

