Data.Table Row-Wise Sum, Mean, Min, Max Like Dplyr

data.table row-wise sum, mean, min, max like dplyr?

You can use the efficient row-wise functions from the matrixStats package.

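library(data.table)
library(matrixStats)
# .SDcols takes column names as strings; rowMins/rowMaxs need a matrix
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm = TRUE),
          MAX = rowMaxs(as.matrix(.SD), na.rm = TRUE),
          AVG = rowMeans(.SD, na.rm = TRUE),
          SUM = rowSums(.SD, na.rm = TRUE)), .SDcols = c("Q1", "Q2", "Q3", "Q4")]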

dt
# ProductName Country Q1 Q2 Q3 Q4 MIN MAX AVG SUM
# 1: Lettuce CA NA 22 51 79 22 79 50.66667 152
# 2: Beetroot FR 61 8 NA 10 8 61 26.33333 79
# 3: Spinach FR 40 NA 79 49 40 79 56.00000 168
# 4: Kale CA 54 5 16 NA 5 54 25.00000 75
# 5: Carrot CA NA NA NA NA Inf -Inf NaN 0
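
If matrixStats is not available, a similar result can be sketched with base R's vectorized pmin() and pmax(). The data below is a hypothetical reconstruction of the table printed above, and qcols is a name introduced only for this example. One difference to be aware of: for a row that is entirely NA, pmin() and pmax() with na.rm = TRUE return NA rather than Inf and -Inf.

```r
library(data.table)

# Hypothetical reconstruction of the sample data shown in the output above
dt <- data.table(
  ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
  Country     = c("CA", "FR", "FR", "CA", "CA"),
  Q1 = c(NA, 61, 40, 54, NA),
  Q2 = c(22, 8, NA, 5, NA),
  Q3 = c(51, NA, 79, 16, NA),
  Q4 = c(79, 10, 49, NA, NA)
)
qcols <- c("Q1", "Q2", "Q3", "Q4")  # illustrative helper name

# pmin/pmax compare the columns element by element, i.e. row-wise
dt[, `:=`(
  MIN = do.call(pmin, c(.SD, na.rm = TRUE)),
  MAX = do.call(pmax, c(.SD, na.rm = TRUE)),
  AVG = rowMeans(.SD, na.rm = TRUE),
  SUM = rowSums(.SD, na.rm = TRUE)
), .SDcols = qcols]

print(dt)
```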

For a dataset with 500,000 rows (using data.table from CRAN):

dt <- rbindlist(lapply(1:100000, function(i)dt))
system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
MAX = rowMaxs(as.matrix(.SD), na.rm=T),
AVG = rowMeans(.SD, na.rm=T),
SUM = rowSums(.SD, na.rm=T)), .SDcols=c("Q1", "Q2","Q3","Q4")])
# user system elapsed
# 0.089 0.004 0.093

rowwise() (or by = 1:nrow(dt)) is a "euphemism" for a for loop, as the following timings show:

library(dplyr) ; library(magrittr)
system.time(dt %>% rowwise() %>%
transmute(ProductName, Country, Q1, Q2, Q3, Q4,
MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
# user system elapsed
# 80.832 0.111 80.974

system.time(dt[, `:=`(AVG = mean(as.numeric(.SD), na.rm = TRUE),
                      MIN = min(.SD, na.rm = TRUE),
                      MAX = max(.SD, na.rm = TRUE),
                      SUM = sum(.SD, na.rm = TRUE)),
               .SDcols = c("Q1", "Q2", "Q3", "Q4"), by = 1:nrow(dt)])
# user system elapsed
# 141.492 0.196 141.757

dplyr rowwise sum and other functions like max

In short: you are expecting sum() to be aware of dplyr data structures such as a data frame grouped by row. sum() knows nothing about that grouping, so it simply sums the whole data frame.

Here is a brief explanation. This:

select(iris, starts_with('Petal')) %>% rowwise() %>% sum()

Can be rewritten without using the pipe operator as the following:

data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)

As you can see, select() returns a data frame (a tibble, if you start from one). The rowwise() call then attaches extra information to this object, marking it as grouped row by row.

However, only functions that are aware of this grouping, such as summarize() and mutate(), work as intended. Base R functions like sum() are not aware of it and treat the object as an ordinary data frame, and for an ordinary data frame sum() adds up every value in it.

Using mutate works:

select(iris, starts_with('Petal')) %>%
rowwise() %>%
mutate(sum = sum(Petal.Width, Petal.Length))

Result:

Source: local data frame [150 x 3]
Groups: <by row>

# A tibble: 150 x 3
Petal.Length Petal.Width sum
<dbl> <dbl> <dbl>
1 1.40 0.200 1.60
2 1.40 0.200 1.60
3 1.30 0.200 1.50
...
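
As a small check of the explanation above, the sketch below compares sum() on the row-wise object with a vectorized per-row sum. It uses rowSums(across(...)) as one possible alternative that avoids rowwise() entirely; grand_total and with_sum are names introduced only for this example.

```r
library(dplyr)

petals <- select(iris, starts_with("Petal"))

# sum() ignores the row-wise grouping and returns one grand total
grand_total <- sum(rowwise(petals))

# Vectorized per-row sums without rowwise()
with_sum <- petals %>%
  mutate(sum = rowSums(across(starts_with("Petal"))))

head(with_sum, 3)
```

The grand total equals the sum of all per-row sums, which is why piping a row-wise data frame into sum() yields a single number.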

dplyr: max and min between a row and a specific value


library(tidyverse)
library(hms)

data <- tribble(
~Type, ~Start,
"A", "19:30:00",
"B", "18:45:00"
) %>%
mutate(Start = hms::as_hms(Start))

data %>%
mutate(
max_value = if_else(Start > as_hms("19:00:00"), Start, as_hms("19:00:00"))
)


# # A tibble: 2 x 3
# Type Start max_value
# <chr> <time> <time>
# 1 A 19:30 19:30
# 2 B 18:45 19:00

Return statistics like min or max from columns into rows with dplyr pipeline

I would suggest combining tidyverse functions as shown below. Reshape your data to long format, aggregate with the summary functions you want, and then reshape again to obtain the expected output:

library(tidyverse)

sampleData %>% pivot_longer(cols = names(sampleData)) %>%
group_by(name) %>% summarise(Min=min(value,na.rm=T),
Max=max(value,na.rm=T)) %>%
rename(var=name) %>%
pivot_longer(cols = -var) %>%
pivot_wider(names_from = var,values_from=value)

The output:

# A tibble: 2 x 6
name A I O R U
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Min -2.21 0.197 1 1 1
2 Max 2.40 14.3 9 81 81
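
Because sampleData itself is not shown, here is a minimal runnable sketch of the same pipeline on a small hypothetical tibble (the values are made up for illustration). It uses everything() in place of names(sampleData), which selects the same columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for sampleData
sampleData <- tibble(
  A = c(-1.5, 0.3, 2.0),
  I = c(0.2, 14, 3),
  O = c(1, 9, 4)
)

result <- sampleData %>%
  pivot_longer(cols = everything()) %>%            # one row per (column, value)
  group_by(name) %>%
  summarise(Min = min(value, na.rm = TRUE),
            Max = max(value, na.rm = TRUE)) %>%    # one row per original column
  rename(var = name) %>%                           # free up the default "name" column
  pivot_longer(cols = -var) %>%                    # one row per (column, statistic)
  pivot_wider(names_from = var, values_from = value)

print(result)
```

The rename() step matters: the second pivot_longer() creates its own name column, so the first one has to be moved out of the way.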

Check row-wise NA sum in R data.table

You can use rowSums like this:

library(data.table)
dt[rowSums(!is.na(dt[, ..rel_cols])) > 0]

# x y z a
#1: A true <NA> <NA>
#2: C <NA> true <NA>
#3: D true true ha

Or using .SDcols:

dt[dt[, rowSums(!is.na(.SD)) > 0, .SDcols = rel_cols]]
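
To make both forms runnable, here is a sketch with hypothetical data chosen to reproduce the output above. The dropped row "B" and the definition of rel_cols are assumptions, since the original question's data is not shown.

```r
library(data.table)

# Hypothetical data: row "B" has NA in every relevant column
dt <- data.table(
  x = c("A", "B", "C", "D"),
  y = c("true", NA, NA, "true"),
  z = c(NA, NA, "true", "true"),
  a = c(NA, NA, NA, "ha")
)
rel_cols <- c("y", "z", "a")  # assumed set of columns to check

# Keep rows that have at least one non-NA value among rel_cols
kept1 <- dt[rowSums(!is.na(dt[, ..rel_cols])) > 0]
kept2 <- dt[dt[, rowSums(!is.na(.SD)) > 0, .SDcols = rel_cols]]

print(kept1)
```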

What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?

It seems like the processing time for the rowwise approach explodes quadratically with the number of rows.

Pivoting longer makes the calculation about 300x faster. For 50k rows, the code below took 1.2 seconds, compared to 372 seconds for the rowwise approach.

df %>%
mutate(row = row_number()) %>%
tidyr::pivot_longer(-row) %>%
group_by(row) %>%
summarize(mean = mean(value),
var = var(value)) %>%
bind_cols(df, .)
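
If stepping outside dplyr for the arithmetic is acceptable, the same statistics can be sketched with base R matrix operations. This assumes every column of df is numeric; the df below is hypothetical random data, and no timings are claimed for it.

```r
set.seed(42)
# Hypothetical stand-in for the large tibble: 1000 rows, 10 numeric columns
df <- as.data.frame(matrix(rnorm(1000 * 10), ncol = 10))

m <- as.matrix(df)
row_mean <- rowMeans(m)

# Sample variance per row: sum of squared deviations divided by (n - 1).
# m - row_mean recycles row_mean down each column, so row i is centred
# by row_mean[i].
row_var <- rowSums((m - row_mean)^2) / (ncol(m) - 1)

result <- cbind(df, mean = row_mean, var = row_var)
head(result[, c("mean", "var")], 3)
```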

Computing row-wise zscores in R dataframe

Using base R

df[4:6] <- t(scale(t(df[4:6]), center = TRUE, scale = TRUE))

Or with tidyverse

library(dplyr)
library(tidyr)
library(purrr)
df %>%
mutate(out = pmap(across(where(is.numeric)),
~ scale(c(...), center = TRUE, scale = TRUE)), .keep = 'unused') %>%
unnest_wider(out) %>%
mutate(across(D:G, c))
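
As a quick check of the base R approach, the sketch below applies it to a small hypothetical data frame whose numeric columns sit in positions 4 to 6 (the column names are made up). scale() standardizes columns, so the data is transposed before and after to standardize rows instead.

```r
# Hypothetical data frame: three descriptive columns, then three numeric ones
df <- data.frame(
  id   = 1:3,
  grp  = c("a", "b", "c"),
  note = c("x", "y", "z"),
  v1   = c(1, 2, 3),
  v2   = c(4, 6, 8),
  v3   = c(7, 7, 13)
)

# Transpose, scale each (former) row as a column, transpose back
df[4:6] <- t(scale(t(df[4:6]), center = TRUE, scale = TRUE))

# After scaling, every row should have mean 0 and standard deviation 1
row_means <- rowMeans(df[4:6])
row_sds   <- apply(df[4:6], 1, sd)
```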

Preferred performant procedure for R data.table row-wise operations?

I think you can use matrix multiplication and other vectorization techniques to simplify your code, so that you no longer need to call logpost once per row.


Below is a vectorized version of logpost, named logpost2:

logpost2 <- function(d, dd, mub = 1, taub = 10, a = 0.5, z = 0.7) {
bmat <- as.matrix(dd[, .(b1, b2, b3)])
xmat <- cbind(1, as.matrix(d[, .(x1, x2)]))
phi <- dd$phi
phi_log <- log(phi)
lp <- -(a + nrow(d) + 1) * phi_log -
(1 / (2 * phi^2)) * colSums((d$y - tcrossprod(xmat, bmat))^2) -
(1 / (2 * taub^2)) * rowSums((bmat - mub)^2) - (z / phi)
lp
}

and you will see

> start <- Sys.time()

> grid[, lp := logpost2(d, .SD)]

> difftime(Sys.time(), start)
Time difference of 0.1966231 secs

and

> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152
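
The key step in logpost2 is colSums((d$y - tcrossprod(xmat, bmat))^2), which computes the residual sum of squares for every candidate coefficient row at once. The sketch below checks that idea against an explicit per-row loop; the small xmat, y, and bmat here are hypothetical values made up for the comparison, not the original data.

```r
set.seed(1)
# Hypothetical design matrix: 5 observations, intercept plus 2 predictors
xmat <- cbind(1, matrix(rnorm(10), ncol = 2))
y    <- rnorm(5)

# Three hypothetical candidate coefficient vectors, one per row
bmat <- rbind(c(0.00, 1, -1.5),
              c(0.05, 1, -1.5),
              c(0.10, 1, -1.5))

# Vectorized: tcrossprod(xmat, bmat) is xmat %*% t(bmat), a 5 x 3 matrix whose
# column j holds the fitted values for candidate j. y is recycled down each column.
sse_vec <- colSums((y - tcrossprod(xmat, bmat))^2)

# Row-wise equivalent: one matrix-vector product per candidate
sse_loop <- sapply(seq_len(nrow(bmat)), function(i) {
  sum((y - xmat %*% bmat[i, ])^2)
})
```

Both vectors hold one sum of squares per candidate, but the vectorized form does the work in a single matrix product.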

