data.table row-wise sum, mean, min, max like dplyr?
You can use the efficient row-wise functions from the matrixStats package (rowMins and rowMaxs), together with rowMeans and rowSums from base R.
library(data.table)
library(matrixStats)
# Column names in .SDcols must be quoted strings
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=TRUE),
          MAX = rowMaxs(as.matrix(.SD), na.rm=TRUE),
          AVG = rowMeans(.SD, na.rm=TRUE),
          SUM = rowSums(.SD, na.rm=TRUE)), .SDcols=c("Q1", "Q2", "Q3", "Q4")]
dt
# ProductName Country Q1 Q2 Q3 Q4 MIN MAX AVG SUM
# 1: Lettuce CA NA 22 51 79 22 79 50.66667 152
# 2: Beetroot FR 61 8 NA 10 8 61 26.33333 79
# 3: Spinach FR 40 NA 79 49 40 79 56.00000 168
# 4: Kale CA 54 5 16 NA 5 54 25.00000 75
# 5: Carrot CA NA NA NA NA Inf -Inf NaN 0
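To try this without the original data, the table printed above can be rebuilt by hand. The sketch below is only an illustration: the `dt` constructor is reconstructed from the printed output (the asker's own code is not shown), `qcols` is a made-up helper name, and it assumes data.table and matrixStats are installed.

```r
library(data.table)
library(matrixStats)

# Reconstructed from the printed output above; the original constructor is not shown.
dt <- data.table(
  ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
  Country     = c("CA", "FR", "FR", "CA", "CA"),
  Q1 = c(NA, 61, 40, 54, NA),
  Q2 = c(22,  8, NA,  5, NA),
  Q3 = c(51, NA, 79, 16, NA),
  Q4 = c(79, 10, 49, NA, NA)
)
qcols <- c("Q1", "Q2", "Q3", "Q4")  # hypothetical helper name

# Same call as in the answer: each statistic is computed over whole columns at once
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm = TRUE),
          MAX = rowMaxs(as.matrix(.SD), na.rm = TRUE),
          AVG = rowMeans(.SD, na.rm = TRUE),
          SUM = rowSums(.SD, na.rm = TRUE)), .SDcols = qcols]
```

The all-NA row (Carrot) yields Inf, -Inf and NaN, as in the output above. If that is unwanted, a follow-up step such as dt[!is.finite(MIN), MIN := NA_real_] could replace those values.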
For a dataset with 500,000 rows (using data.table from CRAN):
dt <- rbindlist(lapply(1:100000, function(i)dt))
system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
MAX = rowMaxs(as.matrix(.SD), na.rm=T),
AVG = rowMeans(.SD, na.rm=T),
SUM = rowSums(.SD, na.rm=T)), .SDcols=c("Q1", "Q2","Q3","Q4")])
# user system elapsed
# 0.089 0.004 0.093
rowwise (or by=1:nrow(dt)) is a "euphemism" for a for loop, as exemplified by:
library(dplyr) ; library(magrittr)
system.time(dt %>% rowwise() %>%
transmute(ProductName, Country, Q1, Q2, Q3, Q4,
MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
# user system elapsed
# 80.832 0.111 80.974
system.time(dt[, `:=`(AVG = mean(as.numeric(.SD), na.rm=TRUE),
                      MIN = min(.SD, na.rm=TRUE),
                      MAX = max(.SD, na.rm=TRUE),
                      SUM = sum(.SD, na.rm=TRUE)),
               .SDcols=c("Q1", "Q2","Q3","Q4"), by=1:nrow(dt)])
# user system elapsed
# 141.492 0.196 141.757
dplyr rowwise sum and other functions like max
In short: you are expecting the "sum" function to be aware of dplyr data structures such as a data frame grouped by row. sum is not aware of them, so it just takes the sum of the whole data.frame.
Here is a brief explanation. This:
select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
Can be rewritten without using the pipe operator as the following:
data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)
As you can see, you were constructing something called a tibble. The rowwise call then adds information to this object, specifying that it should be grouped row-wise.
However, only functions that are aware of this grouping, like summarize and mutate, work as intended. Base R functions like sum are not aware of these objects and treat them like any standard data.frame. And the standard behaviour of sum() is to sum the entire data frame.
Using mutate works:
select(iris, starts_with('Petal')) %>%
rowwise() %>%
mutate(sum = sum(Petal.Width, Petal.Length))
Result:
Source: local data frame [150 x 3]
Groups: <by row>
# A tibble: 150 x 3
Petal.Length Petal.Width sum
<dbl> <dbl> <dbl>
1 1.40 0.200 1.60
2 1.40 0.200 1.60
3 1.30 0.200 1.50
...
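The difference can be checked directly. The sketch below contrasts sum() on a plain data frame with the rowwise() + mutate() version, and adds a vectorised alternative using rowSums(across(...)); that last variant is not from the answer above and assumes dplyr 1.0.0 or later. The variable names are made up for illustration.

```r
library(dplyr)

petals <- select(iris, starts_with("Petal"))

# sum() on a plain data frame collapses every cell into a single number
total <- sum(petals)

# Row-wise sum via rowwise() + mutate(), as in the answer above
by_row <- petals %>%
  rowwise() %>%
  mutate(sum = sum(Petal.Width, Petal.Length)) %>%
  ungroup()

# Vectorised alternative (assumption: dplyr >= 1.0.0 for across())
vectorised <- petals %>%
  mutate(sum = rowSums(across(starts_with("Petal"))))
```

Both by_row$sum and vectorised$sum hold one value per row, whereas total is a single number. On larger data the rowSums variant avoids the per-row overhead of rowwise().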
dplyr: max and min between a row and a specific value
library(tidyverse)
library(hms)
data <- tribble(
~Type, ~Start,
"A", "19:30:00",
"B", "18:45:00"
) %>%
mutate(Start = hms::as_hms(Start))
data %>%
mutate(
max_value = if_else(Start > as_hms("19:00:00"), Start, as_hms("19:00:00"))
)
# # A tibble: 2 x 3
# Type Start max_value
# <chr> <time> <time>
# 1 A 19:30 19:30
# 2 B 18:45 19:00
Return statistics like min or max from columns into rows with dplyr pipeline
I would suggest a mix of tidyverse functions, as shown next. You have to reshape your data to long format, aggregate with the summary functions you want, and then reshape again to obtain the expected output:
library(tidyverse)
sampleData %>% pivot_longer(cols = names(sampleData)) %>%
group_by(name) %>% summarise(Min=min(value,na.rm=T),
Max=max(value,na.rm=T)) %>%
rename(var=name) %>%
pivot_longer(cols = -var) %>%
pivot_wider(names_from = var,values_from=value)
The output:
# A tibble: 2 x 6
name A I O R U
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Min -2.21 0.197 1 1 1
2 Max 2.40 14.3 9 81 81
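Since sampleData is not shown, the reshape-aggregate-reshape steps can be followed on a small stand-in. In the sketch below the tibble and its values are hypothetical, and everything() is used in place of names(sampleData), which selects the same columns here.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for sampleData (the original data is not shown)
sampleData <- tibble(
  A = c(-1.5, 0.3, 2.0),
  I = c(0.2, 7.1, NA),
  O = c(1, 5, 9)
)

stats <- sampleData %>%
  pivot_longer(cols = everything()) %>%       # long: one row per (column, value)
  group_by(name) %>%
  summarise(Min = min(value, na.rm = TRUE),
            Max = max(value, na.rm = TRUE)) %>%
  rename(var = name) %>%
  pivot_longer(cols = -var) %>%               # long again: one row per (column, statistic)
  pivot_wider(names_from = var, values_from = value)  # statistics become rows
```

The result has one row per statistic (Min, Max) and one column per original variable, matching the layout of the output above.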
Check row-wise NA sum in R data.table
You can use rowSums like this:
library(data.table)
dt[rowSums(!is.na(dt[, ..rel_cols])) > 0]
# x y z a
#1: A true <NA> <NA>
#2: C <NA> true <NA>
#3: D true true ha
Or using .SDcols:
dt[dt[, rowSums(!is.na(.SD)) > 0, .SDcols = rel_cols]]
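The filter can be tried on a small table. In the sketch below the data are hypothetical, shaped to resemble the printed output, and the logical vector is stored in a made-up intermediate variable `keep` so that each step can be inspected.

```r
library(data.table)

# Hypothetical data resembling the printed output above
dt <- data.table(
  x = c("A", "B", "C", "D"),
  y = c("true", NA, NA, "true"),
  z = c(NA, NA, "true", "true"),
  a = c(NA, NA, NA, "ha")
)
rel_cols <- c("y", "z", "a")

# TRUE for rows that have at least one non-NA value among rel_cols
keep <- dt[, rowSums(!is.na(.SD)) > 0, .SDcols = rel_cols]

# Row "B" is dropped because y, z and a are all NA there
res <- dt[keep]
```

Changing the comparison to == length(rel_cols) would instead keep only rows where every relevant column is filled.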
What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?
It seems like the processing time for the rowwise approach explodes quadratically.
Pivoting longer makes the calculation about 300x faster. For 50k rows, the code below took 1.2 seconds, compared to 372 seconds for the rowwise approach.
df %>%
mutate(row = row_number()) %>%
tidyr::pivot_longer(-row) %>%
group_by(row) %>%
summarize(mean = mean(value),
var = var(value)) %>%
bind_cols(df, .)
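As a sanity check on the pivoting approach, the sketch below runs the same pipeline on a small made-up tibble (this `df` is hypothetical, not the asker's data) and computes the same quantities with base R for comparison.

```r
library(dplyr)
library(tidyr)

set.seed(1)
# Hypothetical all-numeric tibble
df <- tibble(a = rnorm(5), b = rnorm(5), c = rnorm(5))

res <- df %>%
  mutate(row = row_number()) %>%    # remember which row each value came from
  pivot_longer(-row) %>%            # one row per (row id, column, value)
  group_by(row) %>%
  summarize(mean = mean(value),
            var  = var(value))

# Base R reference values for comparison
base_mean <- unname(rowMeans(as.matrix(df)))
base_var  <- unname(apply(as.matrix(df), 1, var))
```

res$mean and res$var should agree with base_mean and base_var up to floating-point tolerance, which can be confirmed with all.equal().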
Computing row-wise zscores in R dataframe
Using base R:
df[4:6] <- t(scale(t(df[4:6]), center = TRUE, scale = TRUE))
Or with tidyverse:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
mutate(out = pmap(across(where(is.numeric)),
~ scale(c(...), center = TRUE, scale = TRUE)), .keep = 'unused') %>%
unnest_wider(out) %>%
mutate(across(D:G, c))
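The base R one-liner works because scale() standardises columns, so transposing before and after turns it into a row-wise operation. A minimal sketch on a made-up matrix (the values are hypothetical) shows the effect:

```r
# Hypothetical 2 x 3 matrix: each row is standardised independently
m <- matrix(c(1, 2, 3,
              4, 6, 8), nrow = 2, byrow = TRUE)

# t() makes rows into columns, scale() centres and scales each column,
# and the outer t() restores the original orientation
z <- t(scale(t(m), center = TRUE, scale = TRUE))

row_means <- rowMeans(z)        # approximately 0 for every row
row_sds   <- apply(z, 1, sd)    # approximately 1 for every row
```

Here both rows become -1, 0, 1: the first has mean 2 and standard deviation 1, the second has mean 6 and standard deviation 2.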
Preferred performant procedure for R data.table row-wise operations?
I think you can use matrix multiplication and other vectorization techniques to simplify your code, which lets you avoid running the function logpost in a row-wise manner.
Below is a vectorized version of logpost, i.e., logpost2:
logpost2 <- function(d, dd, mub = 1, taub = 10, a = 0.5, z = 0.7) {
bmat <- as.matrix(dd[, .(b1, b2, b3)])
xmat <- cbind(1, as.matrix(d[, .(x1, x2)]))
phi <- dd$phi
phi_log <- log(phi)
lp <- -(a + nrow(d) + 1) * phi_log -
(1 / (2 * phi^2)) * colSums((d$y - tcrossprod(xmat, bmat))^2) -
(1 / (2 * taub^2)) * rowSums((bmat - mub)^2) - (z / phi)
lp
}
and you will see
> start <- Sys.time()
> grid[, lp := logpost2(d, .SD)]
> difftime(Sys.time(), start)
Time difference of 0.1966231 secs
and
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152
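The key step in logpost2 is the tcrossprod() call, which computes the fitted values for every candidate coefficient row at once. The sketch below isolates that step on made-up data (X, y and B are hypothetical, not the asker's d and grid) and compares it with an explicit loop over the candidates.

```r
set.seed(42)
n <- 6
X <- cbind(1, matrix(rnorm(n * 2), ncol = 2))  # n x 3 design matrix: intercept, x1, x2
y <- rnorm(n)
B <- rbind(c(0.00, 1, -1.5),                   # one candidate (b1, b2, b3) per row
           c(0.05, 1, -1.5),
           c(0.10, 1, -1.5))

# Row-wise version: residual sum of squares for one candidate at a time
rss_loop <- vapply(seq_len(nrow(B)),
                   function(j) sum((y - X %*% B[j, ])^2),
                   numeric(1))

# Vectorised version: column j of tcrossprod(X, B) holds the fitted values
# for candidate j, and y is recycled down each column
rss_vec <- colSums((y - tcrossprod(X, B))^2)
```

Both vectors hold one residual sum of squares per candidate and should match up to floating-point tolerance; the vectorised form replaces the per-row calls with a single matrix product.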