Efficient DataTable Group By

C# DataTable: More efficient Group By and Sum?

// your method
public void YourMethod()
{
    Dictionary<int, int> result = new Dictionary<int, int>();

    // iterate over the row count of the larger table
    int length = 0;

    if (dt1.Rows.Count > dt2.Rows.Count)
        length = dt1.Rows.Count;
    else
        length = dt2.Rows.Count;

    for (int i = 0; i < length; i++)
    {
        AddRowValue(dt1, result, i);
        AddRowValue(dt2, result, i);
    }
}

public void AddRowValue(DataTable tbl, Dictionary<int, int> dic, int index)
{
    // skip the shorter table once its rows are exhausted
    if (index >= tbl.Rows.Count)
        return;

    DataRow row = tbl.Rows[index];

    int idValue = Convert.ToInt32(row["ID"]);
    int quantityValue = Convert.ToInt32(row["Quantity"]);

    // accumulate the quantity per ID
    if (dic.ContainsKey(idValue))
        dic[idValue] = dic[idValue] + quantityValue;
    else
        dic.Add(idValue, quantityValue);
}

You need something like this: use a dictionary, and at the end the result (the summed Quantity per ID) will be stored in it.

Efficient way to group data.table results by year

When you are doing aggregate operations (grouping) with data.tables, especially for large data sets, you should set the field you are grouping by as a key (using setkeyv(DT, "your_key_field"), etc...). Also, I can't speak definitively on the topic, but generally I think you will get better performance from using native data.table:: functions / operations within your data.table object than you would when using other packages' functions, like plyr::count for example.

Below, I made a few data.table objects - the first is identical to your example; the second adds a column Year (instead of calculating format(Date,"%Y") at the time of function execution), but sets Date as the key; and the third is the same as the second, except that it uses Year as the key. I also made a few functions (for benchmarking convenience) that do the grouping in different ways.

library(data.table)
library(plyr) # for 'count' function
library(microbenchmark)
##
dates <- seq.Date(
  from=as.Date("2000-01-01"),
  to=as.Date("2012-12-31"),
  by="day")
##
set.seed(123)
sampleDate <- sample(
  dates,
  1e06,
  replace=TRUE)
##
DT.dt <- data.table(
  Date=sampleDate,
  incident=1)
##
DT.dt2 <- copy(DT.dt)
DT.dt2[,Year:=format(Date,"%Y")]
setkeyv(DT.dt2,"Date")
##
DT.dt3 <- copy(DT.dt2)
setkeyv(DT.dt3,"Year")
##
> head(DT.dt,3)
Date incident
1: 2003-09-27 1
2: 2010-04-01 1
3: 2005-04-26 1
> head(DT.dt2,3)
Date incident Year
1: 2000-01-01 1 2000
2: 2000-01-01 1 2000
3: 2000-01-01 1 2000
> head(DT.dt3,3)
Date incident Year
1: 2000-01-01 1 2000
2: 2000-01-01 1 2000
3: 2000-01-01 1 2000

## your original method
f1 <- function(dt)
{
  dt[,count(format(Date,"%Y"))]
}
## your method - using 'Year' column
f1.2 <- function(dt)
{
  dt[,count(Year)]
}
## use 'Date' column; '.N' and
## 'by=' instead of 'count'
f2 <- function(dt)
{
  dt[,.N,by=format(Date,"%Y")]
}
## use 'Year' and '.N','by='
f3 <- function(dt)
{
  dt[,.N,by=Year]
}
##
Res <- microbenchmark(
  f1(DT.dt),
  f1.2(DT.dt2),
  f1.2(DT.dt3),
  f2(DT.dt2),
  f3(DT.dt3))
##
> Res
Unit: milliseconds
expr min lq median uq max neval
f1(DT.dt) 478.941767 515.144253 557.428159 585.579862 706.8724 100
f1.2(DT.dt2) 98.722062 115.588034 126.332104 137.792116 223.4967 100
f1.2(DT.dt3) 97.475673 118.134788 125.836817 136.136156 238.2697 100
f2(DT.dt2) 352.767219 373.337958 387.759996 429.301164 542.1674 100
f3(DT.dt3) 7.912803 8.441159 8.736887 9.685267 76.9629 100

Observations:

  1. Grouping by the precalculated field Year instead of calculating
    format(Date,"%Y") at execution time was a definite improvement -
    for both the count and the .N approaches. You can see this by
    comparing the f1() and f2() times to the f1.2() times.

  2. The count approach seemed to be a little slower than the .N & 'by=' approach (f1() compared to f2()).

  3. The best approach by far was to use the precalculated field Year and the native data.table grouping .N & by=; f3() was considerably faster than the other four timings.

There are some pretty experienced data.table users on SO, certainly more so than myself, so there may be an even faster way to do this. All else aside, though, it's definitely a good idea to set a key on your data.table; and it certainly seems like you would be much better off precalculating a field like Year than doing so "on the fly"; you can always delete it afterwards if you don't need it by using DT.dt[,Year:=NULL].
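
As a rough sketch of that workflow (using the DT.dt object from above; the result name incidentsPerYear is just for illustration):

## precalculate the grouping field once, and key on it
DT.dt[, Year := format(Date, "%Y")]
setkeyv(DT.dt, "Year")

## aggregate with native data.table grouping
incidentsPerYear <- DT.dt[, .N, by = Year]

## drop the helper column again if you no longer need it
DT.dt[, Year := NULL]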

Also, you said you are trying to count the number of incidents per year - and since your example data had incident = 1 for all rows, counting was the same as summing. But assuming your real data has different values of incident, you could do something like this:

> DT.dt3[,list(Incidents=sum(incident)),by=Year]
Year Incidents
1: 2000 77214
2: 2001 77385
3: 2002 77080
4: 2003 76609
5: 2004 77197
6: 2005 76994
7: 2006 76560
8: 2007 76904
9: 2008 76786
10: 2009 76765
11: 2010 76675
12: 2011 76868
13: 2012 76963

(where I called setkeyv(DT.dt3,cols="Year") above).

Efficient DataTable Group By

You may use LINQ:

// requires using System.Linq; AsEnumerable()/Field<T>() come from
// System.Data.DataSetExtensions
var result = from row in dt.AsEnumerable()
             group row by row.Field<int>("TeamID") into grp
             select new
             {
                 TeamID = grp.Key,
                 MemberCount = grp.Count()
             };

foreach (var t in result)
    Console.WriteLine(t.TeamID + " " + t.MemberCount);

Performance of Grouped Operations with data.table

I think your suspicion is correct. As to why, we could make an educated guess.

EDIT: the information below ignores what data.table does with GForce optimizations; the functions GForce supports probably avoid copies of the data similarly to your C++ code, but, as Frank mentioned, := didn't try to detect GForce-optimizable expressions before 1.14.3, and addrs below is certainly not optimized by GForce.
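
(As a side note, not from the original answer: one way to check whether GForce picks up a particular grouped expression is the verbose argument of [.data.table, e.g. with the DT defined further below.)

## verbose = TRUE makes data.table print its query-optimisation details,
## including whether GForce was applied to j; for supported aggregations
## such as sum() or mean() it typically reports a line like
## "GForce optimized j to ..."
DT[, sum(z), by = w, verbose = TRUE]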

The C++ code you wrote doesn't modify anything from its input data; it only has to allocate out. As you correctly said, data.table aims to be more flexible, and must support any valid R code the users provide. That makes me think that, for grouped operations, it would have to allocate new R vectors for each subset of z according to the group indices, effectively copying some of the input multiple times.

I wanted to try to validate my assumption, so I wrote some C++ code to look at memory addresses:

library(data.table)
library(Rcpp)  # for cppFunction()

set.seed(31L)
DT <- data.table(w = sample(1:3, size = 20, replace = TRUE),
                 x = sample(1:3, size = 20, replace = TRUE),
                 y = sample(1:3, size = 20, replace = TRUE),
                 z = runif(n = 20, min = 0, max = 1))

setkey(DT, w, x, y)

cppFunction(plugins = "cpp11", includes = "#include <sstream>", '
  StringVector addrs(NumericVector z, IntegerVector indices, IntegerVector group_ids) {
    StringVector ans(indices.size());
    for (auto i = 0; i < indices.size(); ++i) {
      std::ostringstream oss;
      oss << "Group" << group_ids[i] << " " << &z[indices[i]];
      ans[i] = oss.str();
    }
    return ans;
  }
')

idx <- DT[, .(indices = .I[1L] - 1L, group_ids = .GRP), keyby = list(w, x, y)]

DT[, list(addrs(z, idx$indices, idx$group_ids))]
# V1
# 1: Group1 0x55b7733361f8
# 2: Group2 0x55b773336200
# 3: Group3 0x55b773336210
# 4: Group4 0x55b773336220
# 5: Group5 0x55b773336228
# 6: Group6 0x55b773336230
# 7: Group7 0x55b773336238
# 8: Group8 0x55b773336248
# 9: Group9 0x55b773336250
# 10: Group10 0x55b773336258
# 11: Group11 0x55b773336260
# 12: Group12 0x55b773336270
# 13: Group13 0x55b773336280
# 14: Group14 0x55b773336288
# 15: Group15 0x55b773336290

As expected here, if we look at z as a whole, no copy takes place and the addresses of the different elements are close to each other, for example 0x55b773336200 = 0x55b7733361f8 + 0x8. You can execute the last line from the previous code multiple times and it will always show the same addresses.

What I partially didn't expect is this:

DT[, list(addrs(z, 0L, .GRP)), keyby = list(w, x, y)]
# w x y V1
# 1: 1 1 2 Group1 0x55b771b03440
# 2: 1 1 3 Group2 0x55b771b03440
# 3: 1 2 1 Group3 0x55b771b03440
# 4: 1 2 2 Group4 0x55b771b03440
# 5: 1 3 2 Group5 0x55b771b03440
# 6: 1 3 3 Group6 0x55b771b03440
# 7: 2 1 1 Group7 0x55b771b03440
# 8: 2 2 1 Group8 0x55b771b03440
# 9: 2 2 2 Group9 0x55b771b03440
# 10: 2 2 3 Group10 0x55b771b03440
# 11: 2 3 1 Group11 0x55b771b03440
# 12: 3 2 1 Group12 0x55b771b03440
# 13: 3 2 2 Group13 0x55b771b03440
# 14: 3 2 3 Group14 0x55b771b03440
# 15: 3 3 3 Group15 0x55b771b03440

On the one hand, the address in memory did change, so something was copied, and if you run this multiple times you will see different addresses each time. However, it seems like data.table somehow reuses a buffer with the same address, maybe allocating one array with the length of the biggest group-subset and copying the different group values to it? I wonder how they manage that. Or maybe my code is wrong ¯\_(ツ)_/¯
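
A pure-R cross-check of the same observation (not part of the original answer, and assuming your data.table version exports address()):

## inside a grouped j, z is the per-group subset, so address(z) reports the
## address of that group's buffer; if data.table reuses one buffer, the same
## address should appear for every group, mirroring the Rcpp output above
DT[, .(addr = address(z)), keyby = list(w, x, y)]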

EDIT: I added the following as the first line of addrs (double escape because of R parsing):

Rcout << "input length = " << z.size() << "\\n";

and if I run the last DT[...] code from above, it does print different lengths even though the address is the same.

Efficient way of adding row with difference from specific group in data table

You could do

require(data.table)
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3))
dt1[,c:=sum(counts[cat=="A"])][,.(ind=c(ind,0), counts=c(counts,c[.N]-sum(counts))),cat][]
# cat ind counts
# 1: A 1 10
# 2: A 2 9
# 3: A 3 8
# 4: A 0 0
# 5: B 4 7
# 6: B 5 6
# 7: B 0 14
# 8: C 6 5
# 9: C 7 4
# 10: C 0 18
# 11: D 8 3
# 12: D 0 24
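
The chained expression above is fairly dense; a rough step-by-step equivalent (using a hypothetical helper variable total_A instead of the temporary column c) looks like this:

## 1. total counts of the reference group "A" (27 here)
total_A <- dt1[cat == "A", sum(counts)]

## 2. per cat, append a row with ind = 0 and
##    counts = total_A minus that group's own total
dt1[, .(ind = c(ind, 0),
        counts = c(counts, total_A - sum(counts))), by = cat]

Unlike the one-liner above, this version does not leave the helper column c behind on dt1.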

What is the most efficient method for finding row indices by group in a data.table in R?

Using .I, this option returns a data.table with two columns. The first column is the unique values in a, and the second is a list of indices where each value appears in k. The form is different than the OP's m, but the information is all there and just as easily accessible.

k[, .(idx = .(.I)), a]
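
For example (a hypothetical lookup, reusing the k defined in the benchmark below), the indices for one level of a can be pulled straight out of the list column:

m2 <- k[, .(idx = .(.I)), a]

## all row indices of k where a == "5"
m2[a == "5", idx[[1]]]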

Benchmarking:

library(data.table)

k <- data.table(a = sample(factor(seq_len(200)), size = 1e6, replace = TRUE))

microbenchmark::microbenchmark(
  A = {
    u <- unique(k$a)
    m <- lapply(u, function(x) k[a == x, which = TRUE])
  },
  B = {
    m2 <- k[, .(idx = .(.I)), a]
  },
  times = 100
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> A 282.0331 309.2662 335.30146 325.3355 350.51080 525.7929 100
#> B 9.7870 10.3598 13.04379 10.8292 12.73785 65.4864 100

all.equal(m, m2$idx)
#> [1] TRUE

all.equal(u, m2$a)
#> [1] TRUE

Efficiently Repeating Observations by Group

A data.table merge won't give you the same ordering, but you aren't supposed to rely on row order in data.tables anyway:
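
(DT and rep here come from the original question; a minimal reconstruction consistent with the output below might be:)

library(data.table)

## hypothetical reconstruction of the question's objects; the number of times
## each x value appears in rep is how many copies the matching DT rows get
DT  <- data.table(x = c("A", "A", "B", "B", "C", "C"), y = 1:6)
rep <- c("A", "A", "A", "B", "B", "C")   # note: masks base::rep in this sketch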

merge(DT, data.frame(x=rep), by="x")

x y
1: A 1
2: A 1
3: A 1
4: A 2
5: A 2
6: A 2
7: B 3
8: B 3
9: B 4
10: B 4
11: C 5
12: C 6

Efficient filtering through multiple columns by group

Here is an alternative tidyverse approach.

library(dplyr)

my_fun <- function(.data) {
  .data %>%
    group_by(id) %>%
    filter(!grepl("X", paste(var1, var2, var3, collapse = ""))) %>%
    ungroup()
}

my_fun(df)

# # A tibble: 2 x 4
# id var1 var2 var3
# <int> <chr> <chr> <chr>
# 1 3 Z1 Z1 Z1
# 2 3 Z2 Z2 Z2

## df_fun additionally uses purrr::reduce() and dplyr::across()
library(purrr)

df_fun <- function(.data) {
  .data %>%
    group_by(id) %>%
    filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`))) %>%
    ungroup()
}

performance <- bench::mark(
  my_fun(df),
  df_fun(df)
)

performance %>% select(1:4)

# # A tibble: 2 x 4
# expression min median `itr/sec`
# <bch:expr> <bch:tm> <bch:tm> <dbl>
# 1 my_fun(df) 2.6ms 2.7ms 364.
# 2 df_fun(df) 6.01ms 6.39ms 152.

