c# DataTable: More efficient Group By and Sum?
//your method
public void YourMethod()
{
Dictionary<int, int> result = new Dictionary<int, int>();
int length = 0;
if(dt1.Rows.Count > dt2.Rows.Count)
length = dt1.Rows.Count
else
length = dt2.Rows.Count
for(int i=0; i < length - 1; i++)
{
AddRowValue(dt1, result, i);
AddRowValue(dt2, result, i);
}
}
public AddRowValue(DataTable tbl, Dictionary<int, int> dic, int index)
{
if( index > tbl.Rows.Count)
return;
DataRow row = tbl.Rows[index];
int idValue = Convert.ToInt32(row["ID"]);
int quantityValue = Convert.ToInt32(row["Quantity"]);
if(dic.Keys.Contains(idValue)
dic[idValue] = dic[idValue] + quantityValue;
else
dic.Add(idValue, quantityValue);
}
You need something like this, you can use dictionary at the end the result will be stored in the dictionary.
Efficient way to group data.table results by year
When you are doing aggregate operations (grouping) with data.tables
, especially for large data sets, you should set the field you are grouping by as a key
(using setkeyv(DT, "your_key_field")
, etc...). Also, I can't speak definitively on the topic, but generally I think you will get better performance from using native data.table::
functions / operations within your data.table
object than you would when using other packages' functions, like plyr::count
for example. Below, I made a few data.table
objects - the first is identical to your example; the second adds a column Year
(instead of calculating format(Date,"%Y")
at the time of function execution), but sets Date
as the key
; and the third is the same as the second, except that it uses Year
as the key
. I also made a few functions (for benchmarking convenience) that do the grouping in different ways.
library(data.table)
library(plyr) # for 'count' function
library(microbenchmark)
##
dates <- seq.Date(
from=as.Date("2000-01-01"),
to=as.Date("2012-12-31"),
by="day")
##
set.seed(123)
sampleDate <- sample(
dates,
1e06,
replace=TRUE)
##
DT.dt <- data.table(
Date=sampleDate,
incident=1)
##
DT.dt2 <- copy(DT.dt)
DT.dt2[,Year:=format(Date,"%Y")]
setkeyv(DT.dt2,"Date")
##
DT.dt3 <- copy(DT.dt2)
setkeyv(DT.dt3,"Year")
##
> head(DT.dt,3)
Date incident
1: 2003-09-27 1
2: 2010-04-01 1
3: 2005-04-26 1
> head(DT.dt2,3)
Date incident Year
1: 2000-01-01 1 2000
2: 2000-01-01 1 2000
3: 2000-01-01 1 2000
> head(DT.dt3,3)
Date incident Year
1: 2000-01-01 1 2000
2: 2000-01-01 1 2000
3: 2000-01-01 1 2000
## your original method
f1 <- function(dt)
{
dt[,count(format(Date,"%Y"))]
}
## your method - using 'Year' column
f1.2 <- function(dt)
{
dt[,count(Year)]
}
## use 'Date' column; '.N' and
## 'by=' instead of 'count'
f2 <- function(dt)
{
dt[,.N,by=format(Date,"%Y")]
}
## use 'Year' and '.N','by='
f3 <- function(dt)
{
dt[,.N,by=Year]
}
##
Res <- microbenchmark(
f1(DT.dt),
f1.2(DT.dt2),
f1.2(DT.dt3),
f2(DT.dt2),
f3(DT.dt3))
##
> Res
Unit: milliseconds
expr min lq median uq max neval
f1(DT.dt) 478.941767 515.144253 557.428159 585.579862 706.8724 100
f1.2(DT.dt2) 98.722062 115.588034 126.332104 137.792116 223.4967 100
f1.2(DT.dt3) 97.475673 118.134788 125.836817 136.136156 238.2697 100
f2(DT.dt2) 352.767219 373.337958 387.759996 429.301164 542.1674 100
f3(DT.dt3) 7.912803 8.441159 8.736887 9.685267 76.9629 100
Observations:
Grouping by the precalculated field
Year
instead of calculatingformat(Date,"%Y")
at execution time was a definite improvement -
for both of thecount
and.N
approaches. You can see this by
comparing thef1()
andf2()
times to thef1.2()
times.The
count
approach seemed to be a little slower than the.N
& 'by=' approach (f1()
compared tof2()
.- The best approach by far was to use the precalculated field
Year
and the nativedata.table
grouping.N
&by=
;f3()
was considerably faster than the other four timings.
There are some pretty experience data.table
users on SO, certainly more so than myself, so there may be an even faster way to do this. All else aside, though, it's definitely a good idea to set a key
on your data.table
; and it certainly seems like you would be much better off precalculating a field like Year
than doing so "on the fly"; you can always delete it afterwards if you don't need it by using DT.dt[,Year:=NULL]
.
Also, you said you are trying to count the number of incident
s per year - and since your example data had incident = 1
for all rows, counting was the same as summing. But assuming your real data has different values of incident
, you could so something like this:
> DT.dt3[,list(Incidents=sum(incident)),by=Year]
Year Incidents
1: 2000 77214
2: 2001 77385
3: 2002 77080
4: 2003 76609
5: 2004 77197
6: 2005 76994
7: 2006 76560
8: 2007 76904
9: 2008 76786
10: 2009 76765
11: 2010 76675
12: 2011 76868
13: 2012 76963
(where I called setkeyv(DT.dt3,cols="Year")
above).
Efficient DataTable Group By
You may use Linq.
var result = from row in dt.AsEnumerable()
group row by row.Field<int>("TeamID") into grp
select new
{
TeamID = grp.Key,
MemberCount = grp.Count()
};
foreach (var t in result)
Console.WriteLine(t.TeamID + " " + t.MemberCount);
Performance of Grouped Operations with data.table
I think your suspicion is correct.
As to why, we could make an educated guess.
EDIT: the information below ignores what data.table
does with GForce optimizations,
the functions GForce supports probably avoid copies of data similar to your C++ code but,
as Frank mentioned,:=
didn't try to detect GForce-optimizable expressions before 1.14.3,
and addrs
below is certainly not optimized by GForce.
The C++ code you wrote doesn't modify anything from its input data,
it only has to allocate out
.
As you correctly said,data.table
aims to be more flexible,
and must support any valid R code the users provide.
That makes me think that,
for grouped operations,
it would have to allocate new R vectors for each subset of z
according to the group indices,
effectively copying some of the input multiple times.
I wanted to try to validate my assumption,
so I wrote some C++ code to look at memory addresses:
set.seed(31L)
DT <- data.table(w = sample(1:3, size = 20, replace = TRUE),
x = sample(1:3, size = 20, replace = TRUE),
y = sample(1:3, size = 20, replace = TRUE),
z = runif(n = 20, min = 0, max = 1))
setkey(DT, w, x, y)
cppFunction(plugins = "cpp11", includes = "#include <sstream>", '
StringVector addrs(NumericVector z, IntegerVector indices, IntegerVector group_ids) {
StringVector ans(indices.size());
for (auto i = 0; i < indices.size(); ++i) {
std::ostringstream oss;
oss << "Group" << group_ids[i] << " " << &z[indices[i]];
ans[i] = oss.str();
}
return ans;
}
')
idx <- DT[, .(indices = .I[1L] - 1L, group_ids = .GRP), keyby = list(w, x, y)]
DT[, list(addrs(z, idx$indices, idx$group_ids))]
# V1
# 1: Group1 0x55b7733361f8
# 2: Group2 0x55b773336200
# 3: Group3 0x55b773336210
# 4: Group4 0x55b773336220
# 5: Group5 0x55b773336228
# 6: Group6 0x55b773336230
# 7: Group7 0x55b773336238
# 8: Group8 0x55b773336248
# 9: Group9 0x55b773336250
# 10: Group10 0x55b773336258
# 11: Group11 0x55b773336260
# 12: Group12 0x55b773336270
# 13: Group13 0x55b773336280
# 14: Group14 0x55b773336288
# 15: Group15 0x55b773336290
As expected here,
if we look at z
as a whole,
no copy takes place and the addresses of the different elements are close to each other,
for example 0x55b773336200 = 0x55b7733361f8 + 0x8
.
You can execute the last line from the previous code multiple times and it will always show the same addresses.
What I partially didn't expect is this:
DT[, list(addrs(z, 0L, .GRP)), keyby = list(w, x, y)]
# w x y V1
# 1: 1 1 2 Group1 0x55b771b03440
# 2: 1 1 3 Group2 0x55b771b03440
# 3: 1 2 1 Group3 0x55b771b03440
# 4: 1 2 2 Group4 0x55b771b03440
# 5: 1 3 2 Group5 0x55b771b03440
# 6: 1 3 3 Group6 0x55b771b03440
# 7: 2 1 1 Group7 0x55b771b03440
# 8: 2 2 1 Group8 0x55b771b03440
# 9: 2 2 2 Group9 0x55b771b03440
# 10: 2 2 3 Group10 0x55b771b03440
# 11: 2 3 1 Group11 0x55b771b03440
# 12: 3 2 1 Group12 0x55b771b03440
# 13: 3 2 2 Group13 0x55b771b03440
# 14: 3 2 3 Group14 0x55b771b03440
# 15: 3 3 3 Group15 0x55b771b03440
On the one hand,
the address in memory did change,
so something was copied,
and if you run this multiple times you will see different addresses each time.
However, it seems like data.table
somehow reuses a buffer with the same address,
maybe allocating one array with the length of the biggest group-subset and copying the different group values to it?
I wonder how they manage that.
Or maybe my code is wrong ¯\_(ツ)_/¯
EDIT: I added the following as the first line of addrs
(double escape because of R parsing)
Rcout << "input length = " << z.size() << "\\n";
and if I run the last DT[...]
code from above,
it does print different lengths even though the address is the same.
Efficient way of adding row with difference from specific group in data table
You could do
require(data.table)
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3))
dt1[,c:=sum(counts[cat=="A"])][,.(ind=c(ind,0), counts=c(counts,c[.N]-sum(counts))),cat][]
# cat ind counts
# 1: A 1 10
# 2: A 2 9
# 3: A 3 8
# 4: A 0 0
# 5: B 4 7
# 6: B 5 6
# 7: B 0 14
# 8: C 6 5
# 9: C 7 4
# 10: C 0 18
# 11: D 8 3
# 12: D 0 24
What is the most efficient method for finding row indices by group in a data.table in R?
Using .I
, this option returns a data.table
with two columns. The first column is the unique values in a
, and the second is a list of indices where each value appears in k
. The form is different than the OP's m
, but the information is all there and just as easily accessible.
k[, .(idx = .(.I)), a]
Benchmarking:
library(data.table)
k <- data.table(a = sample(factor(seq_len(200)), size = 1e6, replace = TRUE))
microbenchmark::microbenchmark(
A = {
u <- unique(k$a)
m <- lapply(u, function(x) k[a == x, which = TRUE])
},
B = {
m2 <- k[, .(idx = .(.I)), a]
},
times = 100
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> A 282.0331 309.2662 335.30146 325.3355 350.51080 525.7929 100
#> B 9.7870 10.3598 13.04379 10.8292 12.73785 65.4864 100
all.equal(m, m2$idx)
#> [1] TRUE
all.equal(u, m2$a)
#> [1] TRUE
Efficiently Repeating Observations by Group
A data.table merge won't give you the same ordering but you aren't supposed to rely on ordering in datatables, anyway:
merge(DT, data.frame(x=rep), by="x")
x y
1: A 1
2: A 1
3: A 1
4: A 2
5: A 2
6: A 2
7: B 3
8: B 3
9: B 4
10: B 4
11: C 5
12: C 6
Efficient filtering through multiple columns by group
Here is an alternative tidyverse
approach.
my_fun <- function(.data) {
.data %>%
group_by(id) %>%
filter(!grepl("X", paste(var1, var2, var3, collapse = ""))) %>%
ungroup()
}
my_fun(df)
# # A tibble: 2 x 4
# id var1 var2 var3
# <int> <chr> <chr> <chr>
# 1 3 Z1 Z1 Z1
# 2 3 Z2 Z2 Z2
df_fun <- function(.data) {
.data %>%
group_by(id) %>%
filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`))) %>%
ungroup()
}
performance <- bench::mark(
my_fun(df),
df_fun(df)
)
performance %>% select(1:4)
# # A tibble: 2 x 4
# expression min median `itr/sec`
# <bch:expr> <bch:tm> <bch:tm> <dbl>
# 1 my_fun(df) 2.6ms 2.7ms 364.
# 2 df_fun(df) 6.01ms 6.39ms 152.
Related Topics
Finding All References to a Method with Roslyn
Understanding .Asenumerable() in Linq to SQL
Play and Wait for Animation/Animator to Finish Playing
Could Not Load File or Assembly ... the Parameter Is Incorrect
How to Get Values from Igrouping
Change Values in JSON File (Writing Files)
Fast Way to Convert a Two Dimensional Array to a List ( One Dimensional )
Is There Anything Like Asynchronous Blockingcollection<T>
How to Restart a Wpf Application
How to Change Symbol for Decimal Point in Double.Tostring()
Send HTML Email via C# with Smtpclient