Efficient DataTable Group By

C# DataTable: More efficient Group By and Sum?

// your method
public void YourMethod()
{
    Dictionary<int, int> result = new Dictionary<int, int>();

    // iterate over the row count of the larger table
    int length = 0;

    if (dt1.Rows.Count > dt2.Rows.Count)
        length = dt1.Rows.Count;
    else
        length = dt2.Rows.Count;

    for (int i = 0; i < length; i++)
    {
        AddRowValue(dt1, result, i);
        AddRowValue(dt2, result, i);
    }
}

public void AddRowValue(DataTable tbl, Dictionary<int, int> dic, int index)
{
    // skip the shorter table once its rows are exhausted
    if (index >= tbl.Rows.Count)
        return;

    DataRow row = tbl.Rows[index];

    int idValue = Convert.ToInt32(row["ID"]);
    int quantityValue = Convert.ToInt32(row["Quantity"]);

    // accumulate the quantity per ID
    if (dic.ContainsKey(idValue))
        dic[idValue] = dic[idValue] + quantityValue;
    else
        dic.Add(idValue, quantityValue);
}

You need something like this: use a dictionary, and at the end the result (the summed Quantity per ID) will be stored in it.

Efficient way to group data.table results by year

When you are doing aggregate operations (grouping) with data.tables, especially for large data sets, you should set the field you are grouping by as a key (using setkeyv(DT, "your_key_field"), etc...). Also, I can't speak definitively on the topic, but generally I think you will get better performance from using native data.table:: functions / operations within your data.table object than you would when using other packages' functions, like plyr::count for example.

Below, I made a few data.table objects - the first is identical to your example; the second adds a column Year (instead of calculating format(Date,"%Y") at the time of function execution), but sets Date as the key; and the third is the same as the second, except that it uses Year as the key. I also made a few functions (for benchmarking convenience) that do the grouping in different ways.

library(data.table)
library(plyr) # for 'count' function
library(microbenchmark)
##
dates <- seq.Date(
  from=as.Date("2000-01-01"),
  to=as.Date("2012-12-31"),
  by="day")
##
set.seed(123)
sampleDate <- sample(
  dates,
  1e06,
  replace=TRUE)
##
DT.dt <- data.table(
  Date=sampleDate,
  incident=1)
##
DT.dt2 <- copy(DT.dt)
DT.dt2[,Year:=format(Date,"%Y")]
setkeyv(DT.dt2,"Date")
##
DT.dt3 <- copy(DT.dt2)
setkeyv(DT.dt3,"Year")
##
> head(DT.dt,3)
Date incident
1: 2003-09-27 1
2: 2010-04-01 1
3: 2005-04-26 1
> head(DT.dt2,3)
Date incident Year
1: 2000-01-01 1 2000
2: 2000-01-01 1 2000
3: 2000-01-01 1 2000
> head(DT.dt3,3)
Date incident Year
1: 2000-01-01 1 2000
2: 2000-01-01 1 2000
3: 2000-01-01 1 2000

## your original method
f1 <- function(dt)
{
  dt[,count(format(Date,"%Y"))]
}
## your method - using 'Year' column
f1.2 <- function(dt)
{
  dt[,count(Year)]
}
## use 'Date' column; '.N' and
## 'by=' instead of 'count'
f2 <- function(dt)
{
  dt[,.N,by=format(Date,"%Y")]
}
## use 'Year' and '.N','by='
f3 <- function(dt)
{
  dt[,.N,by=Year]
}
##
Res <- microbenchmark(
  f1(DT.dt),
  f1.2(DT.dt2),
  f1.2(DT.dt3),
  f2(DT.dt2),
  f3(DT.dt3))
##
> Res
Unit: milliseconds
expr min lq median uq max neval
f1(DT.dt) 478.941767 515.144253 557.428159 585.579862 706.8724 100
f1.2(DT.dt2) 98.722062 115.588034 126.332104 137.792116 223.4967 100
f1.2(DT.dt3) 97.475673 118.134788 125.836817 136.136156 238.2697 100
f2(DT.dt2) 352.767219 373.337958 387.759996 429.301164 542.1674 100
f3(DT.dt3) 7.912803 8.441159 8.736887 9.685267 76.9629 100

Observations:

  1. Grouping by the precalculated field Year instead of calculating
    format(Date,"%Y") at execution time was a definite improvement -
    for both the count and the .N approaches. You can see this by
    comparing the f1() and f2() times to the f1.2() times.

  2. The count approach seemed to be a little slower than the .N & 'by=' approach (f1() compared to f2()).

  3. The best approach by far was to use the precalculated field Year and the native data.table grouping .N & by=; f3() was considerably faster than the other four timings.

There are some pretty experienced data.table users on SO, certainly more so than myself, so there may be an even faster way to do this. All else aside, though, it's definitely a good idea to set a key on your data.table; and it certainly seems like you would be much better off precalculating a field like Year than doing so "on the fly"; you can always delete it afterwards if you don't need it by using DT.dt[,Year:=NULL].
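
As a rough sketch of that workflow (using the DT.dt object from above; the result name incidentsPerYear is just for illustration):

## precalculate the grouping field once, and key on it
DT.dt[, Year := format(Date, "%Y")]
setkeyv(DT.dt, "Year")

## aggregate with native data.table grouping
incidentsPerYear <- DT.dt[, .N, by = Year]

## drop the helper column again if you no longer need it
DT.dt[, Year := NULL]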

Also, you said you are trying to count the number of incidents per year - and since your example data had incident = 1 for all rows, counting was the same as summing. But assuming your real data has different values of incident, you could do something like this:

> DT.dt3[,list(Incidents=sum(incident)),by=Year]
Year Incidents
1: 2000 77214
2: 2001 77385
3: 2002 77080
4: 2003 76609
5: 2004 77197
6: 2005 76994
7: 2006 76560
8: 2007 76904
9: 2008 76786
10: 2009 76765
11: 2010 76675
12: 2011 76868
13: 2012 76963

(where I called setkeyv(DT.dt3,cols="Year") above).

Efficient DataTable Group By

You may use LINQ:

// requires using System.Linq; AsEnumerable()/Field<T>() come from
// System.Data.DataSetExtensions
var result = from row in dt.AsEnumerable()
             group row by row.Field<int>("TeamID") into grp
             select new
             {
                 TeamID = grp.Key,
                 MemberCount = grp.Count()
             };

foreach (var t in result)
    Console.WriteLine(t.TeamID + " " + t.MemberCount);

Performance of Grouped Operations with data.table

I think your suspicion is correct. As to why, we could make an educated guess.

EDIT: the information below ignores what data.table does with GForce optimizations; the functions GForce supports probably avoid copies of the data similarly to your C++ code, but, as Frank mentioned, := didn't try to detect GForce-optimizable expressions before 1.14.3, and addrs below is certainly not optimized by GForce.
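
(As a side note, not from the original answer: one way to check whether GForce picks up a particular grouped expression is the verbose argument of [.data.table, e.g. with the DT defined further below.)

## verbose = TRUE makes data.table print its query-optimisation details,
## including whether GForce was applied to j; for supported aggregations
## such as sum() or mean() it typically reports a line like
## "GForce optimized j to ..."
DT[, sum(z), by = w, verbose = TRUE]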

The C++ code you wrote doesn't modify anything from its input data; it only has to allocate out. As you correctly said, data.table aims to be more flexible, and must support any valid R code the users provide. That makes me think that, for grouped operations, it would have to allocate new R vectors for each subset of z according to the group indices, effectively copying some of the input multiple times.

I wanted to try to validate my assumption, so I wrote some C++ code to look at memory addresses:

library(data.table)
library(Rcpp)  # for cppFunction()

set.seed(31L)
DT <- data.table(w = sample(1:3, size = 20, replace = TRUE),
                 x = sample(1:3, size = 20, replace = TRUE),
                 y = sample(1:3, size = 20, replace = TRUE),
                 z = runif(n = 20, min = 0, max = 1))

setkey(DT, w, x, y)

cppFunction(plugins = "cpp11", includes = "#include <sstream>", '
  StringVector addrs(NumericVector z, IntegerVector indices, IntegerVector group_ids) {
    StringVector ans(indices.size());
    for (auto i = 0; i < indices.size(); ++i) {
      std::ostringstream oss;
      oss << "Group" << group_ids[i] << " " << &z[indices[i]];
      ans[i] = oss.str();
    }
    return ans;
  }
')

idx <- DT[, .(indices = .I[1L] - 1L, group_ids = .GRP), keyby = list(w, x, y)]

DT[, list(addrs(z, idx$indices, idx$group_ids))]
# V1
# 1: Group1 0x55b7733361f8
# 2: Group2 0x55b773336200
# 3: Group3 0x55b773336210
# 4: Group4 0x55b773336220
# 5: Group5 0x55b773336228
# 6: Group6 0x55b773336230
# 7: Group7 0x55b773336238
# 8: Group8 0x55b773336248
# 9: Group9 0x55b773336250
# 10: Group10 0x55b773336258
# 11: Group11 0x55b773336260
# 12: Group12 0x55b773336270
# 13: Group13 0x55b773336280
# 14: Group14 0x55b773336288
# 15: Group15 0x55b773336290

As expected here, if we look at z as a whole, no copy takes place and the addresses of the different elements are close to each other, for example 0x55b773336200 = 0x55b7733361f8 + 0x8. You can execute the last line from the previous code multiple times and it will always show the same addresses.

What I partially didn't expect is this:

DT[, list(addrs(z, 0L, .GRP)), keyby = list(w, x, y)]
# w x y V1
# 1: 1 1 2 Group1 0x55b771b03440
# 2: 1 1 3 Group2 0x55b771b03440
# 3: 1 2 1 Group3 0x55b771b03440
# 4: 1 2 2 Group4 0x55b771b03440
# 5: 1 3 2 Group5 0x55b771b03440
# 6: 1 3 3 Group6 0x55b771b03440
# 7: 2 1 1 Group7 0x55b771b03440
# 8: 2 2 1 Group8 0x55b771b03440
# 9: 2 2 2 Group9 0x55b771b03440
# 10: 2 2 3 Group10 0x55b771b03440
# 11: 2 3 1 Group11 0x55b771b03440
# 12: 3 2 1 Group12 0x55b771b03440
# 13: 3 2 2 Group13 0x55b771b03440
# 14: 3 2 3 Group14 0x55b771b03440
# 15: 3 3 3 Group15 0x55b771b03440

On the one hand, the address in memory did change, so something was copied, and if you run this multiple times you will see different addresses each time. However, it seems like data.table somehow reuses a buffer with the same address, maybe allocating one array with the length of the biggest group-subset and copying the different group values to it? I wonder how they manage that. Or maybe my code is wrong ¯\_(ツ)_/¯
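
A pure-R cross-check of the same observation (not part of the original answer, and assuming your data.table version exports address()):

## inside a grouped j, z is the per-group subset, so address(z) reports the
## address of that group's buffer; if data.table reuses one buffer, the same
## address should appear for every group, mirroring the Rcpp output above
DT[, .(addr = address(z)), keyby = list(w, x, y)]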

EDIT: I added the following as the first line of addrs (double escape because of R parsing):

Rcout << "input length = " << z.size() << "\\n";

and if I run the last DT[...] code from above, it does print different lengths even though the address is the same.

Efficient way of adding row with difference from specific group in data table

You could do

require(data.table)
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3))
dt1[,c:=sum(counts[cat=="A"])][,.(ind=c(ind,0), counts=c(counts,c[.N]-sum(counts))),cat][]
# cat ind counts
# 1: A 1 10
# 2: A 2 9
# 3: A 3 8
# 4: A 0 0
# 5: B 4 7
# 6: B 5 6
# 7: B 0 14
# 8: C 6 5
# 9: C 7 4
# 10: C 0 18
# 11: D 8 3
# 12: D 0 24
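
The chained expression above is fairly dense; a rough step-by-step equivalent (using a hypothetical helper variable total_A instead of the temporary column c) looks like this:

## 1. total counts of the reference group "A" (27 here)
total_A <- dt1[cat == "A", sum(counts)]

## 2. per cat, append a row with ind = 0 and
##    counts = total_A minus that group's own total
dt1[, .(ind = c(ind, 0),
        counts = c(counts, total_A - sum(counts))), by = cat]

Unlike the one-liner above, this version does not leave the helper column c behind on dt1.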

What is the most efficient method for finding row indices by group in a data.table in R?

Using .I, this option returns a data.table with two columns. The first column is the unique values in a, and the second is a list of indices where each value appears in k. The form is different than the OP's m, but the information is all there and just as easily accessible.

k[, .(idx = .(.I)), a]
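
For example (a hypothetical lookup, reusing the k defined in the benchmark below), the indices for one level of a can be pulled straight out of the list column:

m2 <- k[, .(idx = .(.I)), a]

## all row indices of k where a == "5"
m2[a == "5", idx[[1]]]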

Benchmarking:

library(data.table)

k <- data.table(a = sample(factor(seq_len(200)), size = 1e6, replace = TRUE))

microbenchmark::microbenchmark(
  A = {
    u <- unique(k$a)
    m <- lapply(u, function(x) k[a == x, which = TRUE])
  },
  B = {
    m2 <- k[, .(idx = .(.I)), a]
  },
  times = 100
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> A 282.0331 309.2662 335.30146 325.3355 350.51080 525.7929 100
#> B 9.7870 10.3598 13.04379 10.8292 12.73785 65.4864 100

all.equal(m, m2$idx)
#> [1] TRUE

all.equal(u, m2$a)
#> [1] TRUE

Efficiently Repeating Observations by Group

A data.table merge won't give you the same ordering, but you aren't supposed to rely on row order in data.tables anyway:
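
(DT and rep here come from the original question; a minimal reconstruction consistent with the output below might be:)

library(data.table)

## hypothetical reconstruction of the question's objects; the number of times
## each x value appears in rep is how many copies the matching DT rows get
DT  <- data.table(x = c("A", "A", "B", "B", "C", "C"), y = 1:6)
rep <- c("A", "A", "A", "B", "B", "C")   # note: masks base::rep in this sketch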

merge(DT, data.frame(x=rep), by="x")

x y
1: A 1
2: A 1
3: A 1
4: A 2
5: A 2
6: A 2
7: B 3
8: B 3
9: B 4
10: B 4
11: C 5
12: C 6

Efficient filtering through multiple columns by group

Here is an alternative tidyverse approach.

library(dplyr)

my_fun <- function(.data) {
  .data %>%
    group_by(id) %>%
    filter(!grepl("X", paste(var1, var2, var3, collapse = ""))) %>%
    ungroup()
}

my_fun(df)

# # A tibble: 2 x 4
# id var1 var2 var3
# <int> <chr> <chr> <chr>
# 1 3 Z1 Z1 Z1
# 2 3 Z2 Z2 Z2

## df_fun additionally uses purrr::reduce() and dplyr::across()
library(purrr)

df_fun <- function(.data) {
  .data %>%
    group_by(id) %>%
    filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`))) %>%
    ungroup()
}

performance <- bench::mark(
  my_fun(df),
  df_fun(df)
)

performance %>% select(1:4)

# # A tibble: 2 x 4
# expression min median `itr/sec`
# <bch:expr> <bch:tm> <bch:tm> <dbl>
# 1 my_fun(df) 2.6ms 2.7ms 364.
# 2 df_fun(df) 6.01ms 6.39ms 152.

