My Group by Doesn't Appear to Be Working in Disk Frames


Author of {disk.frame} here.

The issue is that, currently, {disk.frame} performs the group-by only within each chunk. It does not group globally the way the dplyr syntax does.

You therefore have to summarise the per-chunk results again to get what you want, so I suggest sticking with the dplyr syntax for now.
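
To see why a second summarise is needed, here is a minimal, language-agnostic sketch in plain Python. It is not {disk.frame} code: the chunk contents and the helper name sum_by_key are made up for illustration. Grouping inside each chunk yields one partial result per chunk, so a key that spans several chunks shows up more than once until the partial results are aggregated again.

```python
from collections import defaultdict

# Hypothetical data split into two chunks of (key, value) rows.
chunks = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 10), ("b", 20)],
]

def sum_by_key(rows):
    """Group rows by key and sum their values."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

# Stage one: group within each chunk only (what a chunk-wise group-by gives you).
per_chunk = [sum_by_key(chunk) for chunk in chunks]

# Stage two: treat the partial results as rows and aggregate them again.
partial_rows = [item for partial in per_chunk for item in partial.items()]
global_totals = sum_by_key(partial_rows)

print(per_chunk)      # key "a" appears once per chunk
print(global_totals)  # one total per key
```

Sums can be re-aggregated this way because a sum of partial sums equals the overall sum; other statistics may need a different second stage.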

As @Waldi pointed out, {disk.frame}'s dplyr syntax works fine. Support for data.table syntax is currently lacking, so for now you can only achieve what you want with the dplyr syntax.

{disk.frame} needs to implement https://github.com/xiaodaigh/disk.frame/issues/239 before it will work for data.table.

Please DM me if anyone/organization would like to fund the development of this feature.

Do I have to use collect with disk frames?

You should always use srckeep to load only those columns you need into memory.

my_df %>% 
        srckeep(c("key_a", "key_b", "id")) %>%
        # select(key_a, key_b, id) %>% # no need if you use srckeep
        chunk_group_by(key_a, key_b) %>% 
        # stage one
        chunk_summarize(count = n_distinct(id)) %>% 
        collect %>% 
        group_by(key_a, key_b) %>% 
        # stage two
        mutate(count_summed = sum(count)) %>%
        group_by(key_a) %>% 
        mutate(count_all = sum(count)) %>% 
        ungroup() %>% 
        mutate(percent_of_total = count_summed / count_all)

collect will only bring the results of computing chunk_group_by and chunk_summarize into RAM. It shouldn't crash your machine.

You must call collect, just as you would in other systems such as Spark.

But if you are computing n_distinct, that can be done in one stage anyway:

my_df %>%
        srckeep(c("key_a", "key_b", "id")) %>%
        #select(key_a, key_b, id) %>% 
        group_by(key_a, key_b) %>% 
        # stage one
        summarize(count = n_distinct(id)) %>% 
        collect

If you are really concerned about RAM usage, you can reduce the number of workers to 1:

setup_disk.frame(workers=1)
my_df %>% 
        srckeep(c("key_a", "key_b", "id")) %>%
        #select(key_a, key_b, id) %>% 
        group_by(key_a, key_b) %>% 
        # stage one
        summarize(count = n_distinct(id)) %>% 
        collect

setup_disk.frame()

Is n_distinct an exact calculation with disk frames?

The implementation of n_distinct can be found on this page https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R

#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if(na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}

#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}

It looks to be an exact calculation, as I intended. The logic is simple: it computes the unique values within each chunk, and then calls n_distinct on the combined results of all chunks once collected.
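
The same two-step logic can be sketched in plain Python. This is only an illustration of the idea, not the {disk.frame} implementation, and the chunk contents are made up. It also shows why adding up per-chunk distinct counts would not be exact when the same value occurs in more than one chunk.

```python
# Hypothetical id values split across two chunks; 2 and 3 occur in both.
chunks = [
    [1, 2, 2, 3],
    [2, 3, 4],
]

# Step one: keep only the unique values of each chunk.
unique_per_chunk = [set(chunk) for chunk in chunks]

# Step two: count distinct values over the combined per-chunk uniques.
exact = len(set().union(*unique_per_chunk))

# For contrast: adding up per-chunk distinct counts double-counts shared values.
summed = sum(len(u) for u in unique_per_chunk)

print(exact)   # number of distinct values over all chunks
print(summed)  # larger, because 2 and 3 are counted once per chunk
```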

But I can't rule out a bug elsewhere.

Do you have test cases showing that it's not exact? Perhaps you could contribute a PR with a test?

How does srckeep affect the underlying disk frame?

When you apply an operation, it doesn't change the underlying disk.frame at all!

srckeep only affects what gets used! It loads only those columns in srckeep in memory when doing the processing. Again, it doesn't affect the underlying data at all.

The exception is if you call write_disk.frame(some_other_diskf, "to/location_of_disk.frame.df", overwrite=TRUE), which will overwrite the old disk.frame.

The disk.frame is always on disk. You can see where it is with attr(diskf, "path")
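
As a rough analogy in plain Python (not {disk.frame} code; the stored data and the helper name read_columns are made up), selecting columns at read time only limits what is held in memory. The stored data keeps every column, so a later read can still ask for the ones that were skipped.

```python
import csv
import io

# Hypothetical on-disk data, held in a string to keep the sketch self-contained.
stored = "key_a,key_b,id,payload\nx,y,1,large\nx,z,2,large\n"

def read_columns(text, keep):
    """Parse the stored data but keep only the requested columns in memory."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: row[name] for name in keep} for row in reader]

# Load only two of the four columns.
in_memory = read_columns(stored, ["key_a", "id"])

# The skipped columns are still available to a later read.
payload_only = read_columns(stored, ["payload"])

print(in_memory)
print(payload_only)
```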

Pandas - GroupBy and then Merge on original table

By default, groupby output has the grouping columns as indices, not columns, which is why the merge is failing.

There are a couple of different ways to handle it. Probably the easiest is to use the as_index parameter when you define the groupby object.

po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)

Then, your merge should work as expected.

In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]: 
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50   
1  123    GR     236   40000  1.805222e+31         350  30000        300   
2  123    GU     443   12000  8.765549e+87         250  10000        100   
3  123    GU     443   12000  8.765549e+87         250   2000        150   

   SC_Po  SI_Po  
0     23     40  
1    213    140  
2    230    400  
3    213    140

Why do I get an Object not found error when using group_by and summarise() in r?

From @mouli3c3 on twitter:

I know what caused the problem. Can't explain clearly why though.
library(operators) is somehow masking/changing the original behaviour
of %>%. Adding library(magrittr) below library(operators) solved the
problem. Let me know if it works.

It worked! :)

Adding missing grouping variables message in dplyr in R

For consistency's sake, grouping variables defined earlier are always kept, so they are added back when select(value) is executed. Calling ungroup() first should resolve it:

qu25 <- mydata %>% 
  group_by(month, day, station_number) %>%
  arrange(desc(value)) %>% 
  slice(2) %>% 
  ungroup() %>%
  select(value)

This returns the requested result without the message:

> mydata %>% 
+   group_by(month, day, station_number) %>%
+   arrange(desc(value)) %>% 
+   slice(2) %>% 
+   ungroup() %>%
+   select(value)
# A tibble: 1 x 1
  value
  <dbl>
1   113
