My Group by Doesn't Appear to Be Working in Disk Frames

Author of {disk.frame} here.

The issue is that, with data.table syntax, {disk.frame} currently does the group-by only within each chunk. It does not group globally the way the dplyr syntax does.

With chunk-wise grouping you would have to summarise the per-chunk results a second time to get what you want, so I suggest sticking with the dplyr syntax for now.
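
To illustrate why a chunk-wise group-by needs a second summarise step, here is a minimal sketch in plain Python. It is not the {disk.frame} API: the `chunks` data and the `chunk_sum` helper are hypothetical stand-ins for the chunks of a disk.frame and a per-chunk aggregation.

```python
from collections import Counter

# Hypothetical data: (key, value) rows split across two chunks,
# standing in for the chunks of a disk.frame.
chunks = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

def chunk_sum(rows):
    """Stage one: sum the values per key within a single chunk."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return totals

# Stage one leaves one partial result per chunk, so key "a" appears
# once in every chunk that contains it.
per_chunk = [chunk_sum(rows) for rows in chunks]

# Stage two: summarise the partial results again to get global totals.
global_totals = Counter()
for partial in per_chunk:
    global_totals.update(partial)

print(per_chunk)
print(dict(global_totals))
```

Stopping after stage one gives one row per key per chunk, which is why the group-by looks like it is not working.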

As @Waldi pointed out, {disk.frame}'s dplyr syntax works fine. Support for data.table syntax is currently lacking, so for now you can only achieve what you want with dplyr syntax.

{disk.frame} needs to implement https://github.com/xiaodaigh/disk.frame/issues/239 before it will work for data.table.

Please DM me if anyone/organization would like to fund the development of this feature.

Do I have to use collect with disk frames?

You should always use srckeep to load only those columns you need into memory.

my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>% # no need if you use srckeep
  chunk_group_by(key_a, key_b) %>%
  # stage one
  chunk_summarize(count = n_distinct(id)) %>%
  collect %>%
  group_by(key_a, key_b) %>%
  # stage two
  mutate(count_summed = sum(count)) %>%
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%
  ungroup() %>%
  mutate(percent_of_total = count_summed / count_all)

collect will only bring the results of computing chunk_group_by and chunk_summarize into RAM. It shouldn't crash your machine.

You must use collect, just as you would in other systems such as Spark.

But if you are computing n_distinct, that can be done in one stage anyway:

my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>%
  group_by(key_a, key_b) %>%
  # one stage only
  summarize(count = n_distinct(id)) %>%
  collect

If you are really concerned about RAM usage, you can reduce the number of workers to 1:

setup_disk.frame(workers=1)
my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>%
  group_by(key_a, key_b) %>%
  # one stage only
  summarize(count = n_distinct(id)) %>%
  collect

# set the workers up again with the default settings
setup_disk.frame()

Is n_distinct an exact calculation with disk frames?

The implementation of n_distinct can be found at https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R:

#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if(na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}

#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}

So it looks to be an exact calculation, as I intended. The logic is simple: it computes the unique values within each chunk, and then runs n_distinct on the combined result of all chunks once collected.
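
The same two-step logic can be sketched in plain Python to show why it is exact. This is only an illustration under assumptions, not the {disk.frame} code: `chunk_unique` and `collected_n_distinct` are hypothetical names, the data is made up, and `None` stands in for `NA`.

```python
# Hypothetical data: one column split across three chunks; None plays the role of NA.
chunks = [[1, 2, 2, None], [2, 3, None], [3, 4]]

def chunk_unique(values, na_rm=False):
    """Per-chunk step: keep only the unique values of one chunk."""
    seen = set(values)
    if na_rm:
        seen.discard(None)
    return seen

def collected_n_distinct(list_of_sets):
    """Collected step: count distinct values over all chunk results combined."""
    return len(set().union(*list_of_sets))

exact = collected_n_distinct([chunk_unique(c) for c in chunks])
exact_na_rm = collected_n_distinct([chunk_unique(c, na_rm=True) for c in chunks])

# For contrast: adding up per-chunk distinct counts double-counts values
# that occur in more than one chunk.
naive_sum = sum(len(chunk_unique(c)) for c in chunks)

print(exact, exact_na_rm, naive_sum)
```

Because the unique values themselves are carried to the collected step, rather than their counts, a value that appears in several chunks is only counted once.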

But I can't rule out a bug elsewhere.

Do you have test cases showing that it's not exact? Perhaps you could contribute a PR with a test?

How does srckeep affect the underlying disk frame?

When you apply an operation, it doesn't change the underlying disk.frame at all!

srckeep only affects which columns get used: it loads only the columns named in srckeep into memory during processing. Again, it doesn't affect the underlying data at all.

The exception is when you call write_disk.frame(some_other_diskf, "to/location_of_disk.frame.df", overwrite=TRUE), which will overwrite the old disk.frame.

The disk.frame is always on disk. You can see where it is with attr(diskf, "path").

Pandas - GroupBy and then Merge on original table

By default, groupby output has the grouping columns as indices, not columns, which is why the merge is failing.

There are a couple of different ways to handle it. Probably the easiest is to use the as_index parameter when you define the groupby object.

po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)

Then, your merge should work as expected.

In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50
1  123    GR     236   40000  1.805222e+31         350  30000        300
2  123    GU     443   12000  8.765549e+87         250  10000        100
3  123    GU     443   12000  8.765549e+87         250   2000        150

   SC_Po  SI_Po
0     23     40
1    213    140
2    230    400
3    213    140
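
As a small self-contained sketch of the as_index behaviour, the example below uses made-up data. Only the column names EID, PCODE and PVALUE come from the answer above; the values, the sum aggregation and the `_total` suffix are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data reusing the EID, PCODE and PVALUE column names.
df = pd.DataFrame({
    "EID": [123, 123, 123],
    "PCODE": ["GR", "GR", "GU"],
    "PVALUE": [50, 300, 100],
})

# Default: the grouping keys become a MultiIndex on the result.
indexed = df.groupby(["EID", "PCODE"]).sum()

# as_index=False: the grouping keys stay as ordinary columns.
flat = df.groupby(["EID", "PCODE"], as_index=False).sum()

# Merge the aggregate back onto the original rows via the key columns.
merged = pd.merge(df, flat, on=["EID", "PCODE"], how="inner",
                  suffixes=("", "_total"))

print(list(indexed.index.names))
print(list(flat.columns))
print(merged)
```

An alternative with the same effect is to call reset_index() on the default grouped result before merging.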

Why do I get an Object not found error when using group_by and summarise() in r?

From @mouli3c3 on Twitter:

I know what caused the problem. Can't explain clearly why though.
library(operators) is somehow masking/changing the original behaviour
of %>%. Adding library(magrittr) below library(operators) solved the
problem. Let me know if it works.

It worked! :)

Adding missing grouping variables message in dplyr in R

For consistency's sake, grouping variables that were defined earlier are always kept, so they are added back when select(value) is executed. Calling ungroup() first should resolve it:

qu25 <- mydata %>%
  group_by(month, day, station_number) %>%
  arrange(desc(value)) %>%
  slice(2) %>%
  ungroup() %>%
  select(value)

This gives the requested result without the message:

> mydata %>%
+   group_by(month, day, station_number) %>%
+   arrange(desc(value)) %>%
+   slice(2) %>%
+   ungroup() %>%
+   select(value)
# A tibble: 1 x 1
  value
  <dbl>
1   113

