My group by doesn't appear to be working in disk frames
Author of {disk.frame} here.
The issue is that, with data.table syntax, {disk.frame} currently does the group-by within each chunk only. It does not group globally the way the dplyr syntax does.
So you have to summarise the results again to achieve what you want. I suggest sticking with the dplyr syntax for now.
As @Waldi pointed out, {disk.frame}'s dplyr syntax works fine. Support for data.table is currently lacking, so for now you can only achieve what you want with dplyr syntax.
{disk.frame} needs to implement https://github.com/xiaodaigh/disk.frame/issues/239 before it will work for data.table.
Please DM me if anyone/organization would like to fund the development of this feature.
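To illustrate why per-chunk results need a second summarise, here is a minimal sketch in plain Python. It is not {disk.frame} code, and the chunks and key values are made up for illustration. Each chunk is counted on its own, so a key that appears in several chunks produces several partial rows that still have to be combined.

```python
from collections import Counter

# Hypothetical data split into two chunks, standing in for disk.frame chunks.
chunks = [
    ["a", "a", "b"],
    ["a", "b", "b", "c"],
]

# Stage one: group-by count within each chunk (what a chunk-wise group-by returns).
per_chunk = [Counter(chunk) for chunk in chunks]

# Stage two: summarise the per-chunk results again to get global counts.
global_counts = Counter()
for partial in per_chunk:
    global_counts.update(partial)

print(per_chunk)
print(dict(global_counts))
```

Key "a" shows up in both partial results, which is the duplicated-group symptom you see when the second stage is skipped.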
Do I have to use collect with disk frames?
You should always use srckeep to load only those columns you need into memory:
my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>% # no need if you use srckeep
  chunk_group_by(key_a, key_b) %>%
  # stage one
  chunk_summarize(count = n_distinct(id)) %>%
  collect %>%
  group_by(key_a, key_b) %>%
  # stage two
  mutate(count_summed = sum(count)) %>%
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%
  ungroup() %>%
  mutate(percent_of_total = count_summed / count_all)
collect will only bring the results of chunk_group_by and chunk_summarize into RAM, so it shouldn't crash your machine.
You must use collect, just as in other systems such as Spark.
But if you are computing n_distinct, that can be done in one stage anyway:
my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>%
  group_by(key_a, key_b) %>%
  # stage one
  summarize(count = n_distinct(id)) %>%
  collect
If you are really concerned about RAM usage, you can reduce the number of workers to 1:
# use a single worker to limit memory usage
setup_disk.frame(workers=1)
my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>%
  group_by(key_a, key_b) %>%
  # stage one
  summarize(count = n_distinct(id)) %>%
  collect
# restore the default number of workers
setup_disk.frame()
Is n_distinct an exact calculation with disk frames?
The implementation of n_distinct can be found on this page: https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R
#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if (na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}

#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}
It looks to be an exact calculation, as I intended. The logic is simple: it computes unique within each chunk, and then n_distinct on the combined results of all chunks once collected.
But I can't rule out a bug elsewhere.
Do you have test cases showing that it's not exact? Perhaps you could contribute a PR with a test?
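The two-stage logic can be sketched in plain Python as follows. This is only an illustration of the idea, not the {disk.frame} implementation, and the id values are hypothetical. It also shows, for contrast, why summing per-chunk distinct counts is not the same thing when a value occurs in more than one chunk.

```python
from itertools import chain

# Hypothetical id values split across two chunks; id 2 appears in both.
chunks = [
    [1, 2, 2, 3],
    [2, 4, 4],
]

# Chunk stage: keep the unique values of each chunk (not their count).
unique_per_chunk = [set(chunk) for chunk in chunks]

# Collected stage: count distinct values over all chunk results combined.
exact = len(set(chain.from_iterable(unique_per_chunk)))

# For contrast: summing per-chunk distinct counts double-counts id 2.
summed = sum(len(u) for u in unique_per_chunk)

print(exact, summed)
```

Because the chunk stage passes the unique values themselves forward, the collected stage can deduplicate across chunks, which is what makes the result exact.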
How does srckeep affect the underlying disk frame?
Applying an operation doesn't change the underlying disk.frame at all!
srckeep only affects what gets used: it loads only the columns named in srckeep into memory during processing. Again, it doesn't affect the underlying data at all, unless you call write_disk.frame(some_other_diskf, "to/location_of_disk.frame.df", overwrite=TRUE), which will overwrite the old disk.frame.
The disk.frame is always on disk. You can see where it is with attr(diskf, "path").
Pandas - GroupBy and then Merge on original table
By default, groupby output has the grouping columns as indices, not columns, which is why the merge is failing.
There are a couple of different ways to handle it; probably the easiest is to pass the as_index parameter when you define the groupby object:
po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
Then, your merge should work as expected.
In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50
1  123    GR     236   40000  1.805222e+31         350  30000        300
2  123    GU     443   12000  8.765549e+87         250  10000        100
3  123    GU     443   12000  8.765549e+87         250   2000        150

   SC_Po  SI_Po
0     23     40
1    213    140
2    230    400
3    213    140
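As a small self-contained sketch of the difference as_index makes, the example below uses made-up frames. The column names follow the answer, while the values, the name po_df, and the sum aggregation are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical policy rows; values are made up for illustration.
po_df = pd.DataFrame({
    "EID": [123, 123, 123],
    "PCODE": ["GR", "GR", "GU"],
    "PVALUE": [50, 300, 100],
})
acc_df = pd.DataFrame({
    "EID": [123, 123],
    "PCODE": ["GR", "GU"],
    "SC": [236, 443],
})

# Default: the grouping keys end up in the index of the result.
indexed = po_df.groupby(["EID", "PCODE"]).sum()

# as_index=False: the grouping keys stay as ordinary columns.
flat = po_df.groupby(["EID", "PCODE"], as_index=False).sum()

# With the keys as columns, merging on them is straightforward.
merged = pd.merge(acc_df, flat, on=["EID", "PCODE"], how="inner")

print(list(indexed.columns))
print(list(flat.columns))
print(merged)
```

Calling reset_index() on the default result is another way to turn the keys back into columns.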
Why do I get an Object not found error when using group_by and summarise() in r?
From @mouli3c3 on Twitter:
I know what caused the problem, though I can't explain clearly why.
library(operators) is somehow masking/changing the original behaviour of %>%. Adding library(magrittr) below library(operators) solved the problem. Let me know if it works.
It worked! :)
Adding missing grouping variables message in dplyr in R
For consistency's sake, grouping variables are always kept once they have been defined, so they are added back when select(value) is executed. ungroup should resolve it:
qu25 <- mydata %>%
  group_by(month, day, station_number) %>%
  arrange(desc(value)) %>%
  slice(2) %>%
  ungroup() %>%
  select(value)
The requested result is then returned without the message:
> mydata %>%
+ group_by(month, day, station_number) %>%
+ arrange(desc(value)) %>%
+ slice(2) %>%
+ ungroup() %>%
+ select(value)
# A tibble: 1 x 1
value
<dbl>
1 113