Why does median trip up data.table (integer versus double)?
TL;DR: wrap median() with as.double().
median() 'trips up' data.table because, even when only passed integer vectors, median() sometimes returns an integer value and sometimes returns a double.
## median of 1:3 is 2, of type "integer"
typeof(median(1:3))
# [1] "integer"
## median of 1:2 is 1.5, of type "double"
typeof(median(1:2))
# [1] "double"
Reproducing your error message with a minimal example:
library(data.table)
dt <- data.table(patients = c(1:3, 1:2),
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))
dt[,median(patients), by=weekdays]
# Error in `[.data.table`(dt, , median(patients), by = weekdays) :
# columns of j don't evaluate to consistent types for each group:
# result for group 2 has column 1 type 'double' but expecting type 'integer'
data.table complains because, after inspecting the value of the first group to be processed, it has concluded that, OK, these results are going to be of type "integer". But then right away (or in your case in group 4), it gets passed a value of type "double", which won't fit in its "integer" results vector.
data.table could instead accumulate results until the end of the group-wise calculations, and then perform type conversions if necessary, but that would require a bunch of additional performance-degrading overhead; instead, it just reports what happened and lets you fix the problem. After the first group has run, and it knows the type of the result, it allocates a result vector of that type as long as the number of groups, and then populates it. If it later finds that some groups return more than 1 item, it will grow (i.e., reallocate) that result vector as needed. In most cases though, data.table's first guess for the final size of the result is right first time (e.g., 1 row result per group) and hence fast.
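As a rough base-R analogy (not data.table's internal code), vapply() behaves in a similar way: the result type is declared up front, and a group that returns a different type is rejected. The sketch below reuses the values from the minimal example; the names groups, strict and fixed are made up for illustration.

```r
# Base-R analogy only; this is not how data.table is implemented internally.
groups <- split(c(1:3, 1:2), c("Mon", "Mon", "Mon", "Tue", "Tue"))

# Declaring an integer result up front fails once a group yields a double.
strict <- tryCatch(
  vapply(groups, median, FUN.VALUE = integer(1)),
  error = function(e) conditionMessage(e)
)
print(strict)  # an error message about the 'double' result

# Forcing every group result to double keeps the type consistent.
fixed <- vapply(groups, function(x) as.double(median(x)), FUN.VALUE = double(1))
print(fixed)
```

Here strict ends up holding the error message rather than a vector of medians, while fixed holds one double per group.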
In this case, using as.double(median(X)) instead of median(X) provides a suitable fix.
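Applied to the minimal example above, the fix might look like the following sketch. The column name med and the variable res are chosen here for illustration only.

```r
library(data.table)

dt <- data.table(patients = c(1:3, 1:2),
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

# as.double() makes every group's result a double, whatever the group length.
res <- dt[, .(med = as.double(median(patients))), by = weekdays]
print(res)
```

Every group now returns the same type, so the per-group results can be combined into a single double column.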
(By the way, your version using round() worked because it always returns values of type "double", as you can see by typing typeof(round(median(1:2))); typeof(round(median(1:3))).)
Why does median and coalesce not work with uneven number of rows?
The median documentation says:
"The default method returns a length-one object of the same type as x, except when x is logical or integer of even length, when the result will be double."
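The error you see is not thrown if df$ID is converted with as.numeric, which suggests that coalesce is getting confused by the type of df$ID. ID is an integer column, so for a group with an even number of rows median returns a double, and coalesce may refuse to combine an integer vector with a double replacement (this appears to depend on the dplyr version).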
library(dplyr)
df <- data.frame(ID = 1:7,
                 Group = c(1, 1, 1, 2, 2, 2, 1),
                 val1 = c(1, NA, 3, 2, 2, 3, 2),
                 val2 = c(2, 2, 2, NA, 1, 3, 2))
# convert ID to numeric
df$ID <- as.numeric(df$ID)
df %>%
group_by(Group) %>%
mutate_at(vars(-group_cols()), ~coalesce(., median(.,na.rm=TRUE))) %>%
ungroup()
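To see where the mismatch comes from, the following base-R sketch checks the type of the per-group median of ID before any conversion to numeric. It uses the same ID and Group values as the data frame above; median_types is a name made up for illustration.

```r
# Same ID and Group values as the example data frame, with ID left as integer.
ID <- 1:7
Group <- c(1, 1, 1, 2, 2, 2, 1)

# Type of the median of ID within each group.
median_types <- vapply(split(ID, Group),
                       function(x) typeof(median(x)),
                       character(1))
print(median_types)
# Group 1 has four rows (even length), so its median is a "double";
# group 2 has three rows (odd length), so its median stays "integer".
```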
Notice also how the class of ID can vary depending on how it is input:
IDa = 1:7
class(IDa)
# [1] "integer"
IDb = c(1,2,3,4,5,6,7)
class(IDb)
# [1] "numeric"
IDc = c(1L,2L,3L,4L,5L,6L,7L)
class(IDc)
# [1] "integer"
data.table := does not support logical data types when adding new column?
Until this bug is fixed (see Matthew Dowle's comment above), you can get around it by directly specifying the type of NA that you want in the new column (except of course for "logical", which is the type that doesn't work at the moment):
DT <- data.table(a=LETTERS[c(1,1:3)],b=4:7,key="a")
DT[ ,newcol := NA_real_] ## Other options are NA_integer_ and NA_character_
# a b newcol
# 1: A 4 NA
# 2: A 5 NA
# 3: B 6 NA
# 4: C 7 NA
## Plain old NA has type and class "logical", partly explaining the
## error message returned by DT[,newcol:=NA]
c(typeof(NA), class(NA))
# [1] "logical" "logical"
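For reference, the typed NA constants mentioned above each carry their own type, which can be checked in base R. The vector name na_types is made up for illustration.

```r
# Each NA constant has a distinct type; plain NA is logical.
na_types <- c(
  logical   = typeof(NA),
  integer   = typeof(NA_integer_),
  double    = typeof(NA_real_),
  character = typeof(NA_character_)
)
print(na_types)
```

Each element of na_types matches its name, so picking the constant whose type matches the intended column sets the new column's type from the start.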