Why Does Median Trip Up Data.Table (Integer Versus Double)

Why does median trip up data.table (integer versus double)?

TL;DR: wrap median() with as.double().

median() 'trips up' data.table because, even when passed only integer vectors, it sometimes returns an integer value and sometimes returns a double.

## median of 1:3 is 2, of type "integer" 
typeof(median(1:3))
# [1] "integer"

## median of 1:2 is 1.5, of type "double"
typeof(median(1:2))
# [1] "double"

Reproducing your error message with a minimal example:

library(data.table)
dt <- data.table(patients = c(1:3, 1:2),
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

dt[, median(patients), by = weekdays]
# Error in `[.data.table`(dt, , median(patients), by = weekdays) :
#   columns of j don't evaluate to consistent types for each group:
#   result for group 2 has column 1 type 'double' but expecting type 'integer'

data.table complains because, after inspecting the value of the first group to be processed, it has concluded that, OK, these results are going to be of type "integer". But then right away (or in your case in group 4), it gets passed a value of type "double", which won't fit in its "integer" results vector.
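To see the mismatch directly, you can ask data.table to report the type of each group's result instead of the value itself; this runs fine because every group then returns a character string (using the dt defined above, with approximate output):

dt[, typeof(median(patients)), by = weekdays]
#    weekdays      V1
# 1:      Mon integer
# 2:      Tue  double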


data.table could instead accumulate results until the end of the group-wise calculations and then perform type conversions if necessary, but that would add performance-degrading overhead; instead, it simply reports what happened and lets you fix the problem. After the first group has run and the type of the result is known, it allocates a result vector of that type, as long as the number of groups, and then populates it. If it later finds that some groups return more than one item, it grows (i.e., reallocates) that result vector as needed. In most cases, though, data.table's first guess for the final size of the result (e.g., one row per group) is right the first time, and hence fast.

In this case, using as.double(median(X)) instead of median(X) provides a suitable fix.
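A minimal sketch of that fix, again with the dt from above; the output should look roughly like this:

dt[, as.double(median(patients)), by = weekdays]
#    weekdays  V1
# 1:      Mon 2.0
# 2:      Tue 1.5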

(By the way, your version using round() worked because round() always returns values of type "double", as you can see by typing typeof(round(median(1:2))); typeof(round(median(1:3))).)
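Running those two calls confirms the claim; both results are of type "double", so data.table sees a consistent type across groups:

typeof(round(median(1:2)))
# [1] "double"

typeof(round(median(1:3)))
# [1] "double"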

Why do median() and coalesce() not work with an uneven number of rows?

The median documentation says

"The default method returns a length-one object of the same type as x,
except when x is logical or integer of even length, when the result
will be double."

And the error you see is not thrown if df$ID is converted with as.numeric(), which suggests coalesce() is getting confused by the class of df$ID.

library(dplyr)
df <- data.frame(ID = 1:7,
                 Group = c(1, 1, 1, 2, 2, 2, 1),
                 val1 = c(1, NA, 3, 2, 2, 3, 2),
                 val2 = c(2, 2, 2, NA, 1, 3, 2))

# convert ID to numeric
df$ID <- as.numeric(df$ID)

df %>%
  group_by(Group) %>%
  mutate_at(vars(-group_cols()), ~ coalesce(., median(., na.rm = TRUE))) %>%
  ungroup()
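With ID converted to numeric the pipeline runs without error: the NA in val1 (group 1) and the NA in val2 (group 2) are each replaced by their group median of 2. The result should look roughly like this (exact tibble formatting depends on your dplyr version):

# # A tibble: 7 x 4
#      ID Group  val1  val2
#   <dbl> <dbl> <dbl> <dbl>
# 1     1     1     1     2
# 2     2     1     2     2
# 3     3     1     3     2
# 4     4     2     2     2
# 5     5     2     2     1
# 6     6     2     3     3
# 7     7     1     2     2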

Notice also how the class of ID can vary depending on how it is input:

IDa <- 1:7
class(IDa)
# [1] "integer"

IDb <- c(1, 2, 3, 4, 5, 6, 7)
class(IDb)
# [1] "numeric"

IDc <- c(1L, 2L, 3L, 4L, 5L, 6L, 7L)
class(IDc)
# [1] "integer"

data.table := does not support logical data types when adding a new column?

Until this bug is fixed (see Matthew Dowle's comment above), you can get around it by directly specifying the type of NA that you want in the new column (except of course for "logical", which is the type that doesn't work at the moment):

DT <- data.table(a = LETTERS[c(1, 1:3)], b = 4:7, key = "a")
DT[, newcol := NA_real_]  ## other options are NA_integer_ and NA_character_
DT  ## := assigns by reference and returns invisibly, so print DT to see the result
#    a b newcol
# 1: A 4     NA
# 2: A 5     NA
# 3: B 6     NA
# 4: C 7     NA

## Plain old NA has type and class "logical", partly explaining the
## error message returned by DT[,newcol:=NA]
c(typeof(NA), class(NA))
# [1] "logical" "logical"

