What Is the Purpose of Setting a Key in data.table

What is the purpose of setting a key in data.table?

In addition to this answer, please refer to the vignettes Secondary indices and auto indexing and Keys and fast binary search based subset as well.

This issue highlights the other vignettes that we plan to write.


I've updated this answer again (Feb 2016) in light of the new on= feature that allows ad-hoc joins as well. See history for earlier (outdated) answers.

What exactly does setkey(DT, a, b) do?

It does two things:

  1. reorders the rows of the data.table DT by the column(s) provided (a, b) by reference, always in increasing order.
  2. marks those columns as key columns, by setting an attribute named "sorted" on DT.

The reordering is both fast (due to data.table's internal radix sorting) and memory efficient (only one extra column of type double is allocated).
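A minimal sketch of both effects (the data and column names here are made up for illustration):

```r
library(data.table)

DT <- data.table(a = c(3L, 1L, 2L), b = c("z", "y", "x"))
setkey(DT, a, b)     # reorders rows by reference, ascending by a then b

key(DT)              # [1] "a" "b"
attr(DT, "sorted")   # same: the key columns are stored in the "sorted" attribute
```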

When is setkey() required?

For grouping operations, setkey() was never an absolute requirement. That is, we can perform a "cold" by or ad hoc by.

## "cold" by
require(data.table)
DT <- data.table(x=rep(1:5, each=2), y=1:10)
DT[, mean(y), by=x] # no key is set, order of groups preserved in result

However, prior to v1.9.6, joins of the form x[i] required the key to be set on x. With the new on= argument in v1.9.6+, this is no longer true, so setting keys is not an absolute requirement here either.

## joins using < v1.9.6 
setkey(X, a) # absolutely required
setkey(Y, a) # not absolutely required as long as 'a' is the first column
X[Y]

## joins using v1.9.6+
X[Y, on="a"]
# or if the column names are x_a and y_a respectively
X[Y, on=c("x_a" = "y_a")]

Note that the on= argument can be specified explicitly even for keyed joins.

The only operation that absolutely requires the key to be set is the foverlaps() function. But we are working on more features which, when done, would remove this requirement.
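A minimal sketch of that requirement using foverlaps() (toy intervals; the column names and data are made up):

```r
library(data.table)

x <- data.table(start = c(1L, 5L), end = c(4L, 10L), val = c("p", "q"))
y <- data.table(start = c(2L, 6L), end = c(3L, 7L))

# foverlaps() insists that y is keyed, with the interval columns last
setkey(y, start, end)
foverlaps(x, y, type = "any")   # each x interval matched to overlapping y rows
```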

  • So what's the reason for implementing the on= argument?

    There are quite a few reasons.

    1. It allows us to clearly distinguish the operation as one involving two data.tables. Just doing X[Y] does not convey this as well, although it could be clear by naming the variables appropriately.

    2. It also allows us to understand immediately, by looking at that line of code, the columns on which the join/subset is being performed (without having to trace back to the corresponding setkey() line).

    3. In operations where columns are added or updated by reference, on= operations are much more performant, because they don't need the entire data.table to be reordered just to add/update column(s). For example,

      ## compare
      setkey(X, a, b) # why physically reorder X just to add/update a column?
      X[Y, col := i.val]

      ## to
      X[Y, col := i.val, on=c("a", "b")]

      In the second case, we did not have to reorder. It's not computing the order that is time consuming, but physically reordering the data.table in RAM; by avoiding that, we retain the original order, and the operation is still performant.

    4. Even otherwise, unless you're performing joins repetitively, there should be no noticeable performance difference between keyed and ad hoc joins.

This leads to the question, what advantage does keying a data.table have anymore?

  • Is there an advantage to keying a data.table?

    Keying a data.table physically reorders it in RAM based on those column(s). Computing the order is not usually the time-consuming part; rather, it is the reordering itself. However, once we have the data sorted in RAM, the rows belonging to the same group are all contiguous, which is very cache efficient. It's the sortedness that speeds up operations on keyed data.tables.

    It is therefore essential to figure out whether the time spent reordering the entire data.table is worth the gain from a cache-efficient join/aggregation. Usually, unless repetitive grouping/join operations are being performed on the same keyed data.table, there should not be a noticeable difference.

In most cases therefore, there shouldn't be a need to set keys anymore. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.

Question: What do you think the performance would be like, in comparison to a keyed join, if you used setorder() to reorder the data.table and then used on=? If you've followed thus far, you should be able to figure it out :-).
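One way to set up that experiment yourself (a hedged sketch; X, Y and the sizes are made up for illustration):

```r
library(data.table)

set.seed(1L)
X <- data.table(a = sample(1e6L), v = runif(1e6L))
Y <- data.table(a = sample(1e6L, 1e5L))

X1 <- copy(X); setkey(X1, a)      # keyed: physical reorder + "sorted" attribute
X2 <- copy(X); setorder(X2, a)    # same physical order, but no key set

system.time(X1[Y])                # keyed join
system.time(X2[Y, on = "a"])      # ad hoc join on already-ordered data
```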

In R's data.table, how is the key of a data.table preserved into subsets referenced using .SD?

In this case, since it's sorted by site, group, team, while grouping by site the key could be retained for group, team, as the order would be maintained. The simplest answer is that we seem to have missed this case. Could you please file an issue with a link to this post?

As a workaround, you can use the by argument of the unique method for data.tables to specify the columns.

And as David pointed out, using unique(.SD) on every group seems unnecessary, but that's probably for another Q.
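A minimal sketch of that workaround (the site/group/team column names echo the question; the data itself is made up):

```r
library(data.table)

DT <- data.table(site  = rep(1:2, each = 4),
                 group = rep(1:2, 4),
                 team  = rep(1L, 8),
                 val   = 1:8)
setkey(DT, site, group, team)

# specify the columns explicitly rather than relying on unique(.SD)'s key
DT[, unique(.SD, by = c("group", "team")), by = site]
```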

Using a key to replace values across a whole data.table

You can use melt and dcast:

dcast(
  rating[melt(df, id = c("V1", "V2"), value.name = "Rating"), on = "Rating"],
  V1 + V2 ~ variable, value.var = "CreditQuality"
)

Output:

             V1              V2 V3 V4 V5 V6 V7 V8 V9
1: XS0041971275 TR.IssuerRating  1  1  1  1  2  2  1
2: XS0043098127 TR.IssuerRating  6  6  6  6  6  6  6
3: XS0285400197 TR.IssuerRating  2  2  2  2  2  2  2

Note: I'm assuming your source data is df, and your Rating data is rating. I see that your frames are already of class data.table.

How to know what's the key in an R data.table?

We can use key() to get the key of the data.table:

key(df1)
#[1] "col1" "col2"

data

df1 <- data.table(col1 = 1:5, col2 = 6:10, col3 = 11:15)
setkey(df1, col1, col2)

In a keyed data.table, changing the value of the key variable at a certain `where` sets key to `NULL`

From setkey's help file:

‘setkey()’ sorts a ‘data.table’ and marks it as sorted. [...] The
columns are sorted in ascending order always.

When you replace elements in any of the keyed columns, the data.table is no longer ordered (or at least can't be guaranteed to be), so the key is unset to reflect that changed reality.
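A quick demonstration of the key being dropped (minimal made-up data):

```r
library(data.table)

DT <- data.table(x = c("a", "b", "c"), v = 1:3)
setkey(DT, x)
key(DT)             # [1] "x"

DT["b", x := "z"]   # replace values in the key column by reference
key(DT)             # NULL -- sortedness can no longer be guaranteed
```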

An easy solution is to just immediately reset the key:

## Creates the example data.table
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
setkey(DT, x)

## Immediately resets the (possibly multicolumn) key
setkeyv(DT["a", x:="z"], key(DT))

key(DT)
# [1] "x"

How data.table sorts strings when setting key

Update March 2014

There's been some debate about this one. As of v1.9.2 we've settled for now on setkey sorting using C locale; e.g., all capital letters come before all lower case letters, regardless of user's locale. This was a change made in v1.8.8 which we had intended to reverse but have stuck with for now.

Consider save()-ing a keyed table in your locale and a colleague load()-ing it in a different locale. When they join to that table, it may no longer work correctly if it were sorted in the original locale's order. We have to think a bit more carefully if setkey is to allow locale ordering again, probably by saving the locale name along with the "sorted" attribute, so data.table can at least compare and detect whether the current locale differs from the one in which setkey ran.

It's also for speed reasons, since sorting according to a locale is much slower than the C locale. Still, we should do it as efficiently as possible, and allowing locale ordering optionally would be ideal.

Hence, this is now a feature request and further comments are very welcome.

FR#4842 setkey to sort using session's locale not C locale



Nice catch! The call to setkey in turn calls setkeyv, which calls fastorder to "order" the columns/entries, and that in turn calls chorder.

chorder in turn calls a C function in Ccountingcharacter.c. Here, I suppose, the problem arises due to the "locale".

Let's see what "locale" I'm in on my mac:

Sys.getlocale()
# [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

Now let's see how order sorts it:

x <- c("USA", "Ubuntu", "Uzbekistan")
order(x)
# [1] 2 1 3

Now, let's change the "locale" to "C".

Sys.setlocale("LC_ALL", "C")
# [1] "C/C/C/C/C/en_US.UTF-8"

order(x)
# [1] 1 2 3

From ?order:

The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.

From ?Comparison:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z....

So, basically, order under the "C" locale gives the same ordering as data.table's setkey. My guess is that the C function called by chorder runs in the C locale, which compares ASCII values, in which "S" comes before "b".

It's probably important to bring this to @MatthewDowle's attention (if he's not already aware of it). So, I'd suggest that you file this as a bug here (just to be sure).

setkey and the := operator, data.table, R

These two FAQs seem close:

3.2 I don't have a key on a large table, but grouping is still really quick. Why is that?

data.table uses radix sorting. This is significantly faster than other sort algorithms. Radix is specifically for integers only; see ?base::sort.list(x, method="radix"). This is also one reason why setkey is quick. When no key is set, or we group in a different order from that of the key, we call it an ad hoc by.

3.3 Why is grouping by columns in the key faster than an ad hoc by?

Because each group is contiguous in RAM, thereby minimising page fetches, and memory can be copied in bulk (memcpy in C) rather than looping in C.

What it doesn't say, and probably should do, is that you need a very large dataset, where each group is also very large, before you notice the difference between keyed by and ad hoc by. Something like 100 groups of 100MB each (a 10GB data.table) such as 1e8 rows and 13 columns. Otherwise, there is no need to setkey first, especially since that can get onerous.
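A hedged sketch of that comparison (N here is far below the 10GB scale mentioned above, so expect little or no visible difference at this size):

```r
library(data.table)

set.seed(42L)
N  <- 1e7L
DT <- data.table(g = sample(100L, N, TRUE), v = runif(N))

system.time(DT[, mean(v), by = g])   # ad hoc by

setkey(DT, g)                        # groups become contiguous in RAM
system.time(DT[, mean(v), by = g])   # keyed by
```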

data.table - does setkey(...) create an index or physically reorder the rows in a data table?

  1. The rows are sorted. "Changed by reference" here means there is no copying of the entire table and rows are just swapped.

  2. setkey(DT, NULL) is equivalent to setattr(DT, "sorted", NULL). It simply unsets the "sorted" attribute.
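A short illustration of both points (made-up data):

```r
library(data.table)

DT <- data.table(x = c(2L, 1L, 3L), y = 1:3)
setkey(DT, x)        # rows physically reordered: x is now 1, 2, 3
key(DT)              # [1] "x"

setkey(DT, NULL)     # same as setattr(DT, "sorted", NULL)
key(DT)              # NULL
DT$x                 # the rows stay in their sorted order
```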

R Fastest way to get value from a data.table under multiple criteria

I think the approach will depend on the exact structure of your multiple values (13, and where these are stored), but you could think about using joins, etc.

library(data.table)
library(stringr)   # for str_sub()

melt(value.dt[, number := .I], id.vars = "number", variable.name = "ID")[
  , `:=`(ID = str_sub(ID, 1, 4), gender = str_sub(ID, -1, -1))][
  dt[supp.dt, on = c("criteria" = "crits")],
  on = .(ID, gender, number)]

Output:

   number     ID     value gender   person criteria
    <int> <char>     <num> <char>   <char>   <char>
1:      5   ID.1 0.8572478      M     Rick        A
2:     77   ID.1 0.6211473      M     Ryan        A
3:     62   ID.2 1.8570321      F Michelle        B
4:     25   ID.3 2.5732931      M  Richard        C
5:     91   ID.3 2.0300149      M    Larry        C

Update

If you have a number of "value" frames, you can take the following approach:

First, put them in a named list

value_frames = list("valueoneID" = valueone.dt,"valuetwoID" = valuetwo.dt)

Second, create a single value.dt data.table that row-binds melted versions of these different frames from value_frames

value.dt = rbindlist(
  lapply(value_frames, \(f) melt(f[, number := .I], id.vars = "number", variable.name = "ID")),
  idcol = "vsrc"
)

Then, join value.dt to a melted version of a join between dt and supp.dt, and dcast the result back to wide.

dcast(
  value.dt[, `:=`(ID = str_sub(ID, 1, -2), gender = str_sub(ID, -1, -1))][
    melt(dt[supp.dt, on = c("criteria" = "crits")],
         measure.vars = patterns("value"),
         variable.name = "vsrc",
         value.name = "ID"),
    on = .(ID, gender, number, vsrc)],
  gender + person + number ~ vsrc, value.var = "value"
)

