dplyr row_number() Error in rank

dplyr row_number Error in rank

You have loaded plyr after dplyr, so plyr::mutate masks dplyr::mutate (the warning printed when plyr was attached says as much). To remedy this, either load plyr first and then dplyr, or reference dplyr explicitly in the mutate call:

out %>% group_by(site) %>% dplyr::mutate(row = paste0("corr", row_number()))
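
For illustration, here is a minimal sketch of the masking and both fixes (the data frame is hypothetical; only the site column matters):

library(dplyr)
library(plyr)  # message: plyr::mutate (among others) now masks dplyr::mutate

out <- data.frame(site = rep(c("A", "B"), each = 3))

# This would now dispatch to plyr::mutate, which knows nothing about
# dplyr groups, so row_number() fails with the rank() error:
# out %>% group_by(site) %>% mutate(row = paste0("corr", row_number()))

# Fix 1: qualify the call explicitly
out %>% group_by(site) %>% dplyr::mutate(row = paste0("corr", row_number()))

# Fix 2: detach plyr, so dplyr::mutate is found first again
detach("package:plyr", unload = TRUE)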

row_number() from dplyr: An example with a vector

If you look at help("row_number"), you will find that:

row_number(): equivalent to rank(ties.method = "first")

Moreover, since your vector also contains NA, it should be noted that:

x (is) a vector of values to rank. Missing values are left as is. If
you want to treat them as the smallest or largest values, replace with
Inf or -Inf before ranking.

What can be used instead, since it simply numbers the positions (NA included), is:

seq_along(x)

[1] 1 2 3 4 5 6
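
To see why row_number() is awkward here, a small sketch with a made-up six-element vector containing NA; the NAs are left as is, while seq_along() above simply numbers all positions 1 to 6:

x <- c(10, NA, 3, NA, 7, 1)
dplyr::row_number(x)
# [1]  4 NA  2 NA  3  1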

Rank variable by group (dplyr)

The following produces the desired result, as specified in the question.

library(dplyr)

by_species <- iris %>%
  arrange(Species, Sepal.Length) %>%
  group_by(Species) %>%
  mutate(rank = rank(Sepal.Length, ties.method = "first"))

by_species %>% filter(rank <= 3)
## Source: local data frame [9 x 6]
## Groups: Species [3]
##
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  rank
##          (dbl)       (dbl)        (dbl)       (dbl)     (fctr) (int)
## 1          4.3         3.0          1.1         0.1     setosa     1
## 2          4.4         2.9          1.4         0.2     setosa     2
## 3          4.4         3.0          1.3         0.2     setosa     3
## 4          4.9         2.4          3.3         1.0 versicolor     1
## 5          5.0         2.0          3.5         1.0 versicolor     2
## 6          5.0         2.3          3.3         1.0 versicolor     3
## 7          4.9         2.5          4.5         1.7  virginica     1
## 8          5.6         2.8          4.9         2.0  virginica     2
## 9          5.7         2.5          5.0         2.0  virginica     3

by_species %>% slice(1:3)
## Source: local data frame [9 x 6]
## Groups: Species [3]
##
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  rank
##          (dbl)       (dbl)        (dbl)       (dbl)     (fctr) (int)
## 1          4.3         3.0          1.1         0.1     setosa     1
## 2          4.4         2.9          1.4         0.2     setosa     2
## 3          4.4         3.0          1.3         0.2     setosa     3
## 4          4.9         2.4          3.3         1.0 versicolor     1
## 5          5.0         2.0          3.5         1.0 versicolor     2
## 6          5.0         2.3          3.3         1.0 versicolor     3
## 7          4.9         2.5          4.5         1.7  virginica     1
## 8          5.6         2.8          4.9         2.0  virginica     2
## 9          5.7         2.5          5.0         2.0  virginica     3
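
Since row_number() is equivalent to rank(ties.method = "first") (as quoted in the answer above), the same ranking can also be written as a sketch like this:

by_species <- iris %>%
  arrange(Species, Sepal.Length) %>%
  group_by(Species) %>%
  mutate(rank = row_number(Sepal.Length))  # same as rank(..., ties.method = "first")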

unique rows in dplyr: row_number() from tbl_dt inconsistent with tbl_df

Interesting. Your benchmarks piqued my interest. I find it a bit odd that you don't compare against data.table's unique.data.table directly, so here are the results with that included as well, on my system.

# extra function with which the benchmark shown below was run
dt_direct <- function() unique(dt) # where dt = as.data.table(little)

# Unit: milliseconds
#        expr       min        lq    median        uq       max neval
#      dt_u() 1472.2460 1571.0871 1664.0476 1742.5184 2647.2118    20
#      df_u() 6084.2877 6303.9058 6490.1686 6844.8767 7370.3322    20
#     dt_ss() 1340.8479 1485.4064 1552.8756 1586.6706 1810.2979    20
#     df_ss()  799.5289  835.8599  884.6501  957.2208 1251.5994    20
#     df_rn() 1410.0145 1576.2033 1660.1124 1770.2645 2442.7578    20
# dt_direct()  452.6010  463.6116  486.5015  568.0451  670.3673    20

It's 1.8x faster than the fastest solution from all your runs.
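
For reference, here is a sketch of how that extra timing could be reproduced; the data construction is an assumption (676 unique combinations suggests two columns sampled from letters, i.e. 26 x 26), and the other benchmarked functions come from the question and aren't repeated here:

library(data.table)
library(microbenchmark)

# assumed construction of the original data (26 x 26 = 676 combinations)
little <- data.frame(Var1 = sample(letters, 1e7, TRUE),
                     Var2 = sample(letters, 1e7, TRUE))
dt <- as.data.table(little)
dt_direct <- function() unique(dt)

microbenchmark(dt_direct(), times = 20L)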

Now, let's increase the number of unique values from 676 to about 10,000 and see what happens.

val <- paste0("V", 1:100)
little <- data.frame(Var1 = sample(val, 1e7, TRUE), Var2 = sample(val, 1e7, TRUE))
dt <- as.data.table(little)

# Unit: milliseconds
#        expr      min        lq    median        uq       max neval
#      dt_u() 1709.458 1776.3510 1892.7761 1991.6339 2562.9171    20
#      df_u() 7541.364 7735.4725 7981.3483 8462.9093 9552.8629    20
#     dt_ss() 1555.110 1627.6519 1791.5219 1911.3594 2299.2864    20
#     df_ss() 1436.355 1500.1043 1528.1319 1649.3043 1961.9945    20
#     df_rn() 2001.396 2189.5164 2393.8861 2550.2198 3047.7019    20
# dt_direct()  508.596  525.7299  577.6982  674.2288  893.2116    20

And here, it's 2.6x faster.

Note: I don't time the creation of dt here because, in real use cases, you can either use fread to get a data.table directly, use setDT to convert a data.frame to a data.table by reference, or use data.table(.) directly instead of data.frame(.) - none of which is timed here either.
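
A sketch of those alternatives (the file name is hypothetical):

library(data.table)

DT1 <- fread("data.csv")  # read straight into a data.table
setDT(little)             # or convert an existing data.frame by reference, no copy
DT2 <- data.table(Var1 = sample(val, 1e7, TRUE),
                  Var2 = sample(val, 1e7, TRUE))  # or build one directly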


But why are both dt_u and dt_ss slower, then?

Looking at the files grouped-dt.r and manip-grouped-dt.r, this is happening because of 1) copies and 2) setting keys, where (1) is needed basically because of having to do (2). If you do a summarise operation using dplyr, it's equivalent to:

DT <- copy(DT)
setkey(DT, <group_cols>)    ## these two are in grouped_dt
DT[, j, by = <group_cols>]  ## this is in summarise.grouped_dt
DT <- copy(DT)              ## because it calls grouped_dt AGAIN!
## and sets the key again - which is O(n) now, as DT is first checked for being sorted

I'm not sure why ad-hoc grouping was not implemented after this discussion under Hadley's answer.

## equivalent ad-hoc by
DT[, j, by = <group_cols>]  ## no copy, no setkey

It avoids both the copies and the key setting.
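
In plain data.table, on a toy table, the ad-hoc grouped summarise looks like this:

library(data.table)
DT <- data.table(g = rep(c("a", "b"), each = 3), v = 1:6)
DT[, .(total = sum(v)), by = g]  # groups on the fly: no copy, no setkey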


It is even worse if you mutate. It's effectively doing:

DT <- copy(DT)
setkey(DT, <group_cols>)            ## these two are in grouped_dt
DT <- copy(DT)                      ## mutate.grouped_dt copies the copied data again
DT[, `:=`(...), by = <group_cols>]  ## this is in mutate.grouped_dt
DT <- copy(DT)                      ## because of another call to grouped_dt!!!
## and sets the key again - which is O(n) now, as DT is first checked for being sorted

Here again, the ad-hoc solution is simply:

DT <- copy(DT)
DT[, `:=`(...), by = <group_cols>]

It avoids two copies and the key setting. The only remaining copy is there to satisfy dplyr's philosophy of not modifying objects in place; so this will always be slower, and take up twice the memory, in dplyr.
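
Concretely, on the toy DT from the summarise sketch above, the ad-hoc grouped mutate is a single by-reference pass:

DT[, cum := cumsum(v), by = g]  # adds the column in place: one copy-free pass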


Similarly, copies on some joins can be avoided as I've commented here.


The NEWS item from dplyr v0.2 says:

  • dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn't own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).

But clearly, quite a few of the cases discussed here haven't made it in.


So far, this has been about the performance tag under your question. That is, if you're looking for performance, you should avoid all the cases that make (unnecessary) copies (and set keys), until this is fixed.

With that in mind, the best answer I could come up with in this particular case is to just call unique.data.table directly, in a dplyr-ish way:

tbl_dt(little) %>% unique(.)

dplyr row_number() behavior in boolean operation

This issue has been closed with this commit.

The current version available on CRAN at the time of this answer is version 0.3.0.2. This version demonstrates the behavior you are expecting.

temp <- iris %>%
  mutate(newcol = row_number() - Sepal.Length < 0 |
                  row_number() - Petal.Length < 0)
head(temp)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species newcol
# 1          5.1         3.5          1.4         0.2  setosa   TRUE
# 2          4.9         3.0          1.4         0.2  setosa   TRUE
# 3          4.7         3.2          1.3         0.2  setosa   TRUE
# 4          4.6         3.1          1.5         0.2  setosa   TRUE
# 5          5.0         3.6          1.4         0.2  setosa  FALSE
# 6          5.4         3.9          1.7         0.4  setosa  FALSE
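
If you are stuck on an older version, a version-agnostic sketch is to materialize the row index first; seq_len(n()) is exactly what row_number() computes for ungrouped data:

temp <- iris %>%
  mutate(idx = seq_len(n()),
         newcol = idx - Sepal.Length < 0 | idx - Petal.Length < 0)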

Assigning row_number() as a column value over a list of dataframes in R

You can use lapply and cbind to assign the row numbers:

lapply(df_list, function(x) cbind(x, rank = seq_len(NROW(x))))
#[[1]]
#  a  b rank
#1 A 12    1
#2 B  8    2
#3 C  9    3
#4 D  5    4
#5 E  6    5
#6 F 18    6
#
#[[2]]
#  c  b rank
#1 K 12    1
#2 J  8    2
#3 L  9    3
#4 M  5    4
#5 N  6    5
#6 O 18    6
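
Since the rest of this page uses dplyr, an equivalent sketch with mutate() and row_number():

library(dplyr)
lapply(df_list, function(x) mutate(x, rank = row_number()))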

Syntax error in SQL code being used in R, no row_number function found

It looks like you are missing a comma after the *. The following works as expected:

DBI::dbGetQuery(con, '
  SELECT *,
         row_number() over (partition by cyl) AS rn
  FROM sql_mtcars
')

Note the addition of the comma after the * and before row_number().
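
For completeness, a sketch of the setup the query assumes; the table name sql_mtcars comes from the question, and window functions such as row_number() need a backend that supports them (e.g. a reasonably recent SQLite):

library(DBI)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "sql_mtcars", mtcars)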


