dplyr row_number Error in rank
You have loaded plyr after loading dplyr. This overwrites dplyr::mutate with plyr::mutate (a warning message tells you so). To remedy the issue, either load plyr first and then load dplyr, or explicitly reference dplyr in the mutate call:
out %>% group_by(site) %>% dplyr::mutate(row = paste0("corr", row_number()))
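To see which package's mutate currently wins, you can inspect the function's namespace; a minimal sketch, assuming both packages are installed:

```r
library(dplyr)
library(plyr)  # loaded second, so plyr::mutate now masks dplyr::mutate

environmentName(environment(mutate))  # "plyr" -- the masking version

# Detaching plyr (or restarting R and loading plyr before dplyr)
# makes plain mutate resolve to dplyr::mutate again
detach("package:plyr", unload = TRUE)
environmentName(environment(mutate))  # "dplyr"
```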
row_number() from dplyr: An example with a vector
If you look at help("row_number"), you can find that:
row_number(): equivalent to rank(ties.method = "first")
Moreover, as your vector also contains NA, it should be noted that:
x (is) a vector of values to rank. Missing values are left as is. If
you want to treat them as the smallest or largest values, replace with
Inf or -Inf before ranking.
What can be used instead is:
seq_along(x)
[1] 1 2 3 4 5 6
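The difference matters whenever NAs are present; a quick sketch:

```r
library(dplyr)

x <- c(10, 2, NA, 7)
row_number(x)  # 3 1 NA 2 -- ranks the values, leaves NA as is
seq_along(x)   # 1 2 3 4  -- numbers every position, NA included
```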
Rank variable by group (dplyr)
The following produces the desired result, as specified.
library(dplyr)
by_species <- iris %>% arrange(Species, Sepal.Length) %>%
group_by(Species) %>%
mutate(rank = rank(Sepal.Length, ties.method = "first"))
by_species %>% filter(rank <= 3)
##Source: local data frame [9 x 6]
##Groups: Species [3]
##
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species rank
## (dbl) (dbl) (dbl) (dbl) (fctr) (int)
##1 4.3 3.0 1.1 0.1 setosa 1
##2 4.4 2.9 1.4 0.2 setosa 2
##3 4.4 3.0 1.3 0.2 setosa 3
##4 4.9 2.4 3.3 1.0 versicolor 1
##5 5.0 2.0 3.5 1.0 versicolor 2
##6 5.0 2.3 3.3 1.0 versicolor 3
##7 4.9 2.5 4.5 1.7 virginica 1
##8 5.6 2.8 4.9 2.0 virginica 2
##9 5.7 2.5 5.0 2.0 virginica 3
by_species %>% slice(1:3)
##Source: local data frame [9 x 6]
##Groups: Species [3]
##
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species rank
## (dbl) (dbl) (dbl) (dbl) (fctr) (int)
##1 4.3 3.0 1.1 0.1 setosa 1
##2 4.4 2.9 1.4 0.2 setosa 2
##3 4.4 3.0 1.3 0.2 setosa 3
##4 4.9 2.4 3.3 1.0 versicolor 1
##5 5.0 2.0 3.5 1.0 versicolor 2
##6 5.0 2.3 3.3 1.0 versicolor 3
##7 4.9 2.5 4.5 1.7 virginica 1
##8 5.6 2.8 4.9 2.0 virginica 2
##9 5.7 2.5 5.0 2.0 virginica 3
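Since row_number() is equivalent to rank(ties.method = "first"), the mutate step above can also be written with it; a sketch on the same data:

```r
library(dplyr)

by_species <- iris %>%
  arrange(Species, Sepal.Length) %>%
  group_by(Species) %>%
  mutate(rank = row_number(Sepal.Length))  # same as rank(..., ties.method = "first")

by_species %>% filter(rank <= 3)  # same 9 rows as the output above
```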
unique rows in dplyr: row_number() from tbl_dt inconsistent with tbl_df
Interesting. Your benchmarks piqued my interest. I find it a bit odd that you don't compare against data.table's unique.data.table directly, so here are the results with that included as well on my system.
# extra function with which the benchmark shown below was run
dt_direct <- function() unique(dt) # where dt = as.data.table(little)
# Unit: milliseconds
# expr min lq median uq max neval
# dt_u() 1472.2460 1571.0871 1664.0476 1742.5184 2647.2118 20
# df_u() 6084.2877 6303.9058 6490.1686 6844.8767 7370.3322 20
# dt_ss() 1340.8479 1485.4064 1552.8756 1586.6706 1810.2979 20
# df_ss() 799.5289 835.8599 884.6501 957.2208 1251.5994 20
# df_rn() 1410.0145 1576.2033 1660.1124 1770.2645 2442.7578 20
# dt_direct() 452.6010 463.6116 486.5015 568.0451 670.3673 20
It's 1.8x faster than the fastest solution from all your runs.
Now, let's increase the number of unique values from 676 to about 10,000 and see what happens.
val = paste0("V", 1:100)
little <- data.frame(Var1=sample(val, 1e7, TRUE), Var2=sample(val, 1e7, TRUE))
dt <- as.data.table(little)
# Unit: milliseconds
# expr min lq median uq max neval
# dt_u() 1709.458 1776.3510 1892.7761 1991.6339 2562.9171 20
# df_u() 7541.364 7735.4725 7981.3483 8462.9093 9552.8629 20
# dt_ss() 1555.110 1627.6519 1791.5219 1911.3594 2299.2864 20
# df_ss() 1436.355 1500.1043 1528.1319 1649.3043 1961.9945 20
# df_rn() 2001.396 2189.5164 2393.8861 2550.2198 3047.7019 20
# dt_direct() 508.596 525.7299 577.6982 674.2288 893.2116 20
And here, it's 2.6x faster.
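For reference, a sketch of how these timings could be reproduced (the dt_u/df_u/dt_ss/df_ss/df_rn functions come from the question and are not repeated here; only dt_direct is shown):

```r
library(data.table)
library(microbenchmark)

val    <- paste0("V", 1:100)
little <- data.frame(Var1 = sample(val, 1e7, TRUE), Var2 = sample(val, 1e7, TRUE))
dt     <- as.data.table(little)

# data.table's unique.data.table is dispatched directly on dt
dt_direct <- function() unique(dt)

microbenchmark(dt_direct(), times = 20L)
```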
Note: I don't time the creation of dt here because, in real use cases, you can either use fread to get a data.table directly, or use setDT to convert a data.frame by reference, or use data.table(.) directly instead of data.frame(.) - and that is not timed either.
But why are both dt_u and dt_ss slower, then?
By looking at the files grouped-dt.r and manip-grouped-dt.r, this is happening because of 1) copies and 2) setting keys. (1) is basically because of having to do (2). If you do a summarise operation using dplyr, it's equivalent to:
DT <- copy(DT);
setkey(DT, <group_cols>) ## these two are in grouped_dt
DT[, j, by=<group_cols>] ## this is in summarise.grouped_dt
DT <- copy(DT) ## because it calls grouped_dt AGAIN!
## and sets key again - which is O(n) now as DT checked if sorted first..
I'm not sure why ad-hoc grouping was not implemented after this discussion under Hadley's answer.
## equivalent ad-hoc by
DT[, j, by=<group_cols>] ## no copy, no setkey
It avoids both copies and setting key.
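Ad-hoc grouping in plain data.table looks like this; a minimal concrete instance of the pseudocode above:

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = 1:3)

# grouping specified directly in `by` -- no copy() and no setkey() needed
DT[, .(total = sum(x)), by = g]  # g = "a" -> 3, g = "b" -> 3
```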
It is even worse if you mutate. It's effectively doing:
DT <- copy(DT)
setkey(DT, <group_cols>) ## these two are in grouped_dt
DT <- copy(DT) ## mutate.grouped_dt copies copied data again
DT[, `:=`(...), by=<group_cols>] ## this is in mutate.grouped_dt
DT = copy(DT) ## because of another call to grouped_dt!!!
## and sets key again - which is O(n) now as DT is checked if sorted first..
Here again, the ad-hoc solution is simply:
DT = copy(DT)
DT[, `:=`(...), by=group_cols]
It avoids two copies and setting the key. The only remaining copy is there to satisfy dplyr's philosophy of not modifying objects in place. So this will always be slower, and take up twice the memory, in dplyr.
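The grouped-mutate equivalent in plain data.table uses `:=`, which adds the column by reference with no copies at all; a minimal sketch:

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = 1:3)

# `:=` modifies DT in place; with `by`, the computation runs per group
DT[, cum := cumsum(x), by = g]
DT$cum  # 1 3 3
```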
Similarly, copies on some joins can be avoided as I've commented here.
The NEWS item from dplyr v0.2
says:
- dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn't own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).
But clearly quite some discussed cases haven't made it.
So far, I've written about the performance tag under your question. That is, if you're looking for performance, you should avoid all cases that make (unnecessary) copies (and set keys), until this is fixed.
In that essence, in this particular case, the best answer I could come up with is to just call unique.data.table directly, in a dplyr-ish way:
tbl_dt(little) %>% unique(.)
dplyr row_number() behavior in boolean operation
This issue has been closed with this commit.
The current version available on CRAN at the time of this answer is version 0.3.0.2. This version demonstrates the behavior you are expecting.
temp <- iris %>%
mutate(newcol = row_number() - Sepal.Length < 0 |
row_number() - Petal.Length < 0)
head(temp)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species newcol
# 1 5.1 3.5 1.4 0.2 setosa TRUE
# 2 4.9 3.0 1.4 0.2 setosa TRUE
# 3 4.7 3.2 1.3 0.2 setosa TRUE
# 4 4.6 3.1 1.5 0.2 setosa TRUE
# 5 5.0 3.6 1.4 0.2 setosa FALSE
# 6 5.4 3.9 1.7 0.4 setosa FALSE
Assigning row_number() as a column value over a list of dataframes in R
You can use lapply and cbind to assign row numbers:
lapply(df_list, function(x) cbind(x, rank=seq_len(NROW(x))))
#[[1]]
# a b rank
#1 A 12 1
#2 B 8 2
#3 C 9 3
#4 D 5 4
#5 E 6 5
#6 F 18 6
#
#[[2]]
# c b rank
#1 K 12 1
#2 J 8 2
#3 L 9 3
#4 M 5 4
#5 N 6 5
#6 O 18 6
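If you prefer the dplyr idiom, the same can be done with mutate() and row_number() inside lapply. A sketch, where df_list is a hypothetical stand-in for the question's list of data frames:

```r
library(dplyr)

# hypothetical stand-in for the question's df_list
df_list <- list(
  data.frame(a = c("A", "B", "C", "D", "E", "F"), b = c(12, 8, 9, 5, 6, 18)),
  data.frame(c = c("K", "J", "L", "M", "N", "O"), b = c(12, 8, 9, 5, 6, 18))
)

# row_number() with no argument gives 1:n() within each data frame
lapply(df_list, function(x) mutate(x, rank = row_number()))
```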
Syntax error in SQL code being used in R, no row_number function found
It looks like you are missing a comma after the *. The following works as expected:
DBI::dbGetQuery(con, '
SELECT *,
row_number() over(partition by cyl) AS rn
FROM sql_mtcars
')
Note the addition of a comma after the * and before row_number.
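If you'd rather not write the SQL by hand, dbplyr can generate the same window function from dplyr verbs. A sketch using an in-memory SQLite table as a stand-in for the question's sql_mtcars (assumes the dbplyr and RSQLite packages are installed; row_number() without an arrange() will warn that ties are resolved arbitrarily):

```r
library(dplyr)
library(dbplyr)

# stand-in connection and table for the question's `con` / `sql_mtcars`
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "sql_mtcars", mtcars)

tbl(con, "sql_mtcars") %>%
  group_by(cyl) %>%
  mutate(rn = row_number()) %>%  # translated to ROW_NUMBER() OVER (PARTITION BY `cyl`)
  collect()
```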