How to ddply() Without Sorting

How to ddply() without sorting?

This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:

# Peter's version used a function gensym to
# create the column name, but I couldn't track down
# which package it was in.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  # seq_len() stays correct even when data has zero rows
  data[, col] <- seq_len(nrow(data))
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[, col]), ]
  out[, col] <- NULL
  out
}

#Some sample data
d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA,
-6L), class = "data.frame")

#This one re-sorts by g
library(plyr)
ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d

#This one does not
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d
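
Since gensym could not be tracked down, a base R stand-in for it might look like the following. This is only a sketch: unique_col_name is a made-up helper (not Peter's function and not part of plyr) that picks a column name the data frame does not already use, so the hard-coded ".sortColumn" cannot clash with a real column.

```r
# Hypothetical helper, not from plyr: return a column name unused in `data`.
unique_col_name <- function(data, base = ".sortColumn") {
  # make.unique() appends .1, .2, ... to repeated names;
  # the last element is the (possibly renamed) candidate.
  candidates <- make.unique(c(names(data), base))
  candidates[length(candidates)]
}

# A data frame that already has a column called .sortColumn
df <- data.frame(g = c(2, 1), .sortColumn = c("x", "y"))
unique_col_name(df)   # ".sortColumn.1"
```

Inside keeping.order, col could then be set with col <- unique_col_name(data) instead of the fixed string.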

Please do read the thread for Hadley's notes on why this functionality may not be general enough to roll into ddply. That caveat probably applies in your case, since you are likely returning fewer rows from each piece than went in.
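
To make the limitation concrete, here is a small sketch with made-up data. keeping.order is repeated from above so the snippet runs on its own. With mutate the helper column survives and the original row order comes back; with summarise the helper column is dropped, so the wrapper stops with its own error.

```r
library(plyr)

# keeping.order as given above, repeated so this snippet is self-contained.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  data[, col] <- seq_len(nrow(data))
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[, col]), ]
  out[, col] <- NULL
  out
}

# Illustrative data only
d <- data.frame(g = c(2L, 2L, 1L, 1L), v = c(4, 2, 1, 3))

# mutate returns one row per input row and keeps .sortColumn, so this works.
same_rows <- keeping.order(d, ddply, .(g), mutate, v = v - mean(v))

# summarise returns one row per group and drops .sortColumn, so the check fires.
msg <- tryCatch(
  keeping.order(d, ddply, .(g), summarise, m = mean(v)),
  error = function(e) conditionMessage(e)
)
msg
```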

Edited to include a strategy for more general cases

If ddply is outputting something sorted in an order you do not like, you basically have two options: specify the desired ordering on the splitting variables beforehand using ordered factors, or manually sort the output after the fact.

For instance, consider the following data:

d <- data.frame(x1 = rep(letters[1:3], each = 5),
                x2 = rep(letters[4:6], 5),
                x3 = 1:15, stringsAsFactors = FALSE)

This uses strings for now. ddply will sort the output, which in this case means the default lexical ordering, regardless of the row order of the input:

> ddply(d,.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  a  d   5
2  a  e   7
3  a  f   3
4  b  d  17
5  b  e   8
6  b  f  15
7  c  d  13
8  c  e  25
9  c  f  27

> ddply(d[sample(1:15,15),],.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  a  d   5
2  a  e   7
3  a  f   3
4  b  d  17
5  b  e   8
6  b  f  15
7  c  d  13
8  c  e  25
9  c  f  27

If the resulting data frame isn't ending up in the "right" order, it's probably because you really want some of those variables to be ordered factors. Suppose that we really wanted x1 and x2 ordered like so:

d$x1 <- factor(d$x1, levels = c('b','a','c'),ordered = TRUE)
d$x2 <- factor(d$x2, levels = c('d','f','e'), ordered = TRUE)

Now when we use ddply, the resulting sort will be as we intend:

> ddply(d,.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  b  d  17
2  b  f  15
3  b  e   8
4  a  d   5
5  a  f   3
6  a  e   7
7  c  d  13
8  c  f  27
9  c  e  25

The moral of the story here is that if ddply is outputting something in an order you didn't intend, it's a good sign that you should be using ordered factors for the variables you're splitting on.
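
The other option, sorting manually after the fact, is not shown above, so here is a minimal sketch of it. It assumes the order you want is the order in which each group first appears in the input. The data and the first_seen variable are made up for illustration.

```r
library(plyr)

# Illustrative data: groups first appear in the order b, a, c
d <- data.frame(g = c("b", "b", "a", "c", "a"), v = 1:5,
                stringsAsFactors = FALSE)

# ddply returns the groups sorted: a, b, c
out <- ddply(d, .(g), summarise, val = sum(v))

# Reorder the summary by each group's first appearance in the input.
first_seen <- unique(d$g)
out <- out[order(match(out$g, first_seen)), ]
rownames(out) <- NULL
out
```

The same match() trick works with any reference vector, so it also covers an arbitrary custom order without converting the column to a factor.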

R Plyr - Ordering results from DDPLY?

I'll use this occasion to advertise a bit for data.table, which is faster to run and (in my perception) at least as elegant to write:

library(data.table)
library(ggplot2)  # provides the diamonds dataset
ddims <- data.table(diamonds)
system.time(ddims <- ddims[, list(depth=mean(depth), table=mean(table)), by=color][order(depth)])

   user  system elapsed
  0.003   0.000   0.004

By contrast, without ordering, your ddply code already takes 30 times longer:

   user  system elapsed
  0.106   0.010   0.119

With all the respect I have for Hadley's excellent work, e.g. on ggplot2, and general awesomeness, I must confess that for me, data.table entirely replaced ddply -- for speed reasons.
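
As a side note on ordering, data.table's by returns groups in the order they first appear, while keyby sorts them, so the "without sorting" behaviour comes for free. A small sketch with made-up data:

```r
library(data.table)

# Illustrative data: groups first appear in the order b, a, c
dt <- data.table(g = c("b", "b", "a", "c", "a"), v = 1:5)

# by keeps groups in order of first appearance
res_by <- dt[, .(val = sum(v)), by = g]

# keyby sorts the result by the grouping column
res_keyby <- dt[, .(val = sum(v)), keyby = g]

res_by$g     # "b" "a" "c"
res_keyby$g  # "a" "b" "c"
```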

ddply for entire data with no groups?

The function ddply will accept an "empty" grouping variable and perform the analysis on the entire table.

With subgroups:

ddply(baseball, .(lg), c("nrow", "ncol"))
  lg  nrow ncol
1       65   22
2 AA   171   22
3 AL 10007   22
4 FL    37   22
5 NL 11378   22
6 PL    32   22
7 UA     9   22

Without subgroups:

ddply(baseball, .(), c("nrow", "ncol"))
   .id  nrow ncol
1 <NA> 21699   22
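
The same empty grouping should work with a custom function as well. The sketch below uses the built-in mtcars data rather than baseball, and drops the placeholder .id column afterwards. The summary columns n and mean_mpg are just illustrative names.

```r
library(plyr)

# .() means "no grouping variables", so the function sees the whole table once.
whole <- ddply(mtcars, .(), function(df) {
  data.frame(n = nrow(df), mean_mpg = mean(df$mpg))
})

# Drop the placeholder grouping column, which only holds NA.
whole$.id <- NULL
whole
```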

Keeping all columns when using ddply

Similar answer to @bramtayl, but also using a filter.

> library(dplyr)

> new_df <- x %>%
+ group_by(X) %>%
+ mutate(myDate = as.Date(myDate, format = '%d.%m.%y')) %>%
+ filter(myDate == min(myDate))

> new_df
Source: local data frame [3 x 5]
Groups: X [3]

       X      Y      c      d     myDate
  (fctr) (fctr) (fctr) (fctr)     (date)
1     a1     14     cd      a 2012-05-04
2     c1     15     ss      g 2010-09-09
3     b1     12     aa      p 2012-02-01

> unique(x$X) %>% length == nrow(new_df)
[1] TRUE

> unique(x$X) %>% length == length(new_df)
[1] FALSE
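
The x used above is not shown, so here is a runnable sketch with a small made-up stand-in (only the columns X, Y and myDate, with invented values). It shows that filtering on the group minimum keeps every column of the matching rows, which is the point of the approach.

```r
library(dplyr)

# Made-up stand-in for x; dates are day.month.two-digit-year strings.
x <- data.frame(X = c("a1", "a1", "b1"),
                Y = c(14, 20, 12),
                myDate = c("04.05.12", "01.01.13", "01.02.12"),
                stringsAsFactors = FALSE)

new_df <- x %>%
  mutate(myDate = as.Date(myDate, format = "%d.%m.%y")) %>%
  group_by(X) %>%
  filter(myDate == min(myDate)) %>%   # keep the earliest row per group
  ungroup()

new_df
```

Note that if two rows in a group share the earliest date, both are kept.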

Sorting several dates by one observation

Using plyr package:

ddply(dat, .(business_id), function(x)
  if (length(x$date) > 1)
    diff(range(as.POSIXct(x$date)))
  else 0)

              business_id  V1
1  6TWRuHn24DL6vnW8Uyu4Vw   0
2  DxUn-ukNL27GOuwjnFGFKA   0
3  FV0BkoGOd3Yu_eJnXY15ZA   0
4  pF7uRzygyZsltbmVpjIyvw   0
5  PlcCjELzSI3SqX7mPF5cCw   0
6  Trar_9cFAj6wXiXfKfEqZA   0
7  XkNQVTkCEzBrq7OlRHI11Q 692
8  Xo9Im4LmIhQrzJcO4R3ZbA   0
9  xsSnuGCCJD4OgWnOZ0zB4A   0
10 Z67obTep38V9HMtA10yu5A   0
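
One thing to watch: diff() on date-times returns a difftime whose unit is chosen automatically, so V1 does not say what it is measured in. A variant that pins the unit to days could look like this. It is a sketch on made-up data, span_days is an illustrative name, and it treats the dates as plain calendar dates.

```r
library(plyr)

# Made-up data: business A has two dates, business B only one.
dat <- data.frame(business_id = c("A", "A", "B"),
                  date = c("2012-01-01", "2012-01-11", "2012-03-05"),
                  stringsAsFactors = FALSE)

# Span between earliest and latest date, always expressed in days.
span_days <- function(x) {
  rng <- range(as.Date(x$date))
  as.numeric(difftime(rng[2], rng[1], units = "days"))
}

res <- ddply(dat, .(business_id), span_days)
res
```

With a single date, range() returns that date twice, so the span is 0 without a separate length check.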

Passing a function argument to ddply

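You are passing the string "Species" to ddply, so inside the summarise call you need get() to retrieve the column that the string names. ddply itself recognizes the column name given as a string, and here() lets summarise see feature from the enclosing function.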

library(plyr)
IG_test <- function(data, feature) {
  dd <- ddply(data, feature, here(summarise), N = length(get(feature)))
  return(dd)
}

IG_test(iris, "Species")
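
If all you need is the group size, a variant that avoids get() and here() altogether is to pass an ordinary function, which receives each piece as a data frame. This is just a sketch, and count_by is an illustrative name.

```r
library(plyr)

# ddply accepts the grouping column as a character string;
# the anonymous function gets each group's rows as `piece`.
count_by <- function(data, feature) {
  ddply(data, feature, function(piece) data.frame(N = nrow(piece)))
}

res <- count_by(iris, "Species")
res
```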

