R: 'Split' Preserving Natural Order of Factors

split converts its second argument, f, to a factor if it isn't one already. So, if you want the order to be retained, convert the column to a factor yourself, with the levels in the desired order. That is:

df$yearmon <- factor(df$yearmon, levels=unique(df$yearmon))
# now split
split(df, df$yearmon)
# $`4_2013`
# Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 1 2013-04-01 INDUSINDBK SIEMENS 4_2013
# 2 2013-04-01 NMDC WIPRO 4_2013

# $`9_2012`
# Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 3 2012-09-28 LUPIN SAIL 9_2012
# 4 2012-09-28 ULTRACEMCO STER 9_2012

# $`4_2012`
# Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 5 2012-04-27 ASIANPAINT RCOM 4_2012
# 6 2012-04-27 BANKBARODA RPOWER 4_2012
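To see the difference side by side, here is a minimal sketch on made-up data (the value column and the six rows are assumptions for illustration, not the original data frame). It compares the group order split produces by default with the order obtained from explicit levels:

```r
# Hypothetical data: months appear in the order 4_2013, 9_2012, 4_2012
df <- data.frame(
  value   = 1:6,
  yearmon = c("4_2013", "4_2013", "9_2012", "9_2012", "4_2012", "4_2012"),
  stringsAsFactors = FALSE
)

# Default: the character column is coerced to a factor with sorted levels
default_order <- names(split(df, df$yearmon))

# Explicit levels in order of first appearance keep the natural order
f <- factor(df$yearmon, levels = unique(df$yearmon))
natural_order <- names(split(df, f))

print(default_order)   # "4_2012" "4_2013" "9_2012"
print(natural_order)   # "4_2013" "9_2012" "4_2012"
```

Passing the factor as a separate object, as here, leaves the column in df as character; overwriting the column as in the answer works equally well.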

That said, split tends to become very slow as the number of levels grows. So, I'd suggest using data.table to subset into a list instead; it should be much faster:

require(data.table)
dt <- data.table(df)
dt[, grp := .GRP, by = yearmon]
setkey(dt, grp)
o2 <- dt[, list(list(.SD)), by = grp]$V1

Benchmarking on data with many groups:

set.seed(45)
dates <- seq(as.Date("1900-01-01"), as.Date("2013-12-31"), by = "days")
ym <- do.call(paste, c(expand.grid(1:500, 1900:2013), sep="_"))

df <- data.frame(x1 = sample(dates, 1e4, TRUE),
                 x2 = sample(letters, 1e4, TRUE),
                 x3 = sample(10, 1e4, TRUE),
                 yearmon = sample(ym, 1e4, TRUE),
                 stringsAsFactors = FALSE)

require(data.table)
dt <- data.table(df)

f1 <- function(dt) {
    dt[, grp := .GRP, by = yearmon]
    setkey(dt, grp)
    o1 <- dt[, list(list(.SD)), by = grp]$V1
}

f2 <- function(df) {
    df$yearmon <- factor(df$yearmon, levels = unique(df$yearmon))
    o2 <- split(df, df$yearmon)
}

require(microbenchmark)
microbenchmark(o1 <- f1(dt), o2 <- f2(df), times = 10)

# Unit: milliseconds
#          expr        min         lq     median        uq      max neval
#  o1 <- f1(dt)   43.72995   43.85035   45.20087  715.1292 1071.976    10
#  o2 <- f2(df) 4485.34205 4916.13633 5210.88376 5763.1667 6912.741    10

Note that o1 will be an unnamed list, but you can set the names with names(o1) <- unique(dt$yearmon).

Keeping the original order of data after split() in R

We can split by 'study.name' converted to a factor whose levels are the unique elements of the column. unique returns the values in order of first occurrence, so the original order is preserved:

split(D, factor(D$study.name, levels = unique(D$study.name)))

If we need to drop the rows where 'study.name' is NA or empty, subset the data before the split:

D1 <- subset(D, !(is.na(study.name)| study.name == ""))
split(D1, factor(D1$study.name, levels = unique(D1$study.name)))
#$Shin.Ellis
# study.name group.name n mpre mpos sdpre sdpos r autoreg t sdif F1 sdp df2 post control outcome ESL prof scope type
#1 Shin.Ellis ME.short 13 0.34 0.72 0.37 0.34 0.5 FALSE NA NA NA NA NA 1 FALSE 1 1 2 1 2
#2 Shin.Ellis ME.long 13 0.34 0.39 0.37 0.36 0.5 TRUE NA NA NA NA NA 2 FALSE 1 1 2 1 2
#3 Shin.Ellis DCF.Short 15 0.37 0.54 0.38 0.36 0.5 FALSE NA NA NA NA NA 1 FALSE 1 1 2 1 2
#4 Shin.Ellis DCF.Long 15 0.37 0.49 0.38 0.36 0.5 TRUE NA NA NA NA NA 2 FALSE 1 1 2 1 2
#5 Shin.Ellis Cont.Short 16 0.32 0.28 0.37 0.36 0.5 FALSE NA NA NA NA NA 1 TRUE 1 1 2 1 2
#6 Shin.Ellis Cont.Long 16 0.32 0.35 0.37 0.32 0.5 TRUE NA NA NA NA NA 2 TRUE 1 1 2 1 2

#$Trus.Hsu
# study.name group.name n mpre mpos sdpre sdpos r autoreg t sdif F1 sdp df2 post control outcome ESL prof scope type
#8 Trus.Hsu Exper 21 0.0799 0.1130 0.0367 0.0472 0.5 FALSE NA NA NA NA NA 1 FALSE 1 2 2 2 1
#9 Trus.Hsu Cont 26 0.0763 0.1095 0.0389 0.0537 0.5 FALSE NA NA NA NA NA 1 TRUE 1 2 2 2 1

#$kabla
# study.name group.name n mpre mpos sdpre sdpos r autoreg t sdif F1 sdp df2 post control outcome ESL prof scope type
#11 kabla ME.short 13 0.34 0.72 0.37 0.34 0.5 FALSE NA NA NA NA NA 1 FALSE 1 1 3 0 1
#12 kabla ME.long 13 0.34 0.39 0.37 0.36 0.5 FALSE NA NA NA NA NA 2 FALSE 1 1 3 0 1
#13 kabla DCF.Short 15 0.37 0.54 0.38 0.36 0.5 FALSE NA NA NA NA NA 1 FALSE 1 1 3 0 1
#14 kabla DCF.Long 15 0.37 0.49 0.38 0.36 0.5 FALSE NA NA NA NA NA 2 FALSE 1 1 3 0 1
#15 kabla Cont.Short 16 0.32 0.28 0.37 0.36 0.5 FALSE NA NA NA NA NA 1 TRUE 1 1 3 0 1
#16 kabla Cont.Long 16 0.32 0.35 0.37 0.32 0.5 FALSE NA NA NA NA NA 2 TRUE 1 1 3 0 1
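A small sketch of why the filtering step matters, using a cut-down, made-up version of D (only two columns, values chosen for illustration). By default factor() excludes NA from the levels, so NA rows are dropped by split anyway, whereas an empty string becomes a group of its own unless it is filtered out first:

```r
# Hypothetical, reduced stand-in for D
D <- data.frame(
  study.name = c("Shin.Ellis", "Shin.Ellis", NA, "Trus.Hsu", "", "kabla"),
  n          = c(13, 13, NA, 21, NA, 13),
  stringsAsFactors = FALSE
)

# Without filtering: NA is not a level, but "" is kept as a level
f_all <- factor(D$study.name, levels = unique(D$study.name))
unfiltered <- split(D, f_all)

# With filtering: only the named studies remain, in order of appearance
D1 <- subset(D, !(is.na(study.name) | study.name == ""))
filtered <- split(D1, factor(D1$study.name, levels = unique(D1$study.name)))

print(names(unfiltered))
print(names(filtered))
```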

Can't split dataframe into equal buckets preserving order without introducing Xn. prefix

I'm not entirely clear on what your issue is, but if you are just trying to get the sum by group, couldn't you use:

library(tidyverse)

df <- data.frame(actual_duration = sample(100))

df %>%
  arrange(actual_duration) %>%
  mutate(group = rep(1:10, each = 10)) %>%
  group_by(group) %>%
  summarise(sums = sum(actual_duration))

Alternatively, if you want to keep the list format:

df %>%
  arrange(actual_duration) %>%
  mutate(group = factor(rep(1:10, each = 10))) %>%
  split(., .$group) %>%
  map(., function(x) sum(x$actual_duration))
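For comparison, the same bucketing can be sketched in base R without the tidyverse. This is only an illustration under the same assumption as above (100 rows cut into 10 ordered buckets of 10); the names sorted, bucket, buckets and bucket_sums are made up for the example:

```r
set.seed(1)
df <- data.frame(actual_duration = sample(100))

# Sort first so that the buckets follow the order of the values
sorted <- df[order(df$actual_duration), , drop = FALSE]

# Bucket labels 1..10, ten consecutive rows each
bucket <- rep(1:10, each = 10)

# Keep the list format: one data frame per bucket, named "1" to "10"
buckets <- split(sorted, bucket)

# One sum per bucket; the names are the plain bucket labels
bucket_sums <- vapply(buckets, function(x) sum(x$actual_duration), numeric(1))
print(bucket_sums)
```

Because the result is computed per list element rather than by flattening the list, the element names stay as the bucket labels.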

Split data frame based on group into list in defined order in R

We can order the dataset first and then split on 'cycle', using a factor whose levels are the unique elements in their order of appearance:

t1 <- ts_df[order(ts_df$date), ]
split(t1, factor(t1$cycle, levels = unique(t1$cycle)))
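A minimal runnable sketch of the two steps, with a made-up ts_df (the four rows and their values are assumptions, since the original data is not shown):

```r
# Hypothetical data: cycles are not in alphabetical order when sorted by date
ts_df <- data.frame(
  date  = as.Date(c("2020-03-01", "2020-01-01", "2020-02-01", "2020-01-15")),
  cycle = c("c", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Step 1: order by date
t1 <- ts_df[order(ts_df$date), ]

# Step 2: split by cycle, with levels in the order they appear after sorting
parts <- split(t1, factor(t1$cycle, levels = unique(t1$cycle)))

print(names(parts))   # "b" "a" "c"
```

The list follows the date order of each cycle's first row rather than the alphabetical order of the cycle labels.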

Sort list of strings by order of numeric parts

If your paths have the same pattern and only the last number changes, then you can use mixedorder from the gtools package; otherwise, think about using gsub and a regular expression.

library(gtools)
L[mixedorder(sapply(L, function(x) x[1], simplify=TRUE), decreasing=FALSE)]

L is the list containing your paths.
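If gtools is not available, the regular-expression route mentioned above could look roughly like this. It is a sketch that assumes every path ends in "/" followed by digits; the shortened paths and the names first, num and sorted_L are invented for the example:

```r
# Hypothetical, shortened paths with the same shape as the real ones
L <- list(
  c("tmp/run/22", "tmp/run/30"),
  c("tmp/run/0", "tmp/run/9", "tmp/run/5"),
  c("tmp/run/6", "tmp/run/7", "tmp/run/8")
)

# First path of each list element
first <- vapply(L, function(x) x[1], character(1))

# Keep only the trailing number and convert it to numeric
num <- as.numeric(sub(".*/([0-9]+)$", "\\1", first))

# Reorder the list by that number
sorted_L <- L[order(num)]
print(sorted_L)
```

Converting to numeric is what makes 6 sort before 22; ordering the raw strings would compare them character by character.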

Example:

For the sample data provided below this would be the answer:

#Original List before sorting:
# > L
# [[1]]
# [1] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/22"
# [2] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/30"
#
# [[2]]
# [1] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/0"
# [2] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/9"
# [3] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/5"
#
# [[3]]
# [1] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/6"
# [2] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/7"
# [3] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/8"
#

Sorted list based on the first element:

L[mixedorder(sapply(L, function(x) x[1], simplify=TRUE), decreasing=FALSE)]
# [[1]]
# [1] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/0"
# [2] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/9"
# [3] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/5"
#
# [[2]]
# [1] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/6"
# [2] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/7"
# [3] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/8"
#
# [[3]]
# [1] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/22"
# [2] "C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/30"
#

Sample Data

L <-
list(c("C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/22",
"C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/30"
), c("C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/0",
"C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/9",
"C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/5"
), c("C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/6",
"C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/7",
"C:\\Users\\agaz\\AppData\\Local\\Temp\\Rtmp0wZI21/008947515435900b4d1a0b8d/8"))

R: How to change item (or object) names of a data.table list?

Here's how I'd approach it at the moment.

require(data.table)
tmp = setDT(df)[, list(grp=list(.SD)), by=.(product, year), .SDcols=names(df)]
setattr(ans <- tmp$grp, 'names', paste(tmp$product, tmp$year, sep="."))
ans
# $b.2001
# product value year
# 1: b 7 2001
#
# $a.2001
# product value year
# 1: a 3 2001
#
# $b.2000
# product value year
# 1: b 10 2000
#
# $a.2000
# product value year
# 1: a 9 2000

I've filed FR #1389 to provide a split.data.table method, with which this should be possible in one step.

But in most cases, it's easier to deal with one data.frame/data.table instead of a list. So a bit more insight into what your downstream tasks are might help figure out if this is really necessary.
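If a plain data.frame is enough, a base R sketch can produce a list with the same product.year names. The four rows below are reconstructed from the printed output above and their original order is an assumption; key is a helper name made up for the example:

```r
# Rows reconstructed from the output shown above (row order assumed)
df <- data.frame(
  product = c("b", "a", "b", "a"),
  value   = c(7, 3, 10, 9),
  year    = c(2001, 2001, 2000, 2000),
  stringsAsFactors = FALSE
)

# One label per row, e.g. "b.2001"
key <- paste(df$product, df$year, sep = ".")

# Levels in order of first appearance, so the list keeps the row order
ans <- split(df, factor(key, levels = unique(key)))

print(names(ans))   # "b.2001" "a.2001" "b.2000" "a.2000"
```

Here the names come straight from the factor levels, so no separate renaming step is needed.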

Divide data frame by vector, by rows and not by columns

You can just use the transpose function:

> df[,2:4] <- t(t(df[,2:4]) / div)
> df
  type  V1   V2    V3
1    A 0.1 0.01 0.001
2    B 0.1 0.01 0.001
3    C 0.1 0.01 0.001
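The double transpose works because R recycles the divisor down the columns of a matrix, so after the first t() each original column sits in a row and meets its own element of div. A sketch with made-up numbers (the values in df and div are assumptions) that also checks the result against sweep, a base R alternative:

```r
# Hypothetical data: each column should be divided by its own divisor
df <- data.frame(
  type = c("A", "B", "C"),
  V1   = c(1, 2, 3),
  V2   = c(10, 20, 30),
  V3   = c(100, 200, 300)
)
div <- c(1, 10, 100)

# Transpose, divide (recycling runs down the columns), transpose back
by_transpose <- t(t(df[, 2:4]) / div)

# Same operation with sweep over margin 2 (columns)
by_sweep <- sweep(as.matrix(df[, 2:4]), 2, div, "/")

print(by_transpose)
print(all.equal(unname(by_transpose), unname(by_sweep)))
```

With these numbers every column comes out as 1, 2, 3, which makes it easy to confirm that each column was divided by the matching element of div.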

