Fast Alternative to Split in R

Fast alternative to split in R

Split the row indices of pop:

idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code))

Split is not slow, e.g.,

> system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
user system elapsed
1.056 0.000 1.058

so if yours is slow, I guess there's some aspect of your data that slows things down, e.g., ID and code are both factors with many levels, so their complete interaction, rather than just the level combinations appearing in your data set, is calculated:

> length(split(1:10, list(factor(1:10), factor(10:1))))
[1] 100
> length(split(1:10, paste(letters[1:10], letters[1:10], sep="-")))
[1] 10

or perhaps you're running out of memory.
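If the factor interaction is the culprit, a small tweak (assuming ID and code really are factors in pop) is to pass drop = TRUE so that only the combinations actually present in the data are kept:

# keep only the ID/code combinations that occur in the data,
# not the full factor interaction
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code), drop = TRUE)

# the same toy example as above, now with drop = TRUE
length(split(1:10, list(factor(1:10), factor(10:1)), drop = TRUE))
# [1] 10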

Use mclapply rather than parLapply if you're using processes on a non-Windows machine (which I guess is the case, since you ask about detectCores()).

par_pop <- mclapply(idx, function(i, pop, fun) fun(pop[i,]), pop, func)

Conceptually it sounds like you're really aiming for pvec (distribute a vectorized calculation over processors) rather than mclapply (iterate over individual rows in your data frame).
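A rough sketch of the conceptual difference, using a toy vectorized function rather than your func:

library(parallel)

# mclapply: one function call per element of the list
system.time(res_a <- mclapply(1:100000, sqrt, mc.cores = 4))

# pvec: the vector is cut into a few large chunks and the vectorized
# function is called once per chunk (forking, so not on Windows)
system.time(res_b <- pvec(1:100000, sqrt, mc.cores = 4))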

Also, and really as the initial step, consider identifying the bottlenecks in func; the data is large but not that big, so perhaps parallel evaluation is not needed -- maybe the code is written in an iterative rather than a vectorized R style? Pay attention to data types in the data frame, e.g., factor versus character. It's not unusual to get a 100x speed-up between poorly written and efficient R code, whereas parallel evaluation is at best proportional to the number of cores.
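For example, profiling a representative chunk of the work with Rprof() will usually show where the time actually goes (func, pop, and idx here stand in for your own objects):

# profile func over a sample of the groups
Rprof("func.prof")
tmp <- lapply(idx[1:100], function(i) func(pop[i, ]))
Rprof(NULL)
summaryRprof("func.prof")$by.self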

Faster alternative to split-apply-combine

As per the comments, my suspicion about overhead is wrong. The inner function takes ~7 milliseconds to execute, and 0.007 * 4800 = 33.6 seconds.

So with regard to:

Split-apply-combine with plyr::dlply seems to be inefficient because of the overhead required to split and combine. Am I mistaken, or is there a better/faster way?

The answer is that

It would probably be unreasonable to expect serious speedups without making the inner function faster.

I am, in fact, mistaken.
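For the record, the back-of-the-envelope check behind that number can be reproduced along these lines (inner_fun and one_group are placeholders, not objects from the question):

library(microbenchmark)

# time a single call of the (placeholder) inner function on one group of data
microbenchmark(inner_fun(one_group), times = 100L)

# median time per call multiplied by the number of groups gives a lower
# bound on the total runtime, e.g. 0.007 s per call * 4800 groups:
0.007 * 4800
# [1] 33.6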

More memory efficient way than strsplit() to split a string into two in R

Here is a quick comparison of different methods to do this:

library(stringi)
library(dplyr)

# get some sample data
set.seed(1)
long_string <- stri_paste(stri_rand_lipsum(10000), collapse = " ")
x <- sample(9000:11000, 1)
split_string <- substr(long_string, x, x + 49)

result <- long_string %>% strsplit(., split_string)
length(unlist(result))
#> [1] 2

substr_fun <- function(str, pattern) {
  # locate the first (fixed, non-regex) match of pattern in str
  idx <- regexpr(pattern, str, fixed = TRUE)
  # return the pieces before and after the match, mimicking strsplit()'s output
  res1 <- list(c(substr(str, 1, idx - 1),
                 substr(str, idx + attr(idx, "match.length"), nchar(str))))
  return(res1)
}

bench::mark(
  strsplit_dplyr = long_string %>% strsplit(., split_string),
  strsplit_dplyr_fixed = long_string %>% strsplit(., split_string, fixed = TRUE),
  strsplit = strsplit(long_string, split_string),
  strsplit_fixed = strsplit(long_string, split_string, fixed = TRUE),
  stri_split_fixed = stringi::stri_split_fixed(long_string, split_string),
  str_split = stringr::str_split(long_string, stringr::fixed(split_string)),
  substr_fun = substr_fun(long_string, split_string)
)
#> # A tibble: 7 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 strsplit_dplyr          131ms  134.8ms      7.44      280B        0
#> 2 strsplit_dplyr_fixed   36.6ms   37.6ms     26.5       280B        0
#> 3 strsplit                133ms  133.8ms      7.40        0B        0
#> 4 strsplit_fixed         35.4ms   37.2ms     26.7         0B        0
#> 5 stri_split_fixed       40.7ms   42.5ms     23.6      6.95KB       0
#> 6 str_split              41.6ms   43.1ms     23.4     35.95KB       0
#> 7 substr_fun              13.6ms  14.8ms     67.1         0B        0

In terms of memory usage, strsplit with the option fixed = TRUE and without the overhead from piping is the best solution. The stringi and stringr implementations are slightly slower than strsplit with fixed = TRUE in this run, and their memory overhead is even larger than the effect from piping.

Update

I added the method from @H 1's answer, as well as his approach of taking a 50-character substring to use for splitting. The only change is that I wrapped it in a function and added fixed = TRUE again, since I think that makes more sense in this case.

The new function is the clear winner if you do not want to make more than one split in your string!
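If you ever do need more than one split, a gregexpr()-based variant of the same idea could look like this (a sketch along the same lines, not part of the benchmark above):

substr_fun_all <- function(str, pattern) {
  m <- gregexpr(pattern, str, fixed = TRUE)[[1]]
  if (m[1] == -1) return(list(str))        # no match: return the input as-is
  len    <- attr(m, "match.length")
  starts <- c(1, m + len)                  # start positions of the pieces
  ends   <- c(m - 1, nchar(str))           # end positions of the pieces
  list(substring(str, starts, ends))
}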

Efficient way to split a huge string in R

In case you wanted to do the operation in a piped tidyverse way, you could try using stringr::str_extract with some regex:

library(dplyr)
library(stringr)
library(glue)

mydf |>
  mutate(next_title = lead(title, default = "$")) |>
  mutate(text = str_extract(mystring, glue::glue("(?<={title}\\s?)(.*)(?={next_title})"))) |>
  select(-next_title)

Yielding:

page    title                                      text
1 1 Lorem ipsum dolor sit amet, sollicitudin duis
2 2 maecenas habitasse ultrices aenean tempus

If performance is a concern, a similar approach with data.table would be:

library(data.table)
library(stringr)
library(glue)

mydt <- setDT(mydf)

mydt[, next_title := shift(title, fill = "$", type = "lead")][
  , text := str_extract(..mystring, glue_data(.SD, "(?<={title}\\s?)(.*)(?={next_title})"))][
  , !("next_title")]

Resulting in:

   page    title                                      text
1: 1 Lorem ipsum dolor sit amet, sollicitudin duis
2: 2 maecenas habitasse ultrices aenean tempus

EDIT

Added some options for better performance:

Generally, str_split or str_split_fixed will be a faster way to go than str_extract.

The problem with str_split is that a regex with many alternations will also slow things down, so another option is to first replace all the titles in the string with some fixed marker string, and then split on that. Another thing you can do to speed up the splitting is to use str_split_fixed and pre-specify how many splits to process, as in the code below.

# create named character vector for the str_replace_all function
split_at <- rep("@@", nrow(mydf))
names(split_at) <- mydf$title
mystring <- str_replace_all(mystring, split_at)

# use fixed() in str_split
mydf$text <- str_split(mystring, fixed("@@ "))[[1]][-1]

# alternative (maybe faster): define the number of splits by nrow
mydf$text <- str_split_fixed(mystring, fixed("@@ "), n = nrow(mydf) + 1)[, -1]

## using str_split_fixed in data.table
mydt <- setDT(mydf)
mydt[, text := str_split_fixed(mystring, fixed("@@ "), nrow(mydt) + 1)[, -1]]

Why is split inefficient on large data frames with many groups?

This isn't strictly a split.data.frame issue; there is a more general problem with the scalability of data.frame for many groups.

You can get a pretty nice speed-up if you use split.data.table. I developed this method on top of regular data.table methods and it seems to scale pretty well here.
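For reference, the timings below come from a data frame with a very large number of groups; the exact df is defined in the question, so the following is only an assumption about its shape:

library(magrittr)  # provides %>% used below

set.seed(1)
N  <- 2e6                                              # rows (assumed size)
df <- data.frame(x = sample(5e5, N, replace = TRUE),   # roughly 500k groups
                 y = runif(N))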

system.time(
  l1 <- df %>% split(.$x)
)
#    user  system elapsed
# 200.936   0.000 217.496

library(data.table)
dt <- as.data.table(df)

system.time(
  l2 <- split(dt, by = "x")
)
#   user  system elapsed
#  7.372   0.000   6.875

system.time(
  l3 <- split(dt, by = "x", sorted = TRUE)
)
#   user  system elapsed
#  9.068   0.000   8.200

sorted = TRUE will return the list in the same order as the data.frame method; by default, the data.table method preserves the order present in the input data. If you want to stick to data.frame, you can use lapply(l2, setDF) at the end.

PS. split.data.table was added in 1.9.7; installation of the devel version is pretty simple:

install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")

More about that in the Installation wiki.

Slow performance of split() function

I agree with @Roland's comment. To illustrate, here is an example.

  1. Let's generate some data: 200,000 entries at one-minute time intervals.

    set.seed(2018)
    df <- data.frame(
        date   = seq(from = as.POSIXct("2018-01-01 00:00"), by = "min", length.out = 200000),
        amount = runif(200000))
    head(df)
    #                 date     amount
    #1 2018-01-01 00:00:00 0.33615347
    #2 2018-01-01 00:01:00 0.46372327
    #3 2018-01-01 00:02:00 0.06058539
    #4 2018-01-01 00:03:00 0.19743361
    #5 2018-01-01 00:04:00 0.47431419
    #6 2018-01-01 00:05:00 0.30104860
  2. We now (1) create a new column date_hour that contains the date and hour parts of the full timestamp, (2) group_by column date_hour, and (3) sum the entries in column amount to give amount.sum.

    library(dplyr)

    df %>%
        mutate(date_hour = format(date, "%Y-%m-%d %H")) %>%
        group_by(date_hour) %>%
        summarise(amount.sum = sum(amount))
    ## A tibble: 3,333 x 2
    #   date_hour     amount.sum
    #   <chr>              <dbl>
    # 1 2018-01-01 00       28.9
    # 2 2018-01-01 01       26.4
    # 3 2018-01-01 02       32.7
    # 4 2018-01-01 03       29.9
    # 5 2018-01-01 04       29.7
    # 6 2018-01-01 05       28.5
    # 7 2018-01-01 06       34.2
    # 8 2018-01-01 07       33.8
    # 9 2018-01-01 08       30.7
    #10 2018-01-01 09       27.7
    ## ... with 3,323 more rows

This is very fast (it takes around 0.3 seconds on my 2012 MacBook Air), and you should be able to easily adjust this example to your particular case.
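For comparison, a split()-based equivalent of the same aggregation (my reconstruction, not code from the question) would split a single vector per hour and sum it; splitting just the amount vector is already much cheaper than splitting the whole data frame:

# sum amount per hour using split() on a vector rather than on the data frame
date_hour  <- format(df$date, "%Y-%m-%d %H")
amount.sum <- sapply(split(df$amount, date_hour), sum)
head(amount.sum)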

R Looking for faster alternative for sapply()

Depending on your use case, you might want to consider alternative packages (even though ngram claims to be fast). The fastest alternative here (for ng = 1) is to split the text into words and count the unique ones:

library(stringi)

stringi_get_unigrams <- function(text)
  lengths(lapply(stri_split(text, fixed = " "), unique))

system.time(res3 <- stringi_get_unigrams(df$text))
# user system elapsed
# 0.84 0.00 0.86

If you want something more complex (e.g. ng != 1) you'd need to build all pairs of adjacent words, which is a bit more involved:

stringi_get_duograms <- function(text) {
  splits <- stri_split(text, fixed = " ")
  # count unique pairs of adjacent words in each document
  comp <- function(x)
    nrow(unique(matrix(c(x[-1], x[-length(x)]), ncol = 2)))
  res <- sapply(splits, comp)
  res[res == 0] <- NA_integer_
  res
}
system.time(res <- stringi_get_duograms(df$text))
# user system elapsed
# 5.94 0.02 5.93

Here we have the added benefit of not crashing when there are no matching word combinations in a given document.
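A quick illustration of that edge case (my own check, not part of the benchmark):

stringi_get_duograms("word")  # a single word has no adjacent word pairs
# [1] NA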

Times of the original get_unigrams() approach on my CPU:

system.time({
  res <- get_unigrams(df$text)
})
#  user  system elapsed
# 12.72    0.16   12.94

An alternative parallel implementation:

get_unigrams_par <- function(text) {
  require(purrr)
  require(ngram)
  sapply(text, function(text)
    ngram(text, n = 1) %>% get.ngrams() %>% length()
  )
}

cl <- parallel::makeCluster(nc <- parallel::detectCores())
print(nc)
# [1] 12

system.time(
  res2 <- unname(unlist(parallel::parLapply(
    cl,
    split(df$text, sort(1:nrow(df) %% nc)),
    get_unigrams_par
  )))
)
# user  system elapsed
# 0.20    0.11    2.95

parallel::stopCluster(cl)

And just to check that all results are identical:

identical(unname(res), res2)
# TRUE
identical(res2, res3)
# TRUE

Edit:

Of course there's nothing stopping us from combining parallelization with any result above:

cl <- parallel::makeCluster(nc <- parallel::detectCores())
parallel::clusterEvalQ(cl, library(stringi))

system.time(
  res4 <- unname(unlist(parallel::parLapply(
    cl,
    split(df$text, sort(1:nrow(df) %% nc)),
    stringi_get_unigrams
  )))
)
# user  system elapsed
# 0.01    0.16    0.27

parallel::stopCluster(cl)

Faster function than aggregate() in R

I'm sure the real data is much larger, but your solution seems on point. As some alternatives, I benchmarked other approaches:
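To make the comparison reproducible, the data can look something like this (old.data and new.data come from the question, so the shapes and sizes here are only an assumption; note that the plyr variant below sums the first three columns, so x, y, z are placed before id):

library(dplyr)
library(microbenchmark)

set.seed(1)
make_df <- function(n = 1000)
  data.frame(x = rnorm(n), y = rnorm(n), z = rnorm(n),
             id = sample(letters, n, replace = TRUE))
old.data <- make_df()
new.data <- make_df()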

Tidyverse

tidy_fn <- function() {
  rbind(old.data, new.data) %>%
    group_by(id) %>%
    dplyr::summarise_all(function(x) sum(x))
}

plyr and base functions (I know... bad form)

plyr_base_fn <- function() {
  plyr::ldply(Map(
    function(x) sapply(x[1:3], sum),
    rbind(old.data, new.data) %>% split(., .$id)
  ))
}

Your aggregation approach:

agg_fn <- function() {
  aggregate(cbind(x, y, z) ~ id, rbind(old.data, new.data), sum, na.rm = FALSE)
}

Results from two tests:

1000 reps

> microbenchmark(tidy_fn(), agg_fn(), plyr_base_fn(), times = 1000L)
Unit: milliseconds
           expr      min       lq     mean   median       uq       max neval
      tidy_fn() 2.220585 2.386112 2.823122 2.529649 2.775300 13.425573  1000
       agg_fn() 1.668601 1.795527 2.149068 1.895666 2.062904 16.117802  1000
 plyr_base_fn() 1.253772 1.331501 1.567777 1.402464 1.526089  8.396307  1000

5000 reps

> microbenchmark(tidy_fn(), agg_fn(), plyr_base_fn(), times = 5000L)
Unit: milliseconds
           expr      min       lq     mean   median       uq        max neval
      tidy_fn() 2.227752 2.400265 2.696034 2.542617 2.722082  12.46249  5000
       agg_fn() 1.673647 1.792085 2.067232 1.897011 2.019915 301.84694  5000
 plyr_base_fn() 1.247306 1.336010 1.503682 1.411608 1.503290  14.24656  5000

