# Expand Ranges Defined by "From" and "To" Columns

## Expand ranges defined by from and to columns

You can use the `plyr` package:

```r
library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
#              name year
# 1    Barack Obama 2009
# 2    Barack Obama 2010
# 3    Barack Obama 2011
# 4    Barack Obama 2012
# 5    Bill Clinton 1993
# 6    Bill Clinton 1994
# [...]
```

If it is important that the data be sorted by year, you can use the `arrange` function:

```r
df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, df$year)
#              name year
# 1    Bill Clinton 1993
# 2    Bill Clinton 1994
# 3    Bill Clinton 1995
# [...]
# 21   Barack Obama 2011
# 22   Barack Obama 2012
```

Edit 1: Following @edgester's "Update 1", a more appropriate approach is to use `adply`, which works row by row and therefore handles presidents with non-consecutive terms:

```r
adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]
```
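To illustrate why row-wise expansion matters, here is a minimal base R sketch. The data frame below is hypothetical (it is not the `foo` or `presidents` data from the answer) and contains one name with two separate terms. Grouping by `name` would hand `seq()` two start and two end values for that name, whereas expanding each row on its own avoids the problem. `Map` is used here as a base R stand-in for `adply(foo, 1, ...)`, not as the answer's method.

```r
# Hypothetical data: one president appears in two non-consecutive rows
foo <- data.frame(
  name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"),
  from = c(1885, 1889, 1893),
  to   = c(1888, 1892, 1896),
  stringsAsFactors = FALSE
)

# Build one year sequence per row (not per name)
years <- Map(seq, foo$from, foo$to)

# Repeat each name once per generated year and flatten the sequences
expanded <- data.frame(
  name = rep(foo$name, lengths(years)),
  year = unlist(years)
)
```

Each input row contributes its own block of years, so both of the repeated name's terms survive in `expanded`.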

## Expand a data frame to have as many rows as range of two columns in original row

With `dplyr`, we can use `rowwise` with `do`:

```r
library(dplyr)
df1 %>%
    rowwise() %>%
    do(data.frame(symbol = .$symbol, value = .$start:.$end)) %>%
    arrange(symbol)
# A tibble: 30 x 2
#   symbol value
#    <chr> <int>
# 1      a     7
# 2      a     8
# 3      a     9
# 4      a    10
# 5      a    11
# 6      i     8
# 7      i     9
# 8      i    10
# 9      i    11
#10      i    12
# ... with 20 more rows
```

## Expand number range to the individual numbers

I have a `data.table` solution in mind.

I assume that your `label` variable uniquely identifies each observation. Otherwise, you should group your data by a row number.

```r
library(data.table)
df <- data.frame(start = c(10, 20), end = c(15, 33), label = c('ex1', 'ex2'))
setDT(df)
df[, seq(.SD[['start']], .SD[['end']]), by = label]
#     label V1
#  1:   ex1 10
#  2:   ex1 11
#  3:   ex1 12
#  4:   ex1 13
#  5:   ex1 14
#  6:   ex1 15
#  7:   ex2 20
#  8:   ex2 21
#  9:   ex2 22
# 10:   ex2 23
# 11:   ex2 24
# 12:   ex2 25
# 13:   ex2 26
# 14:   ex2 27
# 15:   ex2 28
# 16:   ex2 29
# 17:   ex2 30
# 18:   ex2 31
# 19:   ex2 32
# 20:   ex2 33
```

In terms of efficiency, it may be hard to find a faster solution than `data.table`, which is designed for that purpose.

If you can't use `label` as a unique identifier, you can add a row number and group by it as well:

```r
df[, 'rn' := seq(.N)]
df[, seq(.SD[['start']], .SD[['end']]), by = c('rn', 'label')]
#     rn label V1
#  1:  1   ex1 10
#  2:  1   ex1 11
#  3:  1   ex1 12
#  4:  1   ex1 13
#  5:  1   ex1 14
#  6:  1   ex1 15
#  7:  2   ex2 20
#  8:  2   ex2 21
#  9:  2   ex2 22
# 10:  2   ex2 23
# 11:  2   ex2 24
# 12:  2   ex2 25
# 13:  2   ex2 26
# 14:  2   ex2 27
# 15:  2   ex2 28
# 16:  2   ex2 29
# 17:  2   ex2 30
# 18:  2   ex2 31
# 19:  2   ex2 32
# 20:  2   ex2 33
```

You can then drop the intermediate row number using `df[, 'rn' := NULL]`.

### Efficiency

`data.table` gives a noticeable speedup over the `rowwise`/`do` approach. In this example it makes little difference whether you group by one column or two:

```
Unit: microseconds
                                                           expr      min       lq     mean   median       uq      max neval cld
                                  df %>% rowwise() %>% do(f(.)) 1549.408 1808.669 2309.332 2292.525 2555.888 7141.964   100   b
          df[, seq(.SD[["start"]], .SD[["end"]]), by = "label"] 1011.608 1302.249 1555.808 1490.542 1779.543 3061.487   100  a
 df[, seq(.SD[["start"]], .SD[["end"]]), by = c("label", "rn")]  968.124 1095.703 1387.556 1253.023 1592.483 2953.598   100  a
```

If you want to go even faster, you can set a key (see `?setkeyv`). If your data frame is large, this might bring substantial performance gains (in this small example it won't).
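For comparison, the same expansion can be written in base R without any per-group call to `seq`. This is only a sketch of an alternative, not part of the benchmark above, and no claim is made here about its relative speed. It assumes integer-valued `start` and `end` with `start <= end`; the names `n`, `idx`, and `expanded` are introduced for illustration.

```r
# Same example data as above, kept as a plain data.frame
df <- data.frame(start = c(10, 20), end = c(15, 33), label = c('ex1', 'ex2'),
                 stringsAsFactors = FALSE)

# Number of values each row expands to
n <- df$end - df$start + 1

# Index of the source row for every output row
idx <- rep(seq_along(n), n)

# sequence(n) yields 1..n[1], 1..n[2], ...; shift each run by its row's start
expanded <- data.frame(
  label = df$label[idx],
  V1    = df$start[idx] + sequence(n) - 1
)
```

Because the expansion is driven by row position rather than by `label`, this version does not need `label` to be unique.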

## Expand range of dates by another column by inserting rows in R

Here's a very pedestrian way of doing it:

```r
do.call(rbind, lapply(split(df, seq_along(df$idnum)), function(x) {
  # Nothing to split when 'between' already equals 'end'
  if (x$between[1] == x$end[1]) return(x)
  # Duplicate the row, then cut it at 'between'
  x <- x[c(1, 1), ]
  x$end[1] <- x$between[1]
  x$start[2] <- x$between[1] + 1
  x$between[2] <- NA
  x
}))
#>       idnum var      start        end    between
#> 1.1      17   A 1993-03-01 1993-03-01 1993-03-01
#> 1.1.1    17   A 1993-03-02 1993-03-12       <NA>
#> 2.2      17   B 1993-01-02 1993-04-03 1993-04-03
#> 2.2.1    17   B 1993-04-04 1993-04-09       <NA>
#> 3        20   A 1993-02-01 1993-02-01 1993-02-01
#> 4.4      21   C 1993-05-09 1993-07-10 1993-07-10
#> 4.4.1    21   C 1993-07-11 1993-07-12       <NA>
```

Created on 2020-07-26 by the reprex package (v0.3.0)
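The input data frame is not shown in this answer. Working backwards from the output above, it presumably looked something like the `df` below. This reconstruction is an assumption, as are the helper name `split_at_between` and the variable `result`. The sketch runs the same logic on it so the example can be reproduced end to end.

```r
# Input reconstructed from the printed output (an assumption, not the original data)
df <- data.frame(
  idnum   = c(17, 17, 20, 21),
  var     = c("A", "B", "A", "C"),
  start   = as.Date(c("1993-03-01", "1993-01-02", "1993-02-01", "1993-05-09")),
  end     = as.Date(c("1993-03-12", "1993-04-09", "1993-02-01", "1993-07-12")),
  between = as.Date(c("1993-03-01", "1993-04-03", "1993-02-01", "1993-07-10")),
  stringsAsFactors = FALSE
)

# Same logic as the anonymous function in the answer, given a name for clarity
split_at_between <- function(x) {
  if (x$between[1] == x$end[1]) return(x)  # nothing to split
  x <- x[c(1, 1), ]                        # duplicate the row
  x$end[1] <- x$between[1]                 # first piece ends at 'between'
  x$start[2] <- x$between[1] + 1           # second piece starts the next day
  x$between[2] <- NA
  x
}

# Process one row at a time and stack the pieces
result <- do.call(rbind, lapply(split(df, seq_along(df$idnum)), split_at_between))
```

Rows whose `between` date already equals `end` pass through unchanged, and every other row becomes two rows.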

## Expand data set to fill in with sequential values in R

We can get the rowwise sequence from 'Score2_Min' to 'Score2_Max' with `map2` in a `list` column and then `unnest` the `list` column:

```r
library(dplyr)
library(tidyr)
library(purrr)
data %>%
    transmute(Score1, Score2 = map2(Score2_Min, Score2_Max, `:`)) %>%
    unnest(Score2)
# A tibble: 14 x 2
#   Score1 Score2
#    <dbl>  <int>
# 1    286    108
# 2    286    109
# 3    286    110
# 4    286    111
# 5    287    112
# 6    287    113
# 7    288    112
# 8    288    113
# 9    289    112
#10    289    113
#11    290    112
#12    290    113
#13    291    112
#14    291    113
```

## Split a column consisting of number range and use the resulting numbers as range values in R

We can split 'Speed' into two columns with `separate`, then create a sequence `list` column from the values of 'start' and 'end', and `unnest` the `list` column:

```r
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
   separate(Speed, into = c('start', 'end'), remove = FALSE, convert = TRUE) %>%
   mutate(AcutalSpeed = map2(start, end, `:`), start = NULL, end = NULL) %>%
   unnest(c(AcutalSpeed))
# A tibble: 101 x 3
#   Speed SpeedLevel AcutalSpeed
#   <chr>      <dbl>       <int>
# 1 0-20           1           0
# 2 0-20           1           1
# 3 0-20           1           2
# 4 0-20           1           3
# 5 0-20           1           4
# 6 0-20           1           5
# 7 0-20           1           6
# 8 0-20           1           7
# 9 0-20           1           8
#10 0-20           1           9
# … with 91 more rows
```
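If the tidyverse packages are not available, the parsing step can be sketched in base R with `strsplit`. The two-row `df1` below is hypothetical (the original data is not shown), and the names `bounds`, `speeds`, `expanded`, and the column `ActualSpeed` are chosen for illustration. The sketch assumes every 'Speed' value has the form `"<start>-<end>"` with non-negative integers.

```r
# Hypothetical input in the same shape as the example above
df1 <- data.frame(Speed = c("0-20", "21-40"), SpeedLevel = c(1, 2),
                  stringsAsFactors = FALSE)

# Split each "start-end" string on the literal hyphen
bounds <- strsplit(df1$Speed, "-", fixed = TRUE)
start  <- as.integer(vapply(bounds, `[`, character(1), 1))
end    <- as.integer(vapply(bounds, `[`, character(1), 2))

# One integer sequence per row, then flatten alongside the repeated columns
speeds <- Map(seq, start, end)
expanded <- data.frame(
  Speed       = rep(df1$Speed, lengths(speeds)),
  SpeedLevel  = rep(df1$SpeedLevel, lengths(speeds)),
  ActualSpeed = unlist(speeds)
)
```

The original 'Speed' string is kept on every expanded row, mirroring `remove = FALSE` in the `separate` call.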
