Row Not Consolidating Duplicates in R When Using Multiple Months in Date Filter


The OP is wondering why the result is not consolidated by CUST_ID when more than one month of data is processed.

The reason is that the monthly files are read in and aggregated one by one, but a final aggregation step is needed to consolidate over all months.
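
To see why, here is a minimal sketch (toy data, not the OP's files): stacking the per-file aggregates with rbindlist() still leaves one row per CUST_ID per file until they are aggregated once more.

library(data.table)

# toy stand-ins for two monthly files, each already aggregated per file
jan <- data.table(CUST_ID = c("A", "B"), `Ext Sale` = c(10, 20))
feb <- data.table(CUST_ID = c("A", "B"), `Ext Sale` = c(5, 15))

stacked <- rbindlist(list(jan, feb))
stacked                                 # CUST_ID still appears once per month
#    CUST_ID Ext Sale
# 1:       A       10
# 2:       B       20
# 3:       A        5
# 4:       B       15

# the final aggregation consolidates over all months
stacked[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
#    CUST_ID Ext Sale
# 1:       A       15
# 2:       B       35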

The code below is a simplified replacement for the double for loops. I have left out the code for the "fast filter" test.

The first part creates a list of files to be processed. The second part does the processing.

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(in_filenames, function(fn) {
    # code for "fast filter" test goes here
    message("Processing file: ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate within the file
    temptable[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
  })
)[
  # THIS IS THE MISSING STEP:
  # second aggregation for overall totals
  , lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
Processing file: Raw Data/AA-CA/AA-CA 2017-01.txt
Processing file: Raw Data/AA-CA/AA-CA 2017-02.txt
Processing file: Raw Data/CB-HZ/CB-HZ 2017-01.txt
Processing file: Raw Data/CB-HZ/CB-HZ 2017-02.txt

mastertable
     CUST_ID Ext Sale
1: AK0010001  427.803
2: CO0020001 1540.300
3: CO0010001 -179.765

Note that chaining of data.table expressions is used here.
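
For readers unfamiliar with data.table chaining: the second [] is applied directly to the result of the first, so no intermediate object is needed. A tiny self-contained example:

library(data.table)

DT <- data.table(g = c("a", "a", "b"), v = 1:3)

# two statements with an intermediate object ...
tmp <- DT[v > 1]
tmp[, sum(v), by = g]

# ... or the same as one chained expression
DT[v > 1][, sum(v), by = g]
#    g V1
# 1: a  2
# 2: b  3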



Edit 1:

By request of the OP, here is the complete code (except for the "fast filtering" stuff). Some additional lines were modified; they are marked with ### MODIFIED.

library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "Raw Data" ### MODIFIED

# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")

# Set location for output
out_directory <- "YTD Master" ### MODIFIED
out_filename <- "OUTPUT.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2017-01-01"), as.Date("2017-02-28")) ### MODIFIED

# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
date_filter <- TRUE

##########
## CODE ##
##########

starttime <- Sys.time()

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(in_filenames, function(fn) {
    # code for fast filter test goes here
    message("Processing file: ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate by month
    temptable[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
  })
)[
  # second aggregation to consolidate over all months
  , lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]

# Save final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable) ### MODIFIED

print(Sys.time() - starttime)


Edit 2:

The OP has asked to include the "fast filter" code which I had omitted for brevity.

However, I have a different approach. Instead of reading the first line of each file to check if INVOICE_DT is within the given date_range, my approach filters the file names. The file names contain the year-month in ISO 8601 format.

So, a vector of allowed year-month strings is constructed from the given date_range. Only those file names which contain one of the allowed year-month strings are selected for further processing.
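
For example, with date_range <- c("2017-01-01", "2017-02-14") as set below, the allowed year-month strings are built like this:

library(magrittr)

seq(as.Date("2017-01-01"), as.Date("2017-02-14"), by = "1 month") %>%
  format("%Y-%m")
# [1] "2017-01" "2017-02"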

However, selecting the proper files is only the first step. As the date range may start or end right in the middle of a month, we also need to filter the rows of each processed file. This step is missing from the OP's code.

library(data.table, warn.conflicts = FALSE)
library(magrittr) ### MODIFIED
# library(lubridate, warn.conflicts = FALSE) ### MODIFIED

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "Raw Data" ### MODIFIED

# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")

# Set location for output
out_directory <- "YTD Master" ### MODIFIED
out_filename <- "OUTPUT.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- c("2017-01-01", "2017-02-14") ### MODIFIED

# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
# date_filter <- TRUE ### MODIFIED

##########
## CODE ##
##########

starttime <- Sys.time()

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# filter the filenames only: keep files whose name contains an allowed year-month
selected_in_filenames <-
  seq(as.Date(date_range[1]),
      as.Date(date_range[2]), by = "1 month") %>%
  format("%Y-%m") %>%
  lapply(function(x) stringr::str_subset(in_filenames, x)) %>%
  unlist()

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(selected_in_filenames, function(fn) {
    message("Processing file: ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate file but filtered for date_range
    temptable[INVOICE_DT %between% date_range,
              lapply(.SD, sum), by = .(CUST_ID, QTR = quarter(INVOICE_DT)),
              .SDcols = c("Ext Sale")]
  })
)[
  # second aggregation to consolidate over all files
  , lapply(.SD, sum), by = .(CUST_ID, QTR), .SDcols = c("Ext Sale")]

# Save final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable) ### MODIFIED

print(Sys.time() - starttime)

mastertable
     CUST_ID QTR Ext Sale
1: AK0010001   1  209.970
2: CO0020001   1 1540.300
3: CO0010001   1   -1.565

Note that date_range <- c("2017-01-01", "2017-02-14") now ends in the middle of February.
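
A side note on the row filter: INVOICE_DT %between% date_range works although date_range is a character vector, because comparing a Date with a character string coerces the string to a Date. A quick illustration (assuming INVOICE_DT is read in as a date column):

library(data.table)

d <- as.IDate(c("2017-01-15", "2017-02-20"))
d %between% c("2017-01-01", "2017-02-14")
# [1]  TRUE FALSE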

Keep most recent observations when there are duplicates in R

Using the tidyverse:

library(dplyr)

df %>%
  group_by(var1, level) %>%
  filter(date == max(date)) %>%
  ungroup()
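
Since dplyr 1.0.0 the same can be written with slice_max(); with_ties = FALSE guarantees exactly one row per group, while the filter() version keeps all rows tied for the maximum date:

df %>%
  group_by(var1, level) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()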

Mark overlapping time periods for duplicate rows

Using the tidyverse, this gives you a column marking overlapping rows (the output below shows only the last three original columns plus the new overlap column).

Some remarks:

  • This only works for 2 duplicates (2 rows) because it is not defined what should happen when one row overlaps and another does not (3+ rows); a possible generalization is sketched after the output below.
  • Rows without a complete range (only one date) are not defined either; right now they return NA.
  • It is not clear how the non-overlapping rows should be "combined".
library(dplyr)
library(tidyr)

df %>%
  group_by(Dup_N) %>%
  mutate(overlap = (lead(inschrijf_first) > inschrijf_first &
                    lead(inschrijf_first) < inschrijf_last) |
                   (lead(inschrijf_last) > inschrijf_first &
                    lead(inschrijf_last) < inschrijf_last)) %>%
  fill(overlap, .direction = "downup") %>%
  ungroup() %>%
  print(width = 90)
# A tibble: 20 × 10
   ... inschrijf_first inschrijf_last Dup_N overlap
   ... <date>          <date>         <int> <lgl>
 1 ... 2013-10-01      2019-04-01        19 FALSE
 2 ... 2019-10-01      2020-10-01        19 FALSE
 3 ... 2019-10-01      2020-01-01       501 FALSE
 4 ... 2020-07-01      2020-10-01       501 FALSE
 5 ... 2014-07-01      2020-07-01       557 NA
 6 ... 2020-10-01      NA               557 NA
 7 ... 2014-01-01      2020-10-01       661 FALSE
 8 ... 2013-01-01      2013-07-01       661 FALSE
 9 ... 2006-01-01      2015-01-01       975 FALSE
10 ... 2018-07-01      2020-10-01       975 FALSE
11 ... 2015-07-01      2016-01-01      1067 FALSE
12 ... 2016-10-01      2020-10-01      1067 FALSE
13 ... 2014-04-01      NA              1079 NA
14 ... 2014-07-01      2016-01-01      1079 NA
15 ... 2006-01-01      2016-04-01      1195 TRUE
16 ... 2012-07-01      2016-01-01      1195 TRUE
17 ... 2014-01-01      2015-01-01      1352 NA
18 ... 2020-10-01      NA              1352 NA
19 ... 2014-01-01      2015-10-01      1355 FALSE
20 ... 2016-01-01      2020-10-01      1355 FALSE
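
As a possible generalization to groups of 3 or more rows (a sketch only, and note it treats touching endpoints as an overlap, unlike the strict < and > comparisons above): compare each row pairwise against every other row of its group and flag it if any interval intersects.

library(dplyr)
library(purrr)

df %>%
  group_by(Dup_N) %>%
  mutate(overlap = map_lgl(row_number(), function(i) {
    # row i overlaps if its interval intersects any other interval in the group
    any(inschrijf_first[i] <= inschrijf_last[-i] &
        inschrijf_last[i]  >= inschrijf_first[-i],
        na.rm = TRUE)
  })) %>%
  ungroup()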

Merging incomplete duplicate rows

If we want to sample one row per group after grouping by 'dates' and 'co.name', we can use sample() inside slice():

library(dplyr)

df %>%
  group_by(dates, co.name) %>%
  slice(sample(row_number(), 1))

Or with sample_n():

df %>%
  group_by(dates, co.name) %>%
  sample_n(1)
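
Note that sample_n() has been superseded; with dplyr >= 1.0.0 the recommended equivalent is slice_sample():

df %>%
  group_by(dates, co.name) %>%
  slice_sample(n = 1)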

Identify duplicates based on duplicated data and some additional conditions, using `dplyr`

Does this work:

library(dplyr)

df %>%
  group_by(category, name, date, checkup_complete) %>%
  filter(if (all(checkup_complete == 'Y')) row_number() == n() else TRUE)

# A tibble: 9 x 5
# Groups:   category, name, date, checkup_complete [8]
  index name  category date  checkup_complete
  <dbl> <chr> <chr>    <chr> <chr>
1     2 name1 cat1     date1 N
2     3 name1 cat2     date1 Y
3     4 name1 cat1     date1 Y
4     5 name1 cat1     date2 N
5     6 name2 cat1     date1 Y
6     7 name3 cat1     date2 Y
7     9 name3 cat1     date2 N
8    10 name2 cat2     date1 Y
9    11 name1 cat1     date1 N

group_by and filter removes too many rows

You need to use any() to check for the presence of c("B", "C") within each group rather than row by row; see below:

library(dplyr)

df %>%
  group_by(x, y) %>%
  mutate(unique_z = n_distinct(z)) %>%
  filter(unique_z > 1,
         any(z %in% c("B", "C")))
## any(z %in% c("C")) & any(z %in% c("B")))
## use this one instead if you want B and C present at the same time ...
## ... and two B's or two C's are not desired

# # A tibble: 3 x 4
# # Groups:   x, y [1]
#   x        y          z     unique_z
#   <fct>    <fct>      <fct>    <int>
# 1 Person A 2022-02-01 A            3
# 2 Person A 2022-02-01 C            3
# 3 Person A 2022-02-01 B            3
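
To make the difference concrete: %in% is evaluated row by row, while any() collapses the test to a single group-level TRUE/FALSE that filter() recycles to every row of the group. A quick illustration on a toy vector:

z <- c("A", "C", "B")

z %in% c("B", "C")       # row-wise:  FALSE TRUE TRUE  (filter would drop row 1)
any(z %in% c("B", "C"))  # group-wise: TRUE            (filter keeps the whole group)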

Consolidate duplicate rows

This works:

library(plyr)
ddply(df, "x", numcolwise(sum))

In words: (1) split the data frame df by the "x" column; (2) for each chunk, take the sum of each numeric-valued column; (3) stick the results back into a single data frame. (The "dd" in ddply stands for: take a data frame as input, return a data frame.)

Another, possibly clearer, approach:

aggregate(y ~ x, data = df, FUN = sum)
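
On a small made-up data frame (hypothetical, just for illustration), both calls give the same result:

df <- data.frame(x = c("a", "a", "b"), y = c(1, 2, 5))

library(plyr)
ddply(df, "x", numcolwise(sum))
#   x y
# 1 a 3
# 2 b 5

aggregate(y ~ x, data = df, FUN = sum)
#   x y
# 1 a 3
# 2 b 5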

See quick/elegant way to construct mean/variance summary table for a related (slightly more complex) question.

How do I duplicate and add rows between the values of two different columns?

I think tidyr::expand() and full_seq() can achieve what you want, with grouping on stateabb and styear since you have multiple start years for some states.

Assuming your data frame is named mydata, something like this. I have retained the column of expanded years and named it filled_year, but you may want to remove it.

library(dplyr)
library(tidyr)

new_data <- mydata %>%
  group_by(stateabb, styear) %>%
  tidyr::expand(stateabb, full_seq(c(styear, endyear), 1)) %>%
  inner_join(mydata) %>%
  rename(filled_year = `full_seq(c(styear, endyear), 1)`) %>%
  ungroup()
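
The workhorse here is tidyr::full_seq(), which fills in the complete sequence between the smallest and largest values of its input; for example:

tidyr::full_seq(c(1898L, 1901L), period = 1)
# [1] 1898 1899 1900 1901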

The top and bottom of the USA rows:

new_data %>%
  filter(stateabb == "USA") %>%
  head()

# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             1898     2       8    13    2016       12     31    2016
2   1898 USA             1899     2       8    13    2016       12     31    2016
3   1898 USA             1900     2       8    13    2016       12     31    2016
4   1898 USA             1901     2       8    13    2016       12     31    2016
5   1898 USA             1902     2       8    13    2016       12     31    2016
6   1898 USA             1903     2       8    13    2016       12     31    2016

new_data %>%
  filter(stateabb == "USA") %>%
  tail()

# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             2011     2       8    13    2016       12     31    2016
2   1898 USA             2012     2       8    13    2016       12     31    2016
3   1898 USA             2013     2       8    13    2016       12     31    2016
4   1898 USA             2014     2       8    13    2016       12     31    2016
5   1898 USA             2015     2       8    13    2016       12     31    2016
6   1898 USA             2016     2       8    13    2016       12     31    2016

Your example data:

mydata <- structure(list(stateabb = c("USA", "UKG", "FRN", "FRN", "GMY", 
"GMY", "GMY", "AUH", "ITA", "RUS", "RUS", "CHN", "JPN", "JPN"
), ccode = c(2L, 200L, 220L, 220L, 255L, 255L, 255L, 300L, 325L,
365L, 365L, 710L, 740L, 740L), styear = c(1898L, 1816L, 1816L,
1945L, 1816L, 1925L, 1991L, 1816L, 1860L, 1816L, 1922L, 1950L,
1895L, 1991L), stmonth = c(8L, 1L, 1L, 8L, 1L, 1L, 12L, 1L, 1L,
1L, 1L, 1L, 4L, 12L), stday = c(13L, 1L, 1L, 15L, 1L, 1L, 11L,
1L, 1L, 1L, 1L, 1L, 1L, 11L), endyear = c(2016L, 2016L, 1940L,
2016L, 1918L, 1945L, 2016L, 1918L, 1943L, 1917L, 2016L, 2016L,
1945L, 2016L), endmonth = c(12L, 12L, 6L, 12L, 11L, 5L, 12L,
11L, 9L, 12L, 12L, 12L, 8L, 12L), endday = c(31L, 31L, 22L, 31L,
11L, 7L, 31L, 3L, 2L, 5L, 31L, 31L, 14L, 31L), version = c(2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"))

R: Consolidating duplicate observations?

Using base R:

df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
                day = c("1","2","3","4","5","6","7","8"),
                year = c(2016,2017,2017,2016,2017,2016,2017,2016),
                stringsAsFactors = F)

> df
  ID day year
1  a   1 2016
2  a   2 2017
3  b   3 2017
4  b   4 2016
5  b   5 2017
6  c   6 2016
7  d   7 2017
8  d   8 2016

Do:

z = aggregate(df[, 2:3],
              by = list(id = df$ID),
              function(x){ paste0(x, collapse = "/") })

Result:

> z
  id   day           year
1  a   1/2      2016/2017
2  b 3/4/5 2017/2016/2017
3  c     6           2016
4  d   7/8      2017/2016

EDIT

If you want to avoid "collapsing" NA, do:

z = aggregate(df[, 2:3],
              by = list(id = df$ID),
              function(x){ paste0(x[!is.na(x)], collapse = "/") })

For a data frame like:

> df
  ID  day year
1  a    1 2016
2  a    2   NA
3  b    3 2017
4  b    4 2016
5  b <NA> 2017
6  c    6 2016
7  d    7 2017
8  d    8 2016

The result is:

> z
  id day           year
1  a 1/2           2016
2  b 3/4 2017/2016/2017
3  c   6           2016
4  d 7/8      2017/2016


Related Topics



Leave a reply



Submit