Row Not Consolidating Duplicates in R When Using Multiple Months in Date Filter


The OP is wondering why the result is not consolidated by CUST_ID when more than one month of data is processed.

The reason is that the monthly files are read in and aggregated one by one, but a final aggregation step is needed to consolidate over all months.
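
To see why, here is a minimal sketch (toy data, not the OP's files): stacking the per-file aggregates with rbindlist() still leaves one row per CUST_ID per file until they are aggregated once more.

library(data.table)

# toy stand-ins for two monthly files, each already aggregated per file
jan <- data.table(CUST_ID = c("A", "B"), `Ext Sale` = c(10, 20))
feb <- data.table(CUST_ID = c("A", "B"), `Ext Sale` = c(5, 15))

stacked <- rbindlist(list(jan, feb))
stacked                                 # CUST_ID still appears once per month
#    CUST_ID Ext Sale
# 1:       A       10
# 2:       B       20
# 3:       A        5
# 4:       B       15

# the final aggregation consolidates over all months
stacked[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
#    CUST_ID Ext Sale
# 1:       A       15
# 2:       B       35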

The code below is a simplified replacement for the double for loops. I have left out the code for the "fast filter" test.

The first part creates a list of files to be processed. The second part does the processing.

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(in_filenames, function(fn) {
    # code for "fast filter" test goes here
    message("Processing file: ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate within the file
    temptable[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
  })
)[
  # THIS IS THE MISSING STEP:
  # second aggregation for overall totals
  , lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
Processing file: Raw Data/AA-CA/AA-CA 2017-01.txt
Processing file: Raw Data/AA-CA/AA-CA 2017-02.txt
Processing file: Raw Data/CB-HZ/CB-HZ 2017-01.txt
Processing file: Raw Data/CB-HZ/CB-HZ 2017-02.txt

mastertable
     CUST_ID Ext Sale
1: AK0010001  427.803
2: CO0020001 1540.300
3: CO0010001 -179.765

Note that chaining of data.table expressions is used here.
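
For readers unfamiliar with data.table chaining: the second [] is applied directly to the result of the first, so no intermediate object is needed. A tiny self-contained example:

library(data.table)

DT <- data.table(g = c("a", "a", "b"), v = 1:3)

# two statements with an intermediate object ...
tmp <- DT[v > 1]
tmp[, sum(v), by = g]

# ... or the same as one chained expression
DT[v > 1][, sum(v), by = g]
#    g V1
# 1: a  2
# 2: b  3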



Edit 1:

By request of the OP, here is the complete code (except for the "fast filtering" stuff). Some additional lines were modified; they are marked with ### MODIFIED.

library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "Raw Data" ### MODIFIED

# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")

# Set location for output
out_directory <- "YTD Master" ### MODIFIED
out_filename <- "OUTPUT.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2017-01-01"), as.Date("2017-02-28")) ### MODIFIED

# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
date_filter <- TRUE

##########
## CODE ##
##########

starttime <- Sys.time()

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(in_filenames, function(fn) {
    # code for fast filter test goes here
    message("Processing file: ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate by month
    temptable[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
  })
)[
  # second aggregation to consolidate over all months
  , lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]

# Save final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable) ### MODIFIED

print(Sys.time() - starttime)


Edit 2:

The OP has asked to include the "fast filter" code which I had omitted for brevity.

However, I have a different approach. Instead of reading the first line of each file to check if INVOICE_DT is within the given date_range, my approach filters the file names. The file names contain the year-month in ISO 8601 format.

So, a vector of allowed year-month strings is constructed from the given date_range. Only those file names which contain one of the allowed year-month strings are selected for further processing.
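
For example, with date_range <- c("2017-01-01", "2017-02-14") as set below, the allowed year-month strings are built like this:

library(magrittr)

seq(as.Date("2017-01-01"), as.Date("2017-02-14"), by = "1 month") %>%
  format("%Y-%m")
# [1] "2017-01" "2017-02"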

However, selecting the proper files is only the first step. As the date range may start or end right in the middle of a month, we also need to filter the rows of each processed file. This step is missing from the OP's code.

library(data.table, warn.conflicts = FALSE)
library(magrittr) ### MODIFIED
# library(lubridate, warn.conflicts = FALSE) ### MODIFIED

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "Raw Data" ### MODIFIED

# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")

# Set location for output
out_directory <- "YTD Master" ### MODIFIED
out_filename <- "OUTPUT.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- c("2017-01-01", "2017-02-14") ### MODIFIED

# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
# date_filter <- TRUE ### MODIFIED

##########
## CODE ##
##########

starttime <- Sys.time()

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# filter the filenames only: keep files whose name contains an allowed year-month
selected_in_filenames <-
  seq(as.Date(date_range[1]),
      as.Date(date_range[2]), by = "1 month") %>%
  format("%Y-%m") %>%
  lapply(function(x) stringr::str_subset(in_filenames, x)) %>%
  unlist()

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(selected_in_filenames, function(fn) {
    message("Processing file: ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate file but filtered for date_range
    temptable[INVOICE_DT %between% date_range,
              lapply(.SD, sum), by = .(CUST_ID, QTR = quarter(INVOICE_DT)),
              .SDcols = c("Ext Sale")]
  })
)[
  # second aggregation to consolidate over all files
  , lapply(.SD, sum), by = .(CUST_ID, QTR), .SDcols = c("Ext Sale")]

# Save final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable) ### MODIFIED

print(Sys.time() - starttime)

mastertable
     CUST_ID QTR Ext Sale
1: AK0010001   1  209.970
2: CO0020001   1 1540.300
3: CO0010001   1   -1.565

Note that date_range <- c("2017-01-01", "2017-02-14") now ends in the middle of February.
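
A side note on the row filter: INVOICE_DT %between% date_range works although date_range is a character vector, because comparing a Date with a character string coerces the string to a Date. A quick illustration (assuming INVOICE_DT is read in as a date column):

library(data.table)

d <- as.IDate(c("2017-01-15", "2017-02-20"))
d %between% c("2017-01-01", "2017-02-14")
# [1]  TRUE FALSE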

Keep most recent observations when there are duplicates in R

Using the tidyverse:

library(dplyr)

df %>%
  group_by(var1, level) %>%
  filter(date == max(date)) %>%
  ungroup()
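
Since dplyr 1.0.0 the same can be written with slice_max(); with_ties = FALSE guarantees exactly one row per group, while the filter() version keeps all rows tied for the maximum date:

df %>%
  group_by(var1, level) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()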

Mark overlapping time periods for duplicate rows

Using the tidyverse, this gives you a column marking overlapping rows (the output below shows only the last three original columns plus the new overlap column).

Some remarks:

  • This only works for 2 duplicates (2 rows) because it is not defined what should happen when one row overlaps and another does not (3+ rows); a possible generalization is sketched after the output below.
  • Rows without a complete range (only one date) are not defined either; right now they return NA.
  • It is not clear how the non-overlapping rows should be "combined".
library(dplyr)
library(tidyr)

df %>%
  group_by(Dup_N) %>%
  mutate(overlap = (lead(inschrijf_first) > inschrijf_first &
                    lead(inschrijf_first) < inschrijf_last) |
                   (lead(inschrijf_last) > inschrijf_first &
                    lead(inschrijf_last) < inschrijf_last)) %>%
  fill(overlap, .direction = "downup") %>%
  ungroup() %>%
  print(width = 90)
# A tibble: 20 × 10
   ... inschrijf_first inschrijf_last Dup_N overlap
   ... <date>          <date>         <int> <lgl>
 1 ... 2013-10-01      2019-04-01        19 FALSE
 2 ... 2019-10-01      2020-10-01        19 FALSE
 3 ... 2019-10-01      2020-01-01       501 FALSE
 4 ... 2020-07-01      2020-10-01       501 FALSE
 5 ... 2014-07-01      2020-07-01       557 NA
 6 ... 2020-10-01      NA               557 NA
 7 ... 2014-01-01      2020-10-01       661 FALSE
 8 ... 2013-01-01      2013-07-01       661 FALSE
 9 ... 2006-01-01      2015-01-01       975 FALSE
10 ... 2018-07-01      2020-10-01       975 FALSE
11 ... 2015-07-01      2016-01-01      1067 FALSE
12 ... 2016-10-01      2020-10-01      1067 FALSE
13 ... 2014-04-01      NA              1079 NA
14 ... 2014-07-01      2016-01-01      1079 NA
15 ... 2006-01-01      2016-04-01      1195 TRUE
16 ... 2012-07-01      2016-01-01      1195 TRUE
17 ... 2014-01-01      2015-01-01      1352 NA
18 ... 2020-10-01      NA              1352 NA
19 ... 2014-01-01      2015-10-01      1355 FALSE
20 ... 2016-01-01      2020-10-01      1355 FALSE
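
As a possible generalization to groups of 3 or more rows (a sketch only, and note it treats touching endpoints as an overlap, unlike the strict < and > comparisons above): compare each row pairwise against every other row of its group and flag it if any interval intersects.

library(dplyr)
library(purrr)

df %>%
  group_by(Dup_N) %>%
  mutate(overlap = map_lgl(row_number(), function(i) {
    # row i overlaps if its interval intersects any other interval in the group
    any(inschrijf_first[i] <= inschrijf_last[-i] &
        inschrijf_last[i]  >= inschrijf_first[-i],
        na.rm = TRUE)
  })) %>%
  ungroup()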

Merging incomplete duplicate rows

If we want to sample one row per group after grouping by 'dates' and 'co.name', we can use sample() inside slice():

library(dplyr)

df %>%
  group_by(dates, co.name) %>%
  slice(sample(row_number(), 1))

Or with sample_n():

df %>%
  group_by(dates, co.name) %>%
  sample_n(1)
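
Note that sample_n() has been superseded; with dplyr >= 1.0.0 the recommended equivalent is slice_sample():

df %>%
  group_by(dates, co.name) %>%
  slice_sample(n = 1)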

Identify duplicates based on duplicated data and some additional conditions, using `dplyr`

Does this work:

library(dplyr)

df %>%
  group_by(category, name, date, checkup_complete) %>%
  filter(if (all(checkup_complete == 'Y')) row_number() == n() else TRUE)

# A tibble: 9 x 5
# Groups:   category, name, date, checkup_complete [8]
  index name  category date  checkup_complete
  <dbl> <chr> <chr>    <chr> <chr>
1     2 name1 cat1     date1 N
2     3 name1 cat2     date1 Y
3     4 name1 cat1     date1 Y
4     5 name1 cat1     date2 N
5     6 name2 cat1     date1 Y
6     7 name3 cat1     date2 Y
7     9 name3 cat1     date2 N
8    10 name2 cat2     date1 Y
9    11 name1 cat1     date1 N

group_by and filter removes too many rows

You need to use any() to check for the presence of c("B", "C") within each group rather than row by row; see below:

library(dplyr)

df %>%
  group_by(x, y) %>%
  mutate(unique_z = n_distinct(z)) %>%
  filter(unique_z > 1,
         any(z %in% c("B", "C")))
## any(z %in% c("C")) & any(z %in% c("B")))
## use this one instead if you want B and C present at the same time ...
## ... and two B's or two C's are not desired

# # A tibble: 3 x 4
# # Groups:   x, y [1]
#   x        y          z     unique_z
#   <fct>    <fct>      <fct>    <int>
# 1 Person A 2022-02-01 A            3
# 2 Person A 2022-02-01 C            3
# 3 Person A 2022-02-01 B            3
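
To make the difference concrete: %in% is evaluated row by row, while any() collapses the test to a single group-level TRUE/FALSE that filter() recycles to every row of the group. A quick illustration on a toy vector:

z <- c("A", "C", "B")

z %in% c("B", "C")       # row-wise:  FALSE TRUE TRUE  (filter would drop row 1)
any(z %in% c("B", "C"))  # group-wise: TRUE            (filter keeps the whole group)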

Consolidate duplicate rows

This works:

library(plyr)
ddply(df, "x", numcolwise(sum))

In words: (1) split the data frame df by the "x" column; (2) for each chunk, take the sum of each numeric-valued column; (3) stick the results back into a single data frame. (The "dd" in ddply stands for: take a data frame as input, return a data frame.)

Another, possibly clearer, approach:

aggregate(y ~ x, data = df, FUN = sum)
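
On a small made-up data frame (hypothetical, just for illustration), both calls give the same result:

df <- data.frame(x = c("a", "a", "b"), y = c(1, 2, 5))

library(plyr)
ddply(df, "x", numcolwise(sum))
#   x y
# 1 a 3
# 2 b 5

aggregate(y ~ x, data = df, FUN = sum)
#   x y
# 1 a 3
# 2 b 5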

See quick/elegant way to construct mean/variance summary table for a related (slightly more complex) question.

How do I duplicate and add rows between the values of two different columns?

I think tidyr::expand() and full_seq() can achieve what you want, with grouping on stateabb and styear since you have multiple start years for some states.

Assuming your data frame is named mydata, something like this. I have retained the column of expanded years and named it filled_year, but you may want to remove it.

library(dplyr)
library(tidyr)

new_data <- mydata %>%
  group_by(stateabb, styear) %>%
  tidyr::expand(stateabb, full_seq(c(styear, endyear), 1)) %>%
  inner_join(mydata) %>%
  rename(filled_year = `full_seq(c(styear, endyear), 1)`) %>%
  ungroup()
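
The workhorse here is tidyr::full_seq(), which fills in the complete sequence between the smallest and largest values of its input; for example:

tidyr::full_seq(c(1898L, 1901L), period = 1)
# [1] 1898 1899 1900 1901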

The top and bottom of the USA rows:

new_data %>%
  filter(stateabb == "USA") %>%
  head()

# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             1898     2       8    13    2016       12     31    2016
2   1898 USA             1899     2       8    13    2016       12     31    2016
3   1898 USA             1900     2       8    13    2016       12     31    2016
4   1898 USA             1901     2       8    13    2016       12     31    2016
5   1898 USA             1902     2       8    13    2016       12     31    2016
6   1898 USA             1903     2       8    13    2016       12     31    2016

new_data %>%
  filter(stateabb == "USA") %>%
  tail()

# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             2011     2       8    13    2016       12     31    2016
2   1898 USA             2012     2       8    13    2016       12     31    2016
3   1898 USA             2013     2       8    13    2016       12     31    2016
4   1898 USA             2014     2       8    13    2016       12     31    2016
5   1898 USA             2015     2       8    13    2016       12     31    2016
6   1898 USA             2016     2       8    13    2016       12     31    2016

Your example data:

mydata <- structure(list(stateabb = c("USA", "UKG", "FRN", "FRN", "GMY", 
"GMY", "GMY", "AUH", "ITA", "RUS", "RUS", "CHN", "JPN", "JPN"
), ccode = c(2L, 200L, 220L, 220L, 255L, 255L, 255L, 300L, 325L,
365L, 365L, 710L, 740L, 740L), styear = c(1898L, 1816L, 1816L,
1945L, 1816L, 1925L, 1991L, 1816L, 1860L, 1816L, 1922L, 1950L,
1895L, 1991L), stmonth = c(8L, 1L, 1L, 8L, 1L, 1L, 12L, 1L, 1L,
1L, 1L, 1L, 4L, 12L), stday = c(13L, 1L, 1L, 15L, 1L, 1L, 11L,
1L, 1L, 1L, 1L, 1L, 1L, 11L), endyear = c(2016L, 2016L, 1940L,
2016L, 1918L, 1945L, 2016L, 1918L, 1943L, 1917L, 2016L, 2016L,
1945L, 2016L), endmonth = c(12L, 12L, 6L, 12L, 11L, 5L, 12L,
11L, 9L, 12L, 12L, 12L, 8L, 12L), endday = c(31L, 31L, 22L, 31L,
11L, 7L, 31L, 3L, 2L, 5L, 31L, 31L, 14L, 31L), version = c(2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"))

R: Consolidating duplicate observations?

Using base R:

df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
                day = c("1","2","3","4","5","6","7","8"),
                year = c(2016,2017,2017,2016,2017,2016,2017,2016),
                stringsAsFactors = F)

> df
  ID day year
1  a   1 2016
2  a   2 2017
3  b   3 2017
4  b   4 2016
5  b   5 2017
6  c   6 2016
7  d   7 2017
8  d   8 2016

Do:

z = aggregate(df[, 2:3],
              by = list(id = df$ID),
              function(x){ paste0(x, collapse = "/") })

Result:

> z
  id   day           year
1  a   1/2      2016/2017
2  b 3/4/5 2017/2016/2017
3  c     6           2016
4  d   7/8      2017/2016

EDIT

If you want to avoid "collapsing" NA, do:

z = aggregate(df[, 2:3],
              by = list(id = df$ID),
              function(x){ paste0(x[!is.na(x)], collapse = "/") })

For a data frame like:

> df
  ID  day year
1  a    1 2016
2  a    2   NA
3  b    3 2017
4  b    4 2016
5  b <NA> 2017
6  c    6 2016
7  d    7 2017
8  d    8 2016

The result is:

> z
  id day           year
1  a 1/2           2016
2  b 3/4 2017/2016/2017
3  c   6           2016
4  d 7/8      2017/2016


Related Topics



Leave a reply



Submit