row not consolidating duplicates in R when using multiple months in Date Filter
The OP is wondering why the result is not consolidated by CUST_ID when more than one month of data is processed.
The reason is that the monthly files are read in and aggregated one by one, but a final aggregation step is needed to consolidate over all months.
The code below is a simplified replacement of the double for loops; I have left out the code testing for "fast filtering".
The first part creates a vector of filenames to be processed. The second part does the processing.
# create vector of filenames to be processed
in_filenames <- list.files(
file.path(in_directory, in_subfolders),
pattern = "\\.txt$",
full.names = TRUE,
recursive = TRUE)
# read and aggregate each file separately
mastertable <- rbindlist(
lapply(in_filenames, function(fn) {
# code for "fast filter" test goes here
message("Processing file: ", fn)
temptable <- fread(fn,
colClasses = c(CUSTOMER_TIER = "character"),
na.strings = "")
# aggregate
temptable[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
})
)[
# THIS IS THE MISSING STEP:
# second aggregation for overall totals
, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
Processing file: Raw Data/AA-CA/AA-CA 2017-01.txt
Processing file: Raw Data/AA-CA/AA-CA 2017-02.txt
Processing file: Raw Data/CB-HZ/CB-HZ 2017-01.txt
Processing file: Raw Data/CB-HZ/CB-HZ 2017-02.txt
mastertable
CUST_ID Ext Sale
1: AK0010001 427.803
2: CO0020001 1540.300
3: CO0010001 -179.765
Note that chaining of data.table expressions is used here.
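As a minimal illustration of such chaining (with made-up data standing in for the row-bound monthly aggregates, not the OP's files), the result of one `[...]` expression can be fed straight into a second `[...]`:

```r
library(data.table)

# toy data: two "monthly" rows for customer A, one for customer B
dt <- data.table(CUST_ID = c("A", "B", "A"), `Ext Sale` = c(1.5, 2.0, 3.5))

# first [] aggregates per customer; the chained second [] orders the result
dt[, .(`Ext Sale` = sum(`Ext Sale`)), by = CUST_ID][order(CUST_ID)]
```

The second bracket operates on the data.table returned by the first, which is exactly how the final consolidation step above is appended to rbindlist().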
Edit 1:
By request of the OP, here is the complete code (except for the "fast filtering" part). Some additional lines were modified; they are marked with ### MODIFIED.
library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
################
## PARAMETERS ##
################
# Set path of major source folder for raw transaction data
in_directory <- "Raw Data" ### MODIFIED
# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")
# Set location for output
out_directory <- "YTD Master" ### MODIFIED
out_filename <- "OUTPUT.csv"
# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2017-01-01"), as.Date("2017-02-28")) ### MODIFIED
# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
date_filter <- TRUE
##########
## CODE ##
##########
starttime <- Sys.time()
# create vector of filenames to be processed
in_filenames <- list.files(
file.path(in_directory, in_subfolders),
pattern = "\\.txt$",
full.names = TRUE,
recursive = TRUE)
# read and aggregate each file separately
mastertable <- rbindlist(
lapply(in_filenames, function(fn) {
# code for fast filter test goes here
message("Processing file: ", fn)
temptable <- fread(fn,
colClasses = c(CUSTOMER_TIER = "character"),
na.strings = "")
# aggregate by month
temptable[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
})
)[
# second aggregation overall
, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
# Save Final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable) ### MODIFIED
print(Sys.time()-starttime)
Edit 2
The OP has asked to include the "fast filter" code which I had omitted for brevity.
However, I have a different approach. Instead of reading the first line of each file to check whether INVOICE_DT is within the given date_range, my approach filters the file names. The file names contain the year-month in ISO 8601 format, so a vector of allowed year-month strings is constructed from the given date_range. Only those file names which contain one of the allowed year-month strings are selected for further processing.
However, selecting the proper files is only the first step. As the date_range may start or end right in the middle of a month, we also need to filter the rows of each processed file. This step is missing from the OP's code.
library(data.table, warn.conflicts = FALSE)
library(magrittr) ### MODIFIED
# library(lubridate, warn.conflicts = FALSE) ### MODIFIED
################
## PARAMETERS ##
################
# Set path of major source folder for raw transaction data
in_directory <- "Raw Data" ### MODIFIED
# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")
# Set location for output
out_directory <- "YTD Master" ### MODIFIED
out_filename <- "OUTPUT.csv"
# Set beginning and end of date range to be collected - year-month-day format
date_range <- c("2017-01-01", "2017-02-14") ### MODIFIED
# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
# date_filter <- TRUE ### MODIFIED
##########
## CODE ##
##########
starttime <- Sys.time()
# create vector of filenames to be processed
in_filenames <- list.files(
file.path(in_directory, in_subfolders),
pattern = "\\.txt$",
full.names = TRUE,
recursive = TRUE)
# filter filenames: keep only those matching an allowed year-month
selected_in_filenames <-
seq(as.Date(date_range[1]),
as.Date(date_range[2]), by = "1 month") %>%
format("%Y-%m") %>%
lapply(function(x) stringr::str_subset(in_filenames, x)) %>%
unlist()
# read and aggregate each file separately
mastertable <- rbindlist(
lapply(selected_in_filenames, function(fn) {
message("Processing file: ", fn)
temptable <- fread(fn,
colClasses = c(CUSTOMER_TIER = "character"),
na.strings = "")
# aggregate file but filtered for date_range
temptable[INVOICE_DT %between% date_range,
lapply(.SD, sum), by = .(CUST_ID, QTR = quarter(INVOICE_DT)),
.SDcols = c("Ext Sale")]
})
)[
# second aggregation overall
, lapply(.SD, sum), by = .(CUST_ID, QTR), .SDcols = c("Ext Sale")]
# Save Final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable) ### MODIFIED
print(Sys.time()-starttime)
mastertable
CUST_ID QTR Ext Sale
1: AK0010001 1 209.970
2: CO0020001 1 1540.300
3: CO0010001 1 -1.565
Note that date_range <- c("2017-01-01", "2017-02-14") now ends in mid-February.
keep most recent observations when there are duplicates in R
Using tidyverse
df %>%
group_by(var1, level) %>%
filter(date == max(date)) %>%
ungroup()
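A minimal sketch of how this behaves (column names var1, level, date are from the question; the data here is invented):

```r
library(dplyr)

df <- data.frame(
  var1  = c("a", "a", "b"),
  level = c(1, 1, 2),
  date  = as.Date(c("2020-01-01", "2020-06-01", "2019-03-01"))
)

# keep, per (var1, level) group, only the row(s) with the latest date
df %>%
  group_by(var1, level) %>%
  filter(date == max(date)) %>%
  ungroup()
```

The earlier "a"/1 observation from 2020-01-01 is dropped; each group retains its most recent date (ties would keep all tied rows).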
mark overlapping time periods for duplicate rows
Using tidyverse, this gives you a column marking the overlapping rows (only the last 3 columns are shown).
Some remarks:
- This only works for 2 duplicates (2 rows), because it's not defined what to do when one row overlaps and another doesn't (3+ rows).
- Not having a range (only one date) is not defined; right now it returns NA.
- It's not clear how to "combine" the non-overlapping rows.
library(dplyr)
library(tidyr)
df %>%
group_by(Dup_N) %>%
mutate( overlap=(lead(inschrijf_first)>inschrijf_first &
lead(inschrijf_first)<inschrijf_last)|
(lead(inschrijf_last)>inschrijf_first &
lead(inschrijf_last)<inschrijf_last) ) %>%
fill(overlap, .direction="downup") %>%
ungroup() %>%
print(width=90)
# A tibble: 20 × 10
... inschrijf_first inschrijf_last Dup_N overlap
... <date> <date> <int> <lgl>
... 1 2013-10-01 2019-04-01 19 FALSE
... 2 2019-10-01 2020-10-01 19 FALSE
... 3 2019-10-01 2020-01-01 501 FALSE
... 4 2020-07-01 2020-10-01 501 FALSE
... 5 2014-07-01 2020-07-01 557 NA
... 6 2020-10-01 NA 557 NA
... 7 2014-01-01 2020-10-01 661 FALSE
... 8 2013-01-01 2013-07-01 661 FALSE
... 9 2006-01-01 2015-01-01 975 FALSE
... 10 2018-07-01 2020-10-01 975 FALSE
... 11 2015-07-01 2016-01-01 1067 FALSE
... 12 2016-10-01 2020-10-01 1067 FALSE
... 13 2014-04-01 NA 1079 NA
... 14 2014-07-01 2016-01-01 1079 NA
... 15 2006-01-01 2016-04-01 1195 TRUE
... 16 2012-07-01 2016-01-01 1195 TRUE
... 17 2014-01-01 2015-01-01 1352 NA
... 18 2020-10-01 NA 1352 NA
... 19 2014-01-01 2015-10-01 1355 FALSE
... 20 2016-01-01 2020-10-01 1355 FALSE
merging incomplete duplicate rows
If we want to sample a row after grouping by 'dates' and 'co.name', we can use sample inside slice:
library(dplyr)
df %>%
group_by(dates, co.name) %>%
slice(sample(row_number(), 1))
Or with sample_n
df %>%
group_by(dates, co.name) %>%
sample_n(1)
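A quick sketch of the slice() variant on made-up data (the column names dates and co.name follow the question; the values are invented):

```r
library(dplyr)

df <- data.frame(
  dates   = c("2021-01-01", "2021-01-01", "2021-01-02"),
  co.name = c("acme", "acme", "globex"),
  price   = c(10, 12, 7)
)

set.seed(42)  # make the random draw reproducible

# one randomly chosen row per (dates, co.name) group
df %>%
  group_by(dates, co.name) %>%
  slice(sample(row_number(), 1)) %>%
  ungroup()
```

Each group contributes exactly one row, so the two duplicated "acme" rows collapse to a single randomly chosen survivor.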
Identify duplicates based on duplicated data and some additional conditions, using `dplyr`
Does this work:
df %>% group_by(category, name, date, checkup_complete) %>% filter(if(all(checkup_complete == 'Y')) row_number() == n() else TRUE)
# A tibble: 9 x 5
# Groups: category, name, date, checkup_complete [8]
index name category date checkup_complete
<dbl> <chr> <chr> <chr> <chr>
1 2 name1 cat1 date1 N
2 3 name1 cat2 date1 Y
3 4 name1 cat1 date1 Y
4 5 name1 cat1 date2 N
5 6 name2 cat1 date1 Y
6 7 name3 cat1 date2 Y
7 9 name3 cat1 date2 N
8 10 name2 cat2 date1 Y
9 11 name1 cat1 date1 N
group_by and filter removes too many rows
You need to use any to check for the presence of c("B", "C") within each group rather than at each row; see below:
library(dplyr)
df %>%
group_by(x,y) %>%
mutate(unique_z = n_distinct(z)) %>%
filter(unique_z > 1,
any(z %in% c("B","C")))
## any(z %in% c("C")) & any(z %in% c("B")))
## use this one instead if you want B and C present at the same time ...
## ... and two B's or two C's are not desired
# # A tibble: 3 x 4
# # Groups: x, y [1]
# x y z unique_z
# <fct> <fct> <fct> <int>
# 1 Person A 2022-02-01 A 3
# 2 Person A 2022-02-01 C 3
# 3 Person A 2022-02-01 B 3
Consolidate duplicate rows
This works:
library(plyr)
ddply(df,"x",numcolwise(sum))
in words: (1) split the data frame df by the "x" column; (2) for each chunk, take the sum of each numeric-valued column; (3) stick the results back into a single data frame. (The dd in ddply stands for "take a data frame as input, return a data frame".)
Another, possibly clearer, approach:
aggregate(y~x,data=df,FUN=sum)
See quick/elegant way to construct mean/variance summary table for a related (slightly more complex) question.
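For instance, on a small made-up data frame the aggregate() formula interface collapses the duplicated x values by summing y:

```r
# toy data: duplicated keys in x, numeric values in y
df <- data.frame(x = c("a", "a", "b", "b", "b"),
                 y = c(1, 2, 5, 5, 5))

# one row per distinct x, with y summed within each group
aggregate(y ~ x, data = df, FUN = sum)
#   x  y
# 1 a  3
# 2 b 15
```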
How do I duplicate and add rows between the values of two different columns?
I think tidyr::expand()
and full_seq()
can achieve what you want, with grouping on stateabb
and styear
since you have multiple start years for some states.
Assuming your data frame is named mydata
, something like this. I have retained the column of expanded years and named it filled_year
, but you may want to remove it.
library(dplyr)
library(tidyr)
new_data <- mydata %>%
group_by(stateabb, styear) %>%
tidyr::expand(stateabb, full_seq(c(styear, endyear), 1)) %>%
inner_join(mydata) %>%
rename(filled_year = `full_seq(c(styear, endyear), 1)`) %>%
ungroup()
The top and bottom of the USA rows:
new_data %>%
filter(stateabb == "USA") %>%
head()
# A tibble: 6 x 10
styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
<int> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 1898 USA 1898 2 8 13 2016 12 31 2016
2 1898 USA 1899 2 8 13 2016 12 31 2016
3 1898 USA 1900 2 8 13 2016 12 31 2016
4 1898 USA 1901 2 8 13 2016 12 31 2016
5 1898 USA 1902 2 8 13 2016 12 31 2016
6 1898 USA 1903 2 8 13 2016 12 31 2016
new_data %>%
filter(stateabb == "USA") %>%
tail()
# A tibble: 6 x 10
styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
<int> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 1898 USA 2011 2 8 13 2016 12 31 2016
2 1898 USA 2012 2 8 13 2016 12 31 2016
3 1898 USA 2013 2 8 13 2016 12 31 2016
4 1898 USA 2014 2 8 13 2016 12 31 2016
5 1898 USA 2015 2 8 13 2016 12 31 2016
6 1898 USA 2016 2 8 13 2016 12 31 2016
Your example data:
mydata <- structure(list(stateabb = c("USA", "UKG", "FRN", "FRN", "GMY",
"GMY", "GMY", "AUH", "ITA", "RUS", "RUS", "CHN", "JPN", "JPN"
), ccode = c(2L, 200L, 220L, 220L, 255L, 255L, 255L, 300L, 325L,
365L, 365L, 710L, 740L, 740L), styear = c(1898L, 1816L, 1816L,
1945L, 1816L, 1925L, 1991L, 1816L, 1860L, 1816L, 1922L, 1950L,
1895L, 1991L), stmonth = c(8L, 1L, 1L, 8L, 1L, 1L, 12L, 1L, 1L,
1L, 1L, 1L, 4L, 12L), stday = c(13L, 1L, 1L, 15L, 1L, 1L, 11L,
1L, 1L, 1L, 1L, 1L, 1L, 11L), endyear = c(2016L, 2016L, 1940L,
2016L, 1918L, 1945L, 2016L, 1918L, 1943L, 1917L, 2016L, 2016L,
1945L, 2016L), endmonth = c(12L, 12L, 6L, 12L, 11L, 5L, 12L,
11L, 9L, 12L, 12L, 12L, 8L, 12L), endday = c(31L, 31L, 22L, 31L,
11L, 7L, 31L, 3L, 2L, 5L, 31L, 31L, 14L, 31L), version = c(2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"))
R: Consolidating duplicate observations?
Using basic R:
df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
day = c("1","2","3","4","5","6","7","8"),
year = c(2016,2017,2017,2016,2017,2016,2017,2016),
stringsAsFactors = F)
> df
ID day year
1 a 1 2016
2 a 2 2017
3 b 3 2017
4 b 4 2016
5 b 5 2017
6 c 6 2016
7 d 7 2017
8 d 8 2016
Do:
z = aggregate(df[,2:3],
by = list(id = df$ID),
function(x){ paste0(x, collapse = "/") }
)
Result:
> z
id day year
1 a 1/2 2016/2017
2 b 3/4/5 2017/2016/2017
3 c 6 2016
4 d 7/8 2017/2016
EDIT
If you want to avoid "collapsing" NA values, do:
z = aggregate(df[,2:3],
by = list(id = df$ID),
function(x){ paste0(x[!is.na(x)],collapse = "/") })
For a data frame like:
> df
ID day year
1 a 1 2016
2 a 2 NA
3 b 3 2017
4 b 4 2016
5 b <NA> 2017
6 c 6 2016
7 d 7 2017
8 d 8 2016
The result is:
> z
id day year
1 a 1/2 2016
2 b 3/4 2017/2016/2017
3 c 6 2016
4 d 7/8 2017/2016