How to merge and summarise rows based on 2 columns of a dataframe in R
Pivoting the data to long format and then back to wide should do what you want. Try this:
library(dplyr)
library(tidyr)
df2 <- df %>%
  pivot_longer(cols = c(B, R, S)) %>%
  filter(!is.na(value)) %>%
  pivot_wider(names_from = name, values_from = value)
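The same long→wide round trip can be sketched in pandas for comparison. The frame, the `id` key, and the column names `B`/`R`/`S` here are made up for illustration; `melt` plays the role of `pivot_longer`, `dropna` of the `filter`, and `pivot_table` of `pivot_wider`:

```python
import pandas as pd

# Hypothetical input: values for one id scattered across columns B, R, S,
# with NaN where a column does not apply to that row.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "B":  [10.0, None, 30.0],
    "R":  [None, 20.0, None],
    "S":  [None, None, 5.0],
})

merged = (
    df.melt(id_vars="id", value_vars=["B", "R", "S"])    # pivot_longer
      .dropna(subset=["value"])                          # filter(!is.na(value))
      .pivot_table(index="id", columns="variable",       # pivot_wider
                   values="value", aggfunc="first")
      .reset_index()
)
```

After the round trip each `id` keeps one row with its non-missing values collapsed together.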
How do I merge rows of Excel data based on a single column identifier without data loss?
You can do =UNIQUE(C2:C100)
in a separate column and copy and paste those values over themselves (assuming the emails are in C2:C100 and the headers are in row 1).
Then in an adjacent column do something like =CONCAT(FILTER(D$2:D$100, $C$2:$C$100=$H2))
and drag down and to the right (where column H contains the column of unique emails). Finally, copy and paste the values over themselves again and remove the old columns.
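If you would rather script this merge than do it in Excel, the same idea (one row per unique key, text fields joined together) can be sketched in pandas; the `email` and `note` columns are hypothetical stand-ins for the key column and a data column:

```python
import pandas as pd

# Hypothetical data: several rows per email, details spread across rows.
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@y.com"],
    "note":  ["first", "second", "only"],
})

# One row per email; join the non-missing notes, mirroring
# CONCAT(FILTER(...)) evaluated per unique value of the key column.
merged = (
    df.groupby("email", as_index=False)
      .agg({"note": lambda s: "".join(s.dropna())})
)
```

Swapping the lambda for `", ".join(...)` gives a delimited result, closer to TEXTJOIN.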
R: Combine rows with same ID
Something like this: first group by all columns except the Var variables, then use summarise(across(...)), as suggested by @Limey in the comments. The key detail is na.rm = TRUE:
library(dplyr)
df %>%
group_by(ID, Date, N_Date, type) %>%
summarise(across(starts_with("Var"), ~sum(., na.rm = TRUE)))
ID Date N_Date type Var1 Var2 Var3 Var4
<int> <chr> <int> <chr> <int> <int> <int> <int>
1 1 4.7.22 50000 normal 12 23 5 54
2 2 4.7.22 4000 normal 0 2 0 0
3 3 5.7.22 20000 normal 7 0 0 0
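The same group-and-sum collapse can be sketched in pandas. The frame below is a made-up miniature of the example above (two Var columns instead of four); `groupby(...).sum()` skips NaN by default, which matches `sum(., na.rm = TRUE)` in the dplyr version:

```python
import pandas as pd

# Hypothetical frame: duplicate key rows with NaN in the Var columns
# that should collapse into one row per (ID, Date).
df = pd.DataFrame({
    "ID":   [1, 1, 2],
    "Date": ["4.7.22", "4.7.22", "4.7.22"],
    "Var1": [12, None, None],
    "Var2": [None, 23, 2],
})

# sum() ignores NaN; an all-NaN group sums to 0, just like the R output
out = df.groupby(["ID", "Date"], as_index=False)[["Var1", "Var2"]].sum()
```

Note that an all-NaN group comes out as 0 rather than NaN, the same behaviour the R output shows for Var1 of ID 2.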
Merging rows based on multiple conditions
An approach using fill
library(dplyr)
library(tidyr)
df %>%
group_by(region) %>%
fill(q1:q5, .direction="updown") %>%
arrange(enterprise) %>%
summarise(across(q1:q5, ~ .x[1]))
# A tibble: 3 × 6
region q1 q2 q3 q4 q5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Africa 1 NA 0 0 NA
2 Asia 0 1 1 0 NA
3 Europe 0 0 NA 1 0
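The fill-within-group idea translates to pandas as a back-fill followed by a forward-fill inside each group, then keeping one row per group. The `region`/`q1`/`q2` frame below is a made-up miniature of the example above:

```python
import pandas as pd

# Hypothetical survey frame: answers for one region split across rows,
# with NaN where a row did not carry that answer.
df = pd.DataFrame({
    "region": ["Asia", "Asia", "Africa"],
    "q1": [0.0, None, 1.0],
    "q2": [None, 1.0, None],
})

# fill(.direction = "updown") ~ bfill then ffill within each group
filled = df.copy()
filled[["q1", "q2"]] = (
    filled.groupby("region")[["q1", "q2"]]
          .transform(lambda s: s.bfill().ffill())
)
# keep one row per region, like summarise(across(..., ~ .x[1]))
out = filled.groupby("region", as_index=False).first()
```

A question nobody in a region answered stays NaN, matching the NA cells in the tibble above.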
pandas combine rows based on conditions
If I understand the requirements correctly, we can do all of this within pandas. There are essentially two steps here:
- use pandas.merge_asof to fill in the nearest end_time
- use drop_duplicates to remove the out records consumed in step 1
import pandas as pd
from io import StringIO

text = StringIO(
"""
id url type start_time end_time
o6FlbuA_5565423 https://vaa.66new out NaT 2021-08-25T15:23:28
o6FlbuA_5565423 https://vaa.66new out NaT 2021-08-25T15:27:34
o6FlbuA_5565423 https://vaa.66new out NaT 2021-08-25T15:23:52
o6FlbuA_5565423 https://vaa.66new in 2021-08-25T15:23:37 NaT
o6FlbuA_5565423 https://vaa.66new in 2021-08-25T15:43:56 NaT # note: no record with `end_time` after this records `start_time`
o6FlbuA_5565423 https://vaa.66new out NaT 2021-08-25T15:10:29
o6FlbuA_5565423 https://vaa.66new out NaT 2021-08-25T15:25:00
o6FlbuA_5565423 https://vaa.66new out NaT 2021-08-25T15:15:49
o6FlbuA_5565423 https://vaa.66new in 2021-08-25T15:33:37 2021-08-25T15:34:37 # additional already complete record
"""
)
df = pd.read_csv(text, sep=r"\s+", parse_dates=["start_time", "end_time"], comment="#")
# separate out unmatched `in` records and unmatched `out` records
df_in_unmatched = (
df[(df.type == "in") & ~df.start_time.isna() & df.end_time.isna()]
.drop(columns=["end_time"])
.sort_values("start_time")
)
df_out_unmatched = (
df[(df.type == "out") & df.start_time.isna() & ~df.end_time.isna()]
.drop(columns=["type", "start_time"])
.sort_values("end_time")
)
# match `in` records to closest `out` record with `out.end_time` >= `in.start_time`
df_in_matched = pd.merge_asof(
df_in_unmatched,
df_out_unmatched,
by=["id", "url"],
left_on="start_time",
right_on="end_time",
direction="forward",
allow_exact_matches=True,
)
# fill in missing `end_time` for records with only `start_time`
df_in_matched["end_time"] = df_in_matched["end_time"].combine_first(
df_in_matched["start_time"]
)
# combine matched records with remaining unmatched and deduplicate
# in order to remove "used" records
df_matched = (
pd.concat([df_in_matched, df_out_unmatched], ignore_index=True)
.drop_duplicates(subset=["id", "url", "end_time"], keep="first")
.dropna(subset=["end_time"])
.fillna({"type": "out"})
)
# fill in missing `start_time` for records with only `end_time`
df_matched["start_time"] = df_matched["start_time"].combine_first(
df_matched["end_time"]
)
# combine matched records with unprocessed records: i.e. records
# that had both `start_time` and `end_time` (if extant)
df_final = pd.concat(
[df_matched, df.dropna(subset=["start_time", "end_time"])], ignore_index=True
)
Result:
id url type start_time end_time
0 o6FlbuA_5565423 https://vaa.66new in 2021-08-25 15:23:37 2021-08-25 15:23:52
1 o6FlbuA_5565423 https://vaa.66new in 2021-08-25 15:43:56 2021-08-25 15:43:56
2 o6FlbuA_5565423 https://vaa.66new out 2021-08-25 15:10:29 2021-08-25 15:10:29
3 o6FlbuA_5565423 https://vaa.66new out 2021-08-25 15:15:49 2021-08-25 15:15:49
4 o6FlbuA_5565423 https://vaa.66new out 2021-08-25 15:23:28 2021-08-25 15:23:28
5 o6FlbuA_5565423 https://vaa.66new out 2021-08-25 15:25:00 2021-08-25 15:25:00
6 o6FlbuA_5565423 https://vaa.66new out 2021-08-25 15:27:34 2021-08-25 15:27:34
7 o6FlbuA_5565423 https://vaa.66new in 2021-08-25 15:33:37 2021-08-25 15:34:37