Combining Duplicated Rows in R and Adding New Column Containing Ids of Duplicates

A call to aggregate() should do the trick.
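For reference, here is a minimal sketch of what df might look like, reconstructed from the output shown below (the column names come from that output; the row layout is an assumption):

df <- data.frame(
  Chr     = "chr1",
  start   = c(5179574, 5179574, 5179574, 5336806, 5347979),
  stop    = c(5183384, 5184738, 5184738, 5358384, 5358384),
  ref     = "ref",
  alt     = "Del",
  Hom.het = "Het",
  ID      = c(719L, 915L, 951L, 376L, 228L)  # rows 2 and 3 are duplicates apart from ID
)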

Here's an option that collects the IDs in a list:

(df1 <- aggregate(df[7], df[-7], unique))
#    Chr   start    stop ref alt Hom.het       ID
# 1 chr1 5179574 5183384 ref Del     Het      719
# 2 chr1 5179574 5184738 ref Del     Het 915, 951
# 3 chr1 5336806 5358384 ref Del     Het      376
# 4 chr1 5347979 5358384 ref Del     Het      228

And here's one that collects them in a character vector:

df2 <- aggregate(df[7], df[-7],
                 FUN = function(X) paste(unique(X), collapse = ", "))

Comparing the results of the two options:

str(df1$ID)
# List of 4
# $ 0: int 719
# $ 3: int [1:2] 915 951
# $ 7: int 376
# $ 8: int 228

str(df2$ID)
# chr [1:4] "719" "915, 951" "376" "228"

Combine duplicate rows in dataframe and create new columns

We could do this with dcast from data.table, which can take multiple value.var columns. Convert the 'data.frame' to 'data.table' (setDT(dataframe1)), then, grouped by 'Company_Name', replace the 'Company_Phone' elements with the first value in each group, and finally dcast from 'long' to 'wide', specifying 'Employee_Name' and 'Employee_ID' as the value.var columns.
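For illustration, a small dataframe1 inferred from the output below (the values, and the blank phone entries in the duplicated rows, are assumptions):

dataframe1 <- data.frame(
  Company_Name  = c("KFC", "KFC", "KFC", "McD", "McD"),
  Company_ID    = c(1, 1, 1, 2, 2),
  Company_Phone = c("237389", "", "", "237002", ""),  # assumed: only the first row per company has the phone
  Employee_Name = c("John", "Mary", "Jane", "Joshua", "Anne"),
  Employee_ID   = c(1001, 1002, 1003, 2001, 2002)
)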

library(data.table)
setDT(dataframe1)[, Company_Phone := first(Company_Phone), Company_Name]
res <- dcast(dataframe1, Company_Name + Company_ID + Company_Phone ~ rowid(Company_Name),
             value.var = c("Employee_Name", "Employee_ID"), sep = '')

Output

res
#Company_Name Company_ID Company_Phone Employee_Name1 Employee_Name2 Employee_Name3 Employee_ID1 Employee_ID2 Employee_ID3
#1: KFC 1 237389 John Mary Jane 1001 1002 1003
#2: McD 2 237002 Joshua Anne NA 2001 2002 NA

If we need to reorder the columns so that each Employee_Name is followed by its matching Employee_ID:

res[, c(1:3, order(as.numeric(sub("\\D+", "", names(res)[-(1:3)]))) + 3), with = FALSE]
# Company_Name Company_ID Company_Phone Employee_Name1 Employee_ID1 Employee_Name2 Employee_ID2 Employee_Name3 Employee_ID3
#1: KFC 1 237389 John 1001 Mary 1002 Jane 1003
#2: McD 2 237002 Joshua 2001 Anne 2002 NA NA

Merging rows in an R dataframe with duplicate IDs

You could use summarize_all, grouped by person_id; for each variable this keeps the first non-NA value per person_id.
I added a pivot_wider to preserve the different test_dates (as pointed out by @Andrea M).
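Here is a small df to illustrate, reconstructed from the output below (which rows carry the NAs is an assumption):

df <- data.frame(
  person_id      = c("x", "x", "y"),
  serial_number  = c("c", NA, "e"),
  freezer_number = c("d", NA, "f"),
  test_1         = c("positive", NA, "positive"),
  test_2         = c(NA, "positive", NA),
  test_3         = NA,   # never recorded, stays logical NA
  test_4         = NA,
  test_date      = c("01/01/2010", "05/01/2010", "02/02/2020")
)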

library(dplyr)
library(tidyr)       # for pivot_wider()
library(lubridate)

df1 <- df %>%
  group_by(person_id) %>%
  mutate(id = seq_along(person_id)) %>%
  pivot_wider(names_from = id,
              values_from = test_date,
              names_prefix = "test_date") %>%
  summarize_all(list(~ .[!is.na(.)][1]))

Output

> df1
# A tibble: 2 x 9
person_id serial_number freezer_number test_1 test_2 test_3 test_4 test_date1 test_date2
<chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
1 x c d positive positive NA NA 01/01/2010 05/01/2010
2 y e f positive NA NA NA 02/02/2020 NA

Sum duplicate rows that are grouped and combine their IDs in R

We could use group_by with summarise to paste (str_c) the 'dive_phase' values together and sum the 'beats20_max'; the final select(any_of(names(df1))) restores the original column order and drops the columns that were not summarised.

library(dplyr)
library(stringr)
df1 %>%
  group_by(seal_ID, diveNum, datetime) %>%
  summarise(dive_phase = str_c(dive_phase, collapse = ""),
            beats20_max = sum(beats20_max, na.rm = TRUE), .groups = 'drop') %>%
  select(any_of(names(df1)))

Output

# A tibble: 12 × 5
seal_ID diveNum dive_phase datetime beats20_max
<chr> <int> <chr> <chr> <int>
1 Baikal 19 D 2019-04-02 14:43:00 12
2 Baikal 19 D 2019-04-02 14:43:20 14
3 Baikal 19 D 2019-04-02 14:43:40 15
4 Baikal 19 D 2019-04-02 14:44:00 15
5 Baikal 19 D 2019-04-02 14:44:20 14
6 Baikal 19 D 2019-04-02 14:44:40 13
7 Baikal 19 D 2019-04-02 14:45:00 15
8 Baikal 19 D 2019-04-02 14:45:20 15
9 Baikal 19 D 2019-04-02 14:45:40 15
10 Baikal 19 BD 2019-04-02 14:46:00 16
11 Baikal 19 B 2019-04-02 14:46:20 15
12 Baikal 19 B 2019-04-02 14:46:40 15

data

df1 <- structure(list(seal_ID = c("Baikal", "Baikal", "Baikal", "Baikal", 
"Baikal", "Baikal", "Baikal", "Baikal", "Baikal", "Baikal", "Baikal",
"Baikal", "Baikal"), diveNum = c(19L, 19L, 19L, 19L, 19L, 19L,
19L, 19L, 19L, 19L, 19L, 19L, 19L), dive_phase = c("D", "D",
"D", "D", "D", "D", "D", "D", "D", "B", "D", "B", "B"),
datetime = c("2019-04-02 14:43:00",
"2019-04-02 14:43:20", "2019-04-02 14:43:40", "2019-04-02 14:44:00",
"2019-04-02 14:44:20", "2019-04-02 14:44:40", "2019-04-02 14:45:00",
"2019-04-02 14:45:20", "2019-04-02 14:45:40", "2019-04-02 14:46:00",
"2019-04-02 14:46:00", "2019-04-02 14:46:20", "2019-04-02 14:46:40"
), HR_mean = c(38.6, 42.2, 44, 45.5, 42.1, 39.9, 45.5, 44.6,
45.9, 46.1, 55.8, 47.4, 45.4), HR_max = c(44.8, 48, 54.1, 61.9,
49.2, 44.1, 54.5, 53.1, 51.7, 51.7, 59.4, 57.1, 53.6), beats20_mean = c(6.5,
7.5, 8, 8, 7.5, 7, 8, 8, 8, 7.5, 1.5, 8, 8), beats20_max = c(12L,
14L, 15L, 15L, 14L, 13L, 15L, 15L, 15L, 14L, 2L, 15L, 15L)),
class = "data.frame", row.names = c("8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
"20"))

Find duplicate IDs across columns and combine rows together in R

I identified some omissions from the routines I originally posted which gave rise to the duplicates you found. I have corrected those in the revised code below:
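To illustrate the shape of the input, here is a small hypothetical `data` (made-up values; the point is that ID 100 appears in both row 1 and row 4, so those rows should end up combined):

data <- data.frame(
  ID1 = c(100, 110, 111, 103),
  ID2 = c(101, 113, 211, 100),  # 100 recurs here, linking row 4 to row 1
  ID3 = c(102, NA,  311, NA)
)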

library(dplyr)
library(tidyr)

data.duplicates = union(intersect(data$ID1, data$ID2), intersect(data$ID1, data$ID3))

data.t =
  data.frame(t(data)) %>%
  mutate(Cols = rownames(.)) %>%
  gather(key = row, value = ID, starts_with("X"))

data.t.joined =
  filter(data.t, !is.na(ID)) %>%
  inner_join(data.t, by = "ID")

data.t.combined =
  data.t.joined %>%
  mutate(row.x = as.integer(gsub("X", "", row.x))) %>%
  select(row.x, ID) %>%
  group_by(row.x) %>%
  summarise(IDs = list(sort(unique(ID))))

# Merge ID sets that share at least one element, then drop duplicated sets
mergesubsets <- function(thedata){
  rows = nrow(thedata)
  for (i in 1:rows){
    entry = unlist(thedata$IDs[i])
    for (j in 1:rows){
      if (i != j){
        otherentry = unlist(thedata$IDs[j])
        if (max(entry %in% otherentry) == 1) {  # any shared ID?
          entry = sort(union(entry, otherentry))
          thedata$IDs[i] = list(entry)
        }
      }
    }
  }
  thedata[!duplicated(thedata$IDs), ]
}

# Expand the list column of ID sets into a rectangular data frame, padded with NA
listtodataframe <- function(thedata){
  rows = nrow(thedata)
  cols = max(sapply(thedata$IDs, length))
  result = matrix(nrow = rows, ncol = cols)
  for (i in 1:rows){
    entry = unlist(thedata$IDs[i])
    for (j in 1:length(entry)){
      result[i, j] = entry[j]
    }
  }
  data.frame(result)
}

data.t.merged = data.t.combined

prevrows = 0
rows = nrow(data.t.merged)
starttime = proc.time()[3]

# keep merging until the number of rows stops shrinking
while (rows != prevrows) {
  prevrows = rows
  data.t.merged = mergesubsets(data.t.merged)
  rows = nrow(data.t.merged)
}

endtime = proc.time()[3]
timetorun = endtime - starttime
timetorun

result = listtodataframe(data.t.merged)
result
write.csv(result, "result.csv", row.names = FALSE)

I have tested the revised code with a different dataset of 500 rows of numeric data, which is too big to post here. It now correctly identifies all unique sets of values, no matter how many columns they span. In some tests with randomised data I have ended up with a single row containing all the unique values.

Sorry to say I have not so far been able to speed up this version, which as you mentioned is time consuming when operating on 11,000 rows. I appreciate that your real data is non-numeric, which is also less efficient to process than numeric data.

My tests have been on 500 rows, taking approx. 25 seconds to complete.

An example of its output from the 500 rows of all-numeric test data is shown below.

> result
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1 100 101 102 103 104 105 200 300 NA NA NA NA NA
2 110 113 210 321 NA NA NA NA NA NA NA NA NA
3 111 211 311 NA NA NA NA NA NA NA NA NA NA
4 112 312 412 NA NA NA NA NA NA NA NA NA NA
5 500 600 601 602 603 604 605 610 613 700 710 800 821
6 611 711 811 NA NA NA NA NA NA NA NA NA NA
7 612 812 912 NA NA NA NA NA NA NA NA NA NA
8 1000 1100 1101 1102 1103 1104 1105 1110 1113 1200 1210 1300 1321
9 1111 1211 1311 NA NA NA NA NA NA NA NA NA NA
10 1112 1312 1412 NA NA NA NA NA NA NA NA NA NA
11 1500 1600 1601 1602 1603 1604 1605 1610 1613 1700 1710 1800 1821

consolidate duplicate rows and add column in R

To go from df1 to df.ideal, you can use aggregate().
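With example data reconstructed from the outputs below (an assumption):

df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"),
                  col2 = c(1, 2, 3, 4))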

aggregate(col2 ~ col1, df1, paste, collapse = ",")
#    col1 col2
# 1 test1    1
# 2 test2  2,3
# 3 test3    4

If you want to get to df.ideal2, that's more of a reshaping from long to wide process. You can do

reshape(transform(df1, time = ave(col2, col1, FUN = seq_along)),
        idvar = "col1", direction = "wide")
#    col1 col2.1 col2.2
# 1 test1      1     NA
# 2 test2      2      3
# 4 test3      4     NA

using just the base reshape() function.

How to create an ID column for duplicate rows based on data from another column?
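Assuming data like the following, reconstructed from the output shown below:

data <- data.frame(Study_ID = rep(c(100, 200, 300), each = 2),
                   ear      = rep(c("Left", "Right"), times = 3))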

transform(data, ear_ID = paste(Study_ID, substr(ear, 1, 1), sep = '_'))

  Study_ID   ear ear_ID
1      100  Left  100_L
2      100 Right  100_R
3      200  Left  200_L
4      200 Right  200_R
5      300  Left  300_L
6      300 Right  300_R

Note that with the tidyverse you can simply group by both columns, and each Study_ID/ear combination will then be treated as the unique identifier of the ear.
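A minimal dplyr sketch of that idea (the grouping itself is what marks each ear as a distinct unit; cur_group_id() is just one way to materialise it as a column, and the column name ear_group is made up):

library(dplyr)

data %>%
  group_by(Study_ID, ear) %>%             # each Study_ID/ear pair is its own group
  mutate(ear_group = cur_group_id()) %>%  # optional: expose the grouping as a numeric ID
  ungroup()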


