Combining Duplicated Rows in R and Adding New Column Containing Ids of Duplicates

A call to aggregate() should do the trick.
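For reference, here is a minimal sketch of what df might look like, reconstructed from the output shown below (the column names come from that output; the row layout is an assumption):

df <- data.frame(
  Chr     = "chr1",
  start   = c(5179574, 5179574, 5179574, 5336806, 5347979),
  stop    = c(5183384, 5184738, 5184738, 5358384, 5358384),
  ref     = "ref",
  alt     = "Del",
  Hom.het = "Het",
  ID      = c(719L, 915L, 951L, 376L, 228L)  # rows 2 and 3 are duplicates apart from ID
)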

Here's an option that collects the IDs in a list:

(df1 <- aggregate(df[7], df[-7], unique))
#    Chr   start    stop ref alt Hom.het       ID
# 1 chr1 5179574 5183384 ref Del     Het      719
# 2 chr1 5179574 5184738 ref Del     Het 915, 951
# 3 chr1 5336806 5358384 ref Del     Het      376
# 4 chr1 5347979 5358384 ref Del     Het      228

And here's one that collects them in a character vector:

df2 <- aggregate(df[7], df[-7],
                 FUN = function(X) paste(unique(X), collapse = ", "))

Comparing the results of the two options:

str(df1$ID)
# List of 4
# $ 0: int 719
# $ 3: int [1:2] 915 951
# $ 7: int 376
# $ 8: int 228

str(df2$ID)
# chr [1:4] "719" "915, 951" "376" "228"

Combine duplicate rows in dataframe and create new columns

We could do this with dcast from data.table, which can take multiple value.var columns. Convert the 'data.frame' to 'data.table' (setDT(dataframe1)), then, grouped by 'Company_Name', replace the 'Company_Phone' elements with the first value in each group, and finally dcast from 'long' to 'wide', specifying 'Employee_Name' and 'Employee_ID' as the value.var columns.
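For illustration, a small dataframe1 inferred from the output below (the values, and the blank phone entries in the duplicated rows, are assumptions):

dataframe1 <- data.frame(
  Company_Name  = c("KFC", "KFC", "KFC", "McD", "McD"),
  Company_ID    = c(1, 1, 1, 2, 2),
  Company_Phone = c("237389", "", "", "237002", ""),  # assumed: only the first row per company has the phone
  Employee_Name = c("John", "Mary", "Jane", "Joshua", "Anne"),
  Employee_ID   = c(1001, 1002, 1003, 2001, 2002)
)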

library(data.table)
setDT(dataframe1)[, Company_Phone := first(Company_Phone), Company_Name]
res <- dcast(dataframe1, Company_Name + Company_ID + Company_Phone ~ rowid(Company_Name),
             value.var = c("Employee_Name", "Employee_ID"), sep = '')

Output

res
#Company_Name Company_ID Company_Phone Employee_Name1 Employee_Name2 Employee_Name3 Employee_ID1 Employee_ID2 Employee_ID3
#1: KFC 1 237389 John Mary Jane 1001 1002 1003
#2: McD 2 237002 Joshua Anne NA 2001 2002 NA

If we need to reorder the columns so that each Employee_Name is followed by its matching Employee_ID:

res[, c(1:3, order(as.numeric(sub("\\D+", "", names(res)[-(1:3)]))) + 3), with = FALSE]
# Company_Name Company_ID Company_Phone Employee_Name1 Employee_ID1 Employee_Name2 Employee_ID2 Employee_Name3 Employee_ID3
#1: KFC 1 237389 John 1001 Mary 1002 Jane 1003
#2: McD 2 237002 Joshua 2001 Anne 2002 NA NA

Merging rows in an R dataframe with duplicate IDs

You could use summarize_all, grouped by person_id; for each variable this keeps the first non-NA value per person_id.
I added a pivot_wider to preserve the different test_dates (as pointed out by @Andrea M).
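Here is a small df to illustrate, reconstructed from the output below (which rows carry the NAs is an assumption):

df <- data.frame(
  person_id      = c("x", "x", "y"),
  serial_number  = c("c", NA, "e"),
  freezer_number = c("d", NA, "f"),
  test_1         = c("positive", NA, "positive"),
  test_2         = c(NA, "positive", NA),
  test_3         = NA,   # never recorded, stays logical NA
  test_4         = NA,
  test_date      = c("01/01/2010", "05/01/2010", "02/02/2020")
)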

library(dplyr)
library(tidyr)       # for pivot_wider()
library(lubridate)

df1 <- df %>%
  group_by(person_id) %>%
  mutate(id = seq_along(person_id)) %>%
  pivot_wider(names_from = id,
              values_from = test_date,
              names_prefix = "test_date") %>%
  summarize_all(list(~ .[!is.na(.)][1]))

Output

> df1
# A tibble: 2 x 9
person_id serial_number freezer_number test_1 test_2 test_3 test_4 test_date1 test_date2
<chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
1 x c d positive positive NA NA 01/01/2010 05/01/2010
2 y e f positive NA NA NA 02/02/2020 NA

Sum duplicate rows that are grouped and combine their IDs in R

We could use group_by with summarise to paste (str_c) the 'dive_phase' values together and sum the 'beats20_max'; the final select(any_of(names(df1))) restores the original column order and drops the columns that were not summarised.

library(dplyr)
library(stringr)
df1 %>%
  group_by(seal_ID, diveNum, datetime) %>%
  summarise(dive_phase = str_c(dive_phase, collapse = ""),
            beats20_max = sum(beats20_max, na.rm = TRUE), .groups = 'drop') %>%
  select(any_of(names(df1)))

Output

# A tibble: 12 × 5
seal_ID diveNum dive_phase datetime beats20_max
<chr> <int> <chr> <chr> <int>
1 Baikal 19 D 2019-04-02 14:43:00 12
2 Baikal 19 D 2019-04-02 14:43:20 14
3 Baikal 19 D 2019-04-02 14:43:40 15
4 Baikal 19 D 2019-04-02 14:44:00 15
5 Baikal 19 D 2019-04-02 14:44:20 14
6 Baikal 19 D 2019-04-02 14:44:40 13
7 Baikal 19 D 2019-04-02 14:45:00 15
8 Baikal 19 D 2019-04-02 14:45:20 15
9 Baikal 19 D 2019-04-02 14:45:40 15
10 Baikal 19 BD 2019-04-02 14:46:00 16
11 Baikal 19 B 2019-04-02 14:46:20 15
12 Baikal 19 B 2019-04-02 14:46:40 15

data

df1 <- structure(list(seal_ID = c("Baikal", "Baikal", "Baikal", "Baikal", 
"Baikal", "Baikal", "Baikal", "Baikal", "Baikal", "Baikal", "Baikal",
"Baikal", "Baikal"), diveNum = c(19L, 19L, 19L, 19L, 19L, 19L,
19L, 19L, 19L, 19L, 19L, 19L, 19L), dive_phase = c("D", "D",
"D", "D", "D", "D", "D", "D", "D", "B", "D", "B", "B"),
datetime = c("2019-04-02 14:43:00",
"2019-04-02 14:43:20", "2019-04-02 14:43:40", "2019-04-02 14:44:00",
"2019-04-02 14:44:20", "2019-04-02 14:44:40", "2019-04-02 14:45:00",
"2019-04-02 14:45:20", "2019-04-02 14:45:40", "2019-04-02 14:46:00",
"2019-04-02 14:46:00", "2019-04-02 14:46:20", "2019-04-02 14:46:40"
), HR_mean = c(38.6, 42.2, 44, 45.5, 42.1, 39.9, 45.5, 44.6,
45.9, 46.1, 55.8, 47.4, 45.4), HR_max = c(44.8, 48, 54.1, 61.9,
49.2, 44.1, 54.5, 53.1, 51.7, 51.7, 59.4, 57.1, 53.6), beats20_mean = c(6.5,
7.5, 8, 8, 7.5, 7, 8, 8, 8, 7.5, 1.5, 8, 8), beats20_max = c(12L,
14L, 15L, 15L, 14L, 13L, 15L, 15L, 15L, 14L, 2L, 15L, 15L)),
class = "data.frame", row.names = c("8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
"20"))

Find duplicate IDs across columns and combine rows together in R

I identified some omissions from the routines I originally posted which gave rise to the duplicates you found. I have corrected those in the revised code below:
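To illustrate the shape of the input, here is a small hypothetical `data` (made-up values; the point is that ID 100 appears in both row 1 and row 4, so those rows should end up combined):

data <- data.frame(
  ID1 = c(100, 110, 111, 103),
  ID2 = c(101, 113, 211, 100),  # 100 recurs here, linking row 4 to row 1
  ID3 = c(102, NA,  311, NA)
)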

library(dplyr)
library(tidyr)

data.duplicates = union(intersect(data$ID1, data$ID2), intersect(data$ID1, data$ID3))

data.t =
  data.frame(t(data)) %>%
  mutate(Cols = rownames(.)) %>%
  gather(key = row, value = ID, starts_with("X"))

data.t.joined =
  filter(data.t, !is.na(ID)) %>%
  inner_join(data.t, by = "ID")

data.t.combined =
  data.t.joined %>%
  mutate(row.x = as.integer(gsub("X", "", row.x))) %>%
  select(row.x, ID) %>%
  group_by(row.x) %>%
  summarise(IDs = list(sort(unique(ID))))

# Merge ID sets that share at least one element, then drop duplicated sets
mergesubsets <- function(thedata){
  rows = nrow(thedata)
  for (i in 1:rows){
    entry = unlist(thedata$IDs[i])
    for (j in 1:rows){
      if (i != j){
        otherentry = unlist(thedata$IDs[j])
        if (max(entry %in% otherentry) == 1) {  # any shared ID?
          entry = sort(union(entry, otherentry))
          thedata$IDs[i] = list(entry)
        }
      }
    }
  }
  thedata[!duplicated(thedata$IDs), ]
}

# Expand the list column of ID sets into a rectangular data frame, padded with NA
listtodataframe <- function(thedata){
  rows = nrow(thedata)
  cols = max(sapply(thedata$IDs, length))
  result = matrix(nrow = rows, ncol = cols)
  for (i in 1:rows){
    entry = unlist(thedata$IDs[i])
    for (j in 1:length(entry)){
      result[i, j] = entry[j]
    }
  }
  data.frame(result)
}

data.t.merged = data.t.combined

prevrows = 0
rows = nrow(data.t.merged)
starttime = proc.time()[3]

# keep merging until the number of rows stops shrinking
while (rows != prevrows) {
  prevrows = rows
  data.t.merged = mergesubsets(data.t.merged)
  rows = nrow(data.t.merged)
}

endtime = proc.time()[3]
timetorun = endtime - starttime
timetorun

result = listtodataframe(data.t.merged)
result
write.csv(result, "result.csv", row.names = FALSE)

I have tested the revised code with a different dataset of 500 rows of numeric data, which is too big to post here. It now correctly identifies all unique sets of values, no matter how many columns they span. In some tests with randomised data I have ended up with a single row containing all the unique values.

Sorry to say I have not so far been able to speed up this version, which as you mentioned is time consuming when operating on 11,000 rows. I appreciate that your real data is non-numeric, which is also less efficient to process than numeric data.

My tests have been on 500 rows, taking approx. 25 seconds to complete.

An example of its output from the 500 rows of all-numeric test data is shown below.

> result
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1 100 101 102 103 104 105 200 300 NA NA NA NA NA
2 110 113 210 321 NA NA NA NA NA NA NA NA NA
3 111 211 311 NA NA NA NA NA NA NA NA NA NA
4 112 312 412 NA NA NA NA NA NA NA NA NA NA
5 500 600 601 602 603 604 605 610 613 700 710 800 821
6 611 711 811 NA NA NA NA NA NA NA NA NA NA
7 612 812 912 NA NA NA NA NA NA NA NA NA NA
8 1000 1100 1101 1102 1103 1104 1105 1110 1113 1200 1210 1300 1321
9 1111 1211 1311 NA NA NA NA NA NA NA NA NA NA
10 1112 1312 1412 NA NA NA NA NA NA NA NA NA NA
11 1500 1600 1601 1602 1603 1604 1605 1610 1613 1700 1710 1800 1821

consolidate duplicate rows and add column in R

To go from df1 to df.ideal, you can use aggregate().
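With example data reconstructed from the outputs below (an assumption):

df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"),
                  col2 = c(1, 2, 3, 4))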

aggregate(col2 ~ col1, df1, paste, collapse = ",")
#    col1 col2
# 1 test1    1
# 2 test2  2,3
# 3 test3    4

If you want to get to df.ideal2, that's more of a reshaping from long to wide process. You can do

reshape(transform(df1, time = ave(col2, col1, FUN = seq_along)),
        idvar = "col1", direction = "wide")
#    col1 col2.1 col2.2
# 1 test1      1     NA
# 2 test2      2      3
# 4 test3      4     NA

using just the base reshape() function.

How to create an ID column for duplicate rows based on data from another column?
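Assuming data like the following, reconstructed from the output shown below:

data <- data.frame(Study_ID = rep(c(100, 200, 300), each = 2),
                   ear      = rep(c("Left", "Right"), times = 3))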

transform(data, ear_ID = paste(Study_ID, substr(ear, 1, 1), sep = '_'))

  Study_ID   ear ear_ID
1      100  Left  100_L
2      100 Right  100_R
3      200  Left  200_L
4      200 Right  200_R
5      300  Left  300_L
6      300 Right  300_R

Note that with the tidyverse you can simply group by both columns, and each Study_ID/ear combination will then be treated as the unique identifier of the ear.
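A minimal dplyr sketch of that idea (the grouping itself is what marks each ear as a distinct unit; cur_group_id() is just one way to materialise it as a column, and the column name ear_group is made up):

library(dplyr)

data %>%
  group_by(Study_ID, ear) %>%             # each Study_ID/ear pair is its own group
  mutate(ear_group = cur_group_id()) %>%  # optional: expose the grouping as a numeric ID
  ungroup()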


