Merge Data.Frames with Duplicates

Duplicated rows when merging dataframes in Python

list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1, list_2_nodups, on=['email_address'])

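The duplicate rows are expected: each John Smith row in list_1 matches every John Smith row in list_2, so n matching rows on one side and m on the other produce n × m rows in the result. To avoid this, I dropped the duplicates from one of the lists; I chose list_2. Note that drop_duplicates() with no arguments removes only rows that are identical in every column; pass subset=['email_address'] if rows sharing a key differ in other columns.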
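To make the row multiplication concrete, here is a minimal sketch. The sample data and the extra columns (source, name) are made up for illustration; only list_1, list_2, list_2_nodups and email_address come from the answer above.

```python
import pandas as pd

# Hypothetical sample data: the same email appears twice in each list.
list_1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "amy@example.com"],
    "source": ["a", "b", "c"],
})
list_2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "amy@example.com"],
    "name": ["john smith", "john smith", "amy jones"],
})

# Each of the 2 john rows on the left matches both john rows on the right:
# 2 * 2 rows for john plus 1 row for amy.
raw = pd.merge(list_1, list_2, on=["email_address"])

# Dropping fully identical rows leaves one row per email in this sample.
list_2_nodups = list_2.drop_duplicates()

# validate="many_to_one" asks pandas to raise a MergeError if the keys
# on the right side are not unique, which guards against silent blow-up.
deduped = pd.merge(
    list_1, list_2_nodups, on=["email_address"], validate="many_to_one"
)

print(len(raw), len(deduped))
```

The validate argument is optional, but it turns an unexpected duplicate key into an error instead of extra rows.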

Merge dataframes based on column values with duplicated rows

Because the key columns have different names in the two frames, specify them with left_on and right_on. Also pass how='right' to use only the keys from the right frame.

df_merged = pd.merge(df1, df2, left_on='FromPatchID', right_on='Id', how='right')
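A small sketch of what that call does, assuming made-up contents for df1 and df2 (only the column names FromPatchID and Id come from the answer; flow and label are hypothetical):

```python
import pandas as pd

# Hypothetical frames: FromPatchID repeats the value 1, and Id 3 has no match.
df1 = pd.DataFrame({"FromPatchID": [1, 1, 2], "flow": [10, 20, 30]})
df2 = pd.DataFrame({"Id": [1, 2, 3], "label": ["p1", "p2", "p3"]})

# Match df1.FromPatchID against df2.Id and keep every key from df2.
df_merged = pd.merge(df1, df2, left_on="FromPatchID", right_on="Id", how="right")

print(df_merged)
```

Every Id from df2 appears in the result: Id 1 appears twice because two rows of df1 reference it, and Id 3 appears once with missing values in the df1 columns.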

Pandas: Merge two dataframes with duplicate rows

Call drop_duplicates with subset=['CUSTOMER_FULL_NAME'] on orders inside the merge, and pass how='left' to keep all rows from people:

full = pd.merge(
    people,
    # the difference: keep only the first order row for each customer name
    orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first'),
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
    how='left'  # keep every row of people, matched or not
)

orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first') contains each name only once, so during the merge each person matches at most one order row and people keeps its original row count.
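For illustration, a runnable sketch with invented data (the names and the ORDER_ID column are assumptions; the column names FULL_NAME and CUSTOMER_FULL_NAME are from the answer). Keep in mind that keep='first' silently discards any later orders for the same name.

```python
import pandas as pd

# Hypothetical data: Ann has two orders, Cy has none.
people = pd.DataFrame({"FULL_NAME": ["Ann Lee", "Bob Ray", "Cy Moss"]})
orders = pd.DataFrame({
    "CUSTOMER_FULL_NAME": ["Ann Lee", "Ann Lee", "Bob Ray"],
    "ORDER_ID": [101, 102, 103],
})

# One row per customer name: the first order encountered is kept.
first_orders = orders.drop_duplicates(subset=["CUSTOMER_FULL_NAME"], keep="first")

full = pd.merge(
    people,
    first_orders,
    left_on="FULL_NAME",
    right_on="CUSTOMER_FULL_NAME",
    how="left",
)

print(full)
```

The result has exactly one row per person; Cy Moss is kept with missing order columns, and order 102 does not appear.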

Merging multiple data frames causing duplicate column names

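You can concatenate the frames column-wise, keyed by their position in the list, and then flatten the resulting two-level column names:

s = pd.concat([x.set_index('key') for x in df_list], axis=1, keys=range(len(df_list)))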
s.columns = s.columns.map('{0[1]}_{0[0]}'.format)
s = s.reset_index()
s
Out[236]: 
  key   value_0   value_1   value_2   value_3
0   A -1.957968       NaN -0.852135 -0.976960
1   B  1.545932 -0.276838       NaN  0.197615
2   C -2.149727       NaN -0.364382  0.349993
3   D  0.524990 -0.476655       NaN       NaN
4   E       NaN -2.135870  0.798782       NaN
5   F       NaN  1.456544 -0.255705  0.447279
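The output above depends on a df_list that is not shown. Here is a self-contained sketch with a small made-up df_list, assuming each frame has a key column that is unique within that frame and a value column:

```python
import pandas as pd

# Hypothetical input: three frames that share the column names key and value.
df_list = [
    pd.DataFrame({"key": ["A", "B"], "value": [1.0, 2.0]}),
    pd.DataFrame({"key": ["B", "C"], "value": [3.0, 4.0]}),
    pd.DataFrame({"key": ["A", "C"], "value": [5.0, 6.0]}),
]

# keys=range(...) adds an outer column level (0, 1, 2) identifying the source frame.
s = pd.concat(
    [x.set_index("key") for x in df_list],
    axis=1,
    keys=range(len(df_list)),
)

# Each column label is a tuple (frame_number, original_name);
# turn it into "original_name_frame_number".
s.columns = s.columns.map("{0[1]}_{0[0]}".format)
s = s.reset_index()

print(s)
```

Keys missing from a frame show up as NaN in that frame's column, as in the output above.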

Merge data.frames with duplicates

From the desired output, it appears that the ith duplicate of each name in one data frame should be paired with the ith duplicate of that name in the others. First define a function, run.seq, that numbers the duplicates of each name (1, 2, ...). Then put the data frames in a list and add a run.seq column to each one. Finally, use Reduce to merge them all on names and run.seq, and drop the run.seq column from the result.

run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))

L <- list(df1, df2, df3)
L2 <- lapply(L, function(x) cbind(x, run.seq = run.seq(x$names)))

out <- Reduce(function(...) merge(..., all = TRUE), L2)[-2]

The last line gives:

> out
  names data1 data2 data3
1     a     1     1    NA
2     b     2    NA    NA
3     c     3     4     1
4     c     4     5    NA
5     d     5     6    NA
6     e    NA     2     2
7     e    NA     3    NA

EDIT: Revised run.seq so that input need not be sorted.
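For pandas users, a rough analog of the same idea is sketched below. This is not the answer's method, just an assumed translation: groupby().cumcount() plays the role of run.seq, and functools.reduce plays the role of Reduce. The input frames are not given in the answer, so the ones here are invented to be consistent with the output shown above.

```python
from functools import reduce

import pandas as pd

# Hypothetical inputs, chosen to reproduce the output table above.
df1 = pd.DataFrame({"names": ["a", "b", "c", "c", "d"], "data1": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"names": ["a", "c", "c", "d", "e", "e"], "data2": [1, 4, 5, 6, 2, 3]})
df3 = pd.DataFrame({"names": ["c", "e"], "data3": [1, 2]})


def add_run_seq(df):
    """Number the occurrences of each name 1, 2, ... in order of appearance."""
    return df.assign(run_seq=df.groupby("names").cumcount() + 1)


frames = [add_run_seq(df) for df in (df1, df2, df3)]

# Outer-merge on (names, run_seq) so the ith duplicate pairs with the ith duplicate.
out = reduce(
    lambda left, right: pd.merge(left, right, on=["names", "run_seq"], how="outer"),
    frames,
)
out = (
    out.sort_values(["names", "run_seq"])
    .drop(columns="run_seq")
    .reset_index(drop=True)
)

print(out)
```

Like the revised run.seq, cumcount numbers duplicates in order of appearance, so the inputs do not need to be sorted.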

How to merge two dataframes and retain only unique data from initial dataframes without duplicates?

If we can assume the two data frames can be perfectly aligned (i.e. each has the same number of rows and the same combinations of factor levels), then your final suggestion should work. To make it more robust, you could first check that the data frames align correctly with all.equal(), and handle failures however you prefer.

library(dplyr)
## Create dfs
df1 <- data.frame(f1 = factor(rep("A", 4)),
                  f2 = factor(c("Y", "Y", "Z", "Z")),
                  num1 = c(1,2,3,4))

df2 <- data.frame(f1 = factor(rep("A", 4)),
                  f2 = factor(c("Z", "Z", "Y", "Y")),
                  num2 = c(15,16,17,18))

# Align dfs 
df1 <- df1 %>% arrange(f1, f2, num1) 
df2 <- df2 %>% arrange(f1, f2, num2)

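# Check that dfs are correctly aligned.
# all.equal() returns TRUE or a character vector describing the differences,
# so wrap it in isTRUE() before negating.
if (!isTRUE(all.equal(df1[, c('f1', 'f2')], df2[, c('f1', 'f2')]))) {
    print("Failed to align.")
}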

# Copy column across
df1$num2 <- df2$num2

df1
  f1 f2 num1 num2
1  A  Y    1   17
2  A  Y    2   18
3  A  Z    3   15
4  A  Z    4   16
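The same sort, check, and copy approach can be sketched in pandas. This is an assumed translation of the R code above, using the same data; raising ValueError on misalignment is just one way to handle the failure.

```python
import pandas as pd

# Same data as the R example above.
df1 = pd.DataFrame({"f1": ["A"] * 4, "f2": ["Y", "Y", "Z", "Z"], "num1": [1, 2, 3, 4]})
df2 = pd.DataFrame({"f1": ["A"] * 4, "f2": ["Z", "Z", "Y", "Y"], "num2": [15, 16, 17, 18]})

# Align both frames by sorting on the factor columns, then on the value column.
df1 = df1.sort_values(["f1", "f2", "num1"]).reset_index(drop=True)
df2 = df2.sort_values(["f1", "f2", "num2"]).reset_index(drop=True)

# Check that the factor columns now match row for row.
if not df1[["f1", "f2"]].equals(df2[["f1", "f2"]]):
    raise ValueError("Failed to align.")

# Copy the column across; the indexes match after reset_index.
df1["num2"] = df2["num2"]

print(df1)
```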