Merge Data.Frames with Duplicates

Duplicated rows when merging dataframes in Python

list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])

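The duplicate rows are expected. A merge pairs every matching row on one side with every matching row on the other, so each john smith in list_1 matches each john smith in list_2. To avoid this, I dropped the duplicates from one of the lists before merging. I chose list_2.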
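To see why the row count grows, here is a minimal sketch with made-up data. Only the email_address column name comes from the answer above; the names, addresses, and the city column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the email_address column name comes from the answer.
list_1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "amy@example.com"],
    "name": ["john smith", "john smith", "amy jones"],
})
list_2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "amy@example.com"],
    "city": ["Leeds", "Leeds", "York"],
})

# Each john row on the left pairs with each john row on the right: 2 * 2 + 1 = 5 rows.
raw = pd.merge(list_1, list_2, on=["email_address"])

# drop_duplicates() with no arguments removes rows that are identical in every column.
list_2_nodups = list_2.drop_duplicates()
deduped = pd.merge(list_1, list_2_nodups, on=["email_address"])

print(len(raw), len(deduped))
```

Note that drop_duplicates() without arguments only helps when the repeated rows are identical in every column. If they differ in some other column, a subset= argument naming the key column is needed.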

Merge dataframes based on column values with duplicated rows

You have to specify the different column names to match on with left_on and right_on. Also specify how='right' to use only keys from the right frame.

df_merged = pd.merge(df1, df2, left_on='FromPatchID', right_on='Id', how='right')
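As a rough illustration of what how='right' does here, the sketch below uses invented frames. The column names FromPatchID and Id come from the answer; the value and label columns and all the data are assumptions.

```python
import pandas as pd

# Hypothetical data; only the FromPatchID and Id column names come from the answer.
df1 = pd.DataFrame({"FromPatchID": [1, 1, 2], "value": ["a", "b", "c"]})
df2 = pd.DataFrame({"Id": [1, 2, 3], "label": ["x", "y", "z"]})

df_merged = pd.merge(df1, df2, left_on="FromPatchID", right_on="Id", how="right")

# Every Id from df2 is kept. Id 1 appears twice because two df1 rows match it,
# and Id 3 appears once with missing values in the df1 columns.
print(df_merged)
```

Because the key columns have different names, both FromPatchID and Id appear in the result. One of them can be dropped afterwards if it is not needed.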

Pandas: Merge two dataframes with duplicate rows

You can call drop_duplicates with subset=['CUSTOMER_FULL_NAME'] on orders inside the merge, and use how='left' to keep all rows from people:

full = pd.merge(
    people,
    orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first'),  # here is the difference
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
    how='left'  # and add how='left'
)

So orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first') contains each name only once, and during the merge each row of people matches at most one row of orders.
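A small sketch of this, assuming invented people and orders frames (the ORDER_ID column and all the values are hypothetical). It also passes validate='many_to_one', which asks pandas to raise an error if the right-hand keys are not unique, as a guard in case the de-duplication step is ever removed.

```python
import pandas as pd

# Hypothetical data; only FULL_NAME and CUSTOMER_FULL_NAME come from the answer.
people = pd.DataFrame({"FULL_NAME": ["Ann Lee", "Bob Ray", "Cy Poe"]})
orders = pd.DataFrame({
    "CUSTOMER_FULL_NAME": ["Ann Lee", "Ann Lee", "Bob Ray"],
    "ORDER_ID": [1, 2, 3],
})

# Keep only the first order row for each customer name.
first_orders = orders.drop_duplicates(subset=["CUSTOMER_FULL_NAME"], keep="first")

full = pd.merge(
    people,
    first_orders,
    left_on="FULL_NAME",
    right_on="CUSTOMER_FULL_NAME",
    how="left",
    validate="many_to_one",  # raises MergeError if a name repeats on the right
)

# The left join keeps one row per person; Cy Poe has no order, so ORDER_ID is NaN.
print(full)
```

Keep in mind that keep='first' silently discards the other orders for a customer, so this only fits cases where one row per name is really what is wanted.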

Merging multiple data frames causing duplicate column names

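You can concatenate the frames on key, labelling each frame with its position, and then flatten the resulting column names: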

s = pd.concat([x.set_index('key') for x in df_list],axis = 1,keys=range(len(df_list)))
s.columns = s.columns.map('{0[1]}_{0[0]}'.format)
s = s.reset_index()
s
Out[236]:
  key   value_0   value_1   value_2   value_3
0   A -1.957968       NaN -0.852135 -0.976960
1   B  1.545932 -0.276838       NaN  0.197615
2   C -2.149727       NaN -0.364382  0.349993
3   D  0.524990 -0.476655       NaN       NaN
4   E       NaN -2.135870  0.798782       NaN
5   F       NaN  1.456544 -0.255705  0.447279
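An alternative that some may find easier to read is to rename the clashing column in each frame before merging, then fold the list with functools.reduce. This is a swapped-in technique, not the concat approach above, and the two small frames below are invented for illustration. It assumes, as in the answer, that every frame has a key column and a value column.

```python
from functools import reduce

import pandas as pd

# Hypothetical input frames sharing the column names "key" and "value".
df_list = [
    pd.DataFrame({"key": ["A", "B"], "value": [1.0, 2.0]}),
    pd.DataFrame({"key": ["B", "C"], "value": [3.0, 4.0]}),
]

# Give each frame's value column a unique name: value_0, value_1, ...
renamed = [
    df.rename(columns={"value": f"value_{i}"}) for i, df in enumerate(df_list)
]

# Outer-merge the frames one after another on the shared key.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="key", how="outer"),
    renamed,
)

print(merged)
```

Unlike the concat version, this does not require key to be unique within each frame, although repeated keys would then multiply rows in the usual way.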

Merge data.frames with duplicates

First define a function, run.seq, that numbers the duplicates of each name. Judging from the output, what is desired is that the ith duplicate of each name in one data frame be matched with the ith duplicate of that name in the others. Then create a list of the data frames and add a run.seq column to each component. Finally, use Reduce to merge them all.

run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))

L <- list(df1, df2, df3)
L2 <- lapply(L, function(x) cbind(x, run.seq = run.seq(x$names)))

out <- Reduce(function(...) merge(..., all = TRUE), L2)[-2]

The last line gives:

> out
  names data1 data2 data3
1     a     1     1    NA
2     b     2    NA    NA
3     c     3     4     1
4     c     4     5    NA
5     d     5     6    NA
6     e    NA     2     2
7     e    NA     3    NA

EDIT: Revised run.seq so that input need not be sorted.
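For readers working in pandas rather than R, the same idea can be sketched with groupby(...).cumcount(), which numbers repeated names within each frame. The three frames below are invented sample data, not the asker's, and the helper name add_run_seq is hypothetical.

```python
from functools import reduce

import pandas as pd

# Hypothetical sample frames with repeated names.
df1 = pd.DataFrame({"names": ["a", "b", "c", "c"], "data1": [1, 2, 3, 4]})
df2 = pd.DataFrame({"names": ["c", "c", "e"], "data2": [10, 20, 30]})
df3 = pd.DataFrame({"names": ["c", "e"], "data3": [100, 200]})


def add_run_seq(df):
    """Number the occurrences of each name: 0 for the first, 1 for the second, ..."""
    return df.assign(run_seq=df.groupby("names").cumcount())


frames = [add_run_seq(df) for df in (df1, df2, df3)]

# Merge on both the name and its occurrence number, keeping unmatched rows.
out = reduce(
    lambda left, right: pd.merge(left, right, on=["names", "run_seq"], how="outer"),
    frames,
)

# Sort for a stable display order, then drop the helper column.
out = (
    out.sort_values(["names", "run_seq"])
    .reset_index(drop=True)
    .drop(columns="run_seq")
)
print(out)
```

Like the revised run.seq, cumcount() numbers occurrences in order of appearance, so the input does not need to be sorted by name.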

How to merge two dataframes and retain only unique data from initial dataframes without duplicates?

If we can assume the two dfs can be perfectly aligned (i.e. each has the same number of rows and the same combinations of factor levels) then your final suggestion should work. To make it more robust, you could first check if the dfs align correctly with all.equal(), and handle failures however you prefer.

library(dplyr)
## Create dfs
df1 <- data.frame(f1 = factor(rep("A", 4)),
                  f2 = factor(c("Y", "Y", "Z", "Z")),
                  num1 = c(1, 2, 3, 4))

df2 <- data.frame(f1 = factor(rep("A", 4)),
                  f2 = factor(c("Z", "Z", "Y", "Y")),
                  num2 = c(15, 16, 17, 18))

# Align dfs
df1 <- df1 %>% arrange(f1, f2, num1)
df2 <- df2 %>% arrange(f1, f2, num2)

# Check that dfs are correctly aligned
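# all.equal() returns TRUE or a character vector describing the differences,
# so wrap it in isTRUE() before negating
if (!isTRUE(all.equal(df1[, c('f1', 'f2')], df2[, c('f1', 'f2')]))) {
  print("Failed to align.")
}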

# Copy column across
df1$num2 <- df2$num2

df1
  f1 f2 num1 num2
1  A  Y    1   17
2  A  Y    2   18
3  A  Z    3   15
4  A  Z    4   16
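The same align, check, and copy steps could look like this in pandas. This is a sketch using the data from the R example above, with plain string columns standing in for R factors; it stops with an error on misalignment, which is one possible way to handle the failure.

```python
import pandas as pd

# Same data as the R example above.
df1 = pd.DataFrame({"f1": ["A"] * 4, "f2": ["Y", "Y", "Z", "Z"], "num1": [1, 2, 3, 4]})
df2 = pd.DataFrame({"f1": ["A"] * 4, "f2": ["Z", "Z", "Y", "Y"], "num2": [15, 16, 17, 18]})

keys = ["f1", "f2"]

# Align: sort both frames the same way and reset the index so rows line up by position.
df1 = df1.sort_values(keys + ["num1"]).reset_index(drop=True)
df2 = df2.sort_values(keys + ["num2"]).reset_index(drop=True)

# Check that the key columns now match row for row.
if not df1[keys].equals(df2[keys]):
    raise ValueError("Failed to align.")

# Copy the column across.
df1["num2"] = df2["num2"]
print(df1)
```

Resetting the index matters here: pandas assigns columns by index label, not by position, so without it the copied values would follow the original row labels.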

