Duplicated rows when merging dataframes in Python
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])
The duplicate rows are expected: each john smith in list_1 matches every john smith in list_2, so n matching rows on one side and m on the other produce n times m merged rows. I had to drop the duplicates in one of the lists before merging; I chose list_2.
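To see the row count grow and then shrink again, here is a minimal sketch with made-up data. The names, email addresses, and the city column are hypothetical and not taken from the question. The validate argument is an optional extra, not part of the original answer; it asks pandas to raise an error if the right-hand keys turn out not to be unique.

```python
import pandas as pd

# Hypothetical sample data: the same person appears twice in each list.
list_1 = pd.DataFrame({
    "name": ["john smith", "john smith", "jane doe"],
    "email_address": ["john@example.com", "john@example.com", "jane@example.com"],
})
list_2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com"],
    "city": ["Boston", "Boston"],
})

# Many-to-many match: 2 rows on the left x 2 rows on the right.
with_dups = pd.merge(list_1, list_2, on=["email_address"])

# Drop exact duplicate rows from one side before merging.
list_2_nodups = list_2.drop_duplicates()
without_dups = pd.merge(
    list_1,
    list_2_nodups,
    on=["email_address"],
    validate="many_to_one",  # optional check that right-hand keys are unique
)

print(len(with_dups), len(without_dups))
```

Note that drop_duplicates() without arguments only removes rows that are identical in every column; rows that share an email address but differ elsewhere would still multiply.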
Merge dataframes based on column values with duplicated rows
You have to specify the different column names to match on with left_on and right_on. Also specify how='right' to use only keys from the right frame.
df_merged = pd.merge(df1, df2, left_on='FromPatchID', right_on='Id', how='right')
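As a rough illustration of what that call returns, the sketch below uses two small made-up frames. Only the column names FromPatchID and Id come from the answer; the flow and label columns and all values are assumptions for the example.

```python
import pandas as pd

# Hypothetical frames; FromPatchID repeats, Id is unique.
df1 = pd.DataFrame({"FromPatchID": [1, 1, 2], "flow": [0.5, 0.7, 0.2]})
df2 = pd.DataFrame({"Id": [1, 2, 3], "label": ["a", "b", "c"]})

# Match df1.FromPatchID against df2.Id, keeping every key from df2.
df_merged = pd.merge(df1, df2, left_on="FromPatchID", right_on="Id", how="right")

print(df_merged)
```

Id 1 appears twice in the result because two rows of df1 point at it, and Id 3 appears once with missing values in the df1 columns because nothing in df1 references it.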
Pandas: Merge two dataframes with duplicate rows
You can apply drop_duplicates with subset=['CUSTOMER_FULL_NAME'] to orders inside the merge, and use how='left' to keep all rows from people, such as:
full = pd.merge(
    people,
    orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first'),  # the difference: one row per customer
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
    how='left'  # keep every row from people
)
So orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first') contains each name only once, and during the merge each person is matched against at most that one row.
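A small runnable sketch of the same idea follows. The people and orders frames here are invented for illustration, and the ORDER_ID column is an assumption; only FULL_NAME and CUSTOMER_FULL_NAME come from the answer.

```python
import pandas as pd

# Hypothetical data: Ann Lee has two orders, Cy Dunn has none.
people = pd.DataFrame({"FULL_NAME": ["Ann Lee", "Bob Ray", "Cy Dunn"]})
orders = pd.DataFrame({
    "CUSTOMER_FULL_NAME": ["Ann Lee", "Ann Lee", "Bob Ray"],
    "ORDER_ID": [101, 102, 103],
})

# Keep only the first order seen for each customer name.
first_orders = orders.drop_duplicates(subset=["CUSTOMER_FULL_NAME"], keep="first")

full = pd.merge(
    people,
    first_orders,
    left_on="FULL_NAME",
    right_on="CUSTOMER_FULL_NAME",
    how="left",  # every person is kept, matched or not
)

print(full)
```

The trade-off is that keep='first' silently discards the other orders for a repeated name (order 102 here), so this only fits when one row per customer is what you want.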
Merging multiple data frames causing duplicate column names
You can use pd.concat with keys to tag each frame's columns, then flatten the column names:
s = pd.concat([x.set_index('key') for x in df_list],axis = 1,keys=range(len(df_list)))
s.columns = s.columns.map('{0[1]}_{0[0]}'.format)
s = s.reset_index()
s
Out[236]:
  key   value_0   value_1   value_2   value_3
0   A -1.957968       NaN -0.852135 -0.976960
1   B  1.545932 -0.276838       NaN  0.197615
2   C -2.149727       NaN -0.364382  0.349993
3   D  0.524990 -0.476655       NaN       NaN
4   E       NaN -2.135870  0.798782       NaN
5   F       NaN  1.456544 -0.255705  0.447279
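The snippet above relies on a df_list that is not shown. The sketch below fills that gap with three made-up frames so the steps can be run end to end; the values are assumptions, and it assumes each frame has a key column with no repeated values, since aligning on a duplicated index would fail.

```python
import pandas as pd

# Hypothetical input: every frame has the same column names, key and value.
df_list = [
    pd.DataFrame({"key": ["A", "B"], "value": [1.0, 2.0]}),
    pd.DataFrame({"key": ["B", "C"], "value": [3.0, 4.0]}),
    pd.DataFrame({"key": ["A", "C"], "value": [5.0, 6.0]}),
]

# Align the frames side by side on key; keys= adds an outer column level 0, 1, 2.
s = pd.concat(
    [x.set_index("key") for x in df_list],
    axis=1,
    keys=range(len(df_list)),
)

# Each column is a tuple such as (0, 'value'); rename it to 'value_0'.
s.columns = s.columns.map("{0[1]}_{0[0]}".format)
s = s.reset_index()

print(s)
```

Keys missing from a given frame show up as NaN in that frame's column, which is the same pattern as in the output above.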
Merge data.frames with duplicates
First define a function, run.seq, which provides sequence numbers for duplicates, since it appears from the output that what is desired is that the ith duplicate of each name in each component of the merge be associated. Then create a list of the data frames and add a run.seq column to each component. Finally use Reduce to merge them all.
run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))
L <- list(df1, df2, df3)
L2 <- lapply(L, function(x) cbind(x, run.seq = run.seq(x$names)))
out <- Reduce(function(...) merge(..., all = TRUE), L2)[-2]
The last line gives:
> out
  names data1 data2 data3
1     a     1     1    NA
2     b     2    NA    NA
3     c     3     4     1
4     c     4     5    NA
5     d     5     6    NA
6     e    NA     2     2
7     e    NA     3    NA
EDIT: Revised run.seq so that input need not be sorted.
How to merge two dataframes and retain only unique data from initial dataframes without duplicates?
If we can assume the two dfs can be perfectly aligned (i.e. each has the same number of rows and the same combinations of factor levels), then your final suggestion should work. To make it more robust, you could first check that the dfs align correctly with all.equal(), and handle failures however you prefer.
library(dplyr)
## Create dfs
df1 <- data.frame(f1 = factor(rep("A", 4)),
                  f2 = factor(c("Y", "Y", "Z", "Z")),
                  num1 = c(1,2,3,4))
df2 <- data.frame(f1 = factor(rep("A", 4)),
                  f2 = factor(c("Z", "Z", "Y", "Y")),
                  num2 = c(15,16,17,18))
# Align dfs
df1 <- df1 %>% arrange(f1, f2, num1)
df2 <- df2 %>% arrange(f1, f2, num2)
# Check that dfs are correctly aligned
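# all.equal() returns TRUE or a character description of the differences,
# so wrap it in isTRUE() before negating
if (!isTRUE(all.equal(df1[, c('f1','f2')], df2[, c('f1','f2')]))) {
  print("Failed to align.")
}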
# Copy column across
df1$num2 <- df2$num2
df1
  f1 f2 num1 num2
1  A  Y    1   17
2  A  Y    2   18
3  A  Z    3   15
4  A  Z    4   16