pandas three-way joining multiple dataframes on columns
Zero's answer is basically a reduce
operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, ..., dfN]
Assuming they have a common column, like name
in your example, I'd do the following:
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
Python: pandas merge multiple dataframes
Below, is the most clean, comprehensible way of merging multiple dataframe if complex queries aren't involved.
Just simply merge with DATE as the index and merge using OUTER method (to get all the data).
import pandas as pd
from functools import reduce
df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')
Now, basically load all the files you have as data frame into a list. And, then merge the files using merge
or reduce
function.
# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]
Note: you can add as many data-frames inside the above list. This is the good part about this method. No complex queries involved.
To keep the values that belong to the same date you need to merge it on the DATE
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames)
# if you want to fill the values that don't exist in the lines of merged dataframe simply fill with required strings as
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames).fillna('void')
- Now, the output will the values from the same date on the same lines.
- You can fill the non existing data from different frames for different columns using fillna().
Then write the merged data to the csv file if desired.
pd.DataFrame.to_csv(df_merged, 'merged.txt', sep=',', na_rep='.', index=False)
This should give you
DATE VALUE1 VALUE2 VALUE3 ....
is There any methods to merge multiple dataframes of different templates
This is an option, you can merge the dataframes and then drop the useless columns from the total dataframe.
df_total = pd.concat([df1, df2, df3, df4], axis=0)
df_total.drop(['Value2', 'Value3'], axis=1)
Cross join of three dataframes
You can merge in two steps. For example for March:
tmp = pd.merge(january_df, february_df, on='ID')
final_df = pd.merge(tmp, march_df, on='ID', how='right')[['January', 'February', 'March', 'Product_no', 'Label', 'ID']].fillna(0)
print(final_df)
Prints:
January February March Product_no Label ID
0 1.0 2.0 5 T1 Towel 1005
1 0.0 0.0 1 E1 Earring 1006
2 3.0 4.0 1 S1 Shoe 1002
3 1.0 1.0 1 B3 Bag 1004
4 0.0 0.0 1 L1 Lotion 1007
5 4.0 3.0 1 B1 Ball 1000
Merge multiple dataframes based on a common column
Use merge
and reduce
In [86]: from functools import reduce
In [87]: reduce(lambda x,y: pd.merge(x,y, on='Col1', how='outer'), [df1, df2, df3])
Out[87]:
Col1 Col2 Col3 Col4 Col5 Col6 Col7
0 data1 3 4 7.0 4.0 NaN NaN
1 data2 4 3 6.0 9.0 5.0 8.0
2 data3 2 3 1.0 4.0 2.0 7.0
3 data4 2 4 NaN NaN NaN NaN
4 data5 1 4 NaN NaN 5.0 3.0
Details
In [88]: df1
Out[88]:
Col1 Col2 Col3
0 data1 3 4
1 data2 4 3
2 data3 2 3
3 data4 2 4
4 data5 1 4
In [89]: df2
Out[89]:
Col1 Col4 Col5
0 data1 7 4
1 data2 6 9
2 data3 1 4
In [90]: df3
Out[90]:
Col1 Col6 Col7
0 data2 5 8
1 data3 2 7
2 data5 5 3
pandas merged multiple dataframes of different size and columns
Assuming unique NAME values per dataframe, I would use pandas.concat
here:
(pd
.concat(test_df).rename_axis('index')
.groupby(['NAME', 'index']).first()
.reset_index('NAME')
)
output:
NAME col1 col2 col3 col4
index
0 A 1.0 1.0 1.0 1.0
1 A 1.0 1.0 1.0 1.0
0 B 2.0 2.0 2.0 2.0
1 B 2.0 2.0 2.0 2.0
Merging multiple dataframes on column
use pd.concat
dflist = [df1, df2]
keys = ["%d" % i for i in range(1, len(dflist) + 1)]
merged = pd.concat([df.set_index('name') for df in dflist], axis=1, keys=keys)
merged.columns = merged.swaplevel(0, 1, 1).columns.to_series().str.join('_')
merged
Or
merged.reset_index()
Joining or merging multiple columns within one dataframe and keeping all data
Here is a way to do what you've asked:
df = df[['Position1', 'Count1']].rename(columns={'Position1':'Positions'}).join(
df[['Position2', 'Count2']].set_index('Position2'), on='Positions', how='outer').join(
df[['Position3', 'Count3']].set_index('Position3'), on='Positions', how='outer').sort_values(
by=['Positions']).reset_index(drop=True)
Output:
Positions Count1 Count2 Count3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
Explanation:
- Use
join
first onPosition1, Count1
andPosition2, Count2
(withPosition1
renamed asPositions
) then on that join result andPosition3, Count3
. - Sort by
Positions
and usereset_index
to create a new integer range index (ascending with no gaps).
Related Topics
How to Find All Matches to a Regular Expression in Python
How to Get All Subsets of a Set - Powerset
Python Rounding Error With Float Numbers
How to Properly Ignore Exceptions
How to Save and Load Cookies Using Python + Selenium Webdriver
How to Replace Nans by Preceding or Next Values in Pandas Dataframe
How to Understand Closure in a Lambda
How to Create a Daemon in Python
Django Template How to Look Up a Dictionary Value With a Variable
Do Regular Expressions from the Re Module Support Word Boundaries (\B)
Is There a Simple, Elegant Way to Define Singletons
How to Stop More Than 1 Bullet Firing At Once
Using Both Python 2.X and Python 3.X in Ipython Notebook
Python Exit Commands - Why So Many and When Should Each Be Used
Maximum and Minimum Values For Ints