Difference between pd.concat() and pd.merge() and why do I get wrong output?
I believe you need merge with left_index=True and right_index=True, because you are matching on the DatetimeIndex in both DataFrames:
#convert to DatetimeIndex
df2.index = pd.to_datetime(df2.index)
df = pd.merge(df1, df2, left_index=True, right_index=True)
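As a minimal sketch of why merge on the index gives the expected result where concat does not: the frames below are hypothetical stand-ins (the column names "a" and "b" and the dates are assumptions, not taken from the question). merge lines rows up by matching index values, while concat with its default axis=0 simply stacks the rows.

```python
import pandas as pd

# Hypothetical data: df1 already has a DatetimeIndex,
# df2 holds the same kind of dates as plain strings.
df1 = pd.DataFrame(
    {"a": [1, 2, 3]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)
df2 = pd.DataFrame({"b": [10, 20]}, index=["2020-01-02", "2020-01-03"])

# Convert to DatetimeIndex so both indexes have the same type.
df2.index = pd.to_datetime(df2.index)

# Index-on-index merge (inner join by default): only dates present in both.
merged = pd.merge(df1, df2, left_index=True, right_index=True)

# concat along axis=0 stacks the rows and fills the missing columns with NaN.
stacked = pd.concat([df1, df2])

print(merged)
print(stacked)
```

If unmatched dates should be kept as well, how='outer' can be passed to merge.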
Pandas DataFrame concat vs append
What you are doing with append and concat is almost equivalent. The difference is the empty DataFrame. For some reason this causes a big slowdown; I'm not sure exactly why and will have to look into it at some point. Below is a recreation of basically what you did.
I almost always use concat (though in this case they are equivalent, except for the empty frame);
if you don't use the empty frame they will be the same speed.
In [17]: df1 = pd.DataFrame(dict(A = range(10000)),index=pd.date_range('20130101',periods=10000,freq='s'))
In [18]: df1
Out[18]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2013-01-01 00:00:00 to 2013-01-01 02:46:39
Freq: S
Data columns (total 1 columns):
A 10000 non-null values
dtypes: int64(1)
In [19]: df4 = pd.DataFrame()
The concat
In [20]: %timeit pd.concat([df1,df2,df3])
1000 loops, best of 3: 270 us per loop
This is the equivalent of your append
In [21]: %timeit pd.concat([df4,df1,df2,df3])
10 loops, best of 3: 56.8 ms per loop
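The practical upshot can be sketched as follows: collect the frames in a list and call concat once, rather than starting from an empty DataFrame. The small frames here are hypothetical stand-ins for df1, df2 and df3 (the answer above does not show how df2 and df3 were built), and no timings are claimed since those depend on the pandas version.

```python
import pandas as pd

# Hypothetical stand-ins for the frames used in the timings above.
idx = pd.date_range("2013-01-01", periods=6, freq="D")
df1 = pd.DataFrame({"A": range(0, 2)}, index=idx[0:2])
df2 = pd.DataFrame({"A": range(2, 4)}, index=idx[2:4])
df3 = pd.DataFrame({"A": range(4, 6)}, index=idx[4:6])

# Preferred pattern: gather the pieces in a list, then concatenate once.
frames = [df1, df2, df3]
combined = pd.concat(frames)

# The pattern from the question: an empty frame leads the list.
# The values come out the same; only the work done internally differs.
with_empty = pd.concat([pd.DataFrame(), df1, df2, df3])

print(combined)
print(with_empty)
```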
What is the difference between 'pd.concat([df1, df2], join='outer')', 'df1.combine_first(df2)', 'pd.merge(df1, df2)' and 'df1.join(df2, how='outer')'?
concat: appends one dataframe to another along the given axis (default axis=0, meaning concat along the index, i.e. put the other dataframe below the given dataframe). Data are aligned on the other axis (i.e. with the default setting, columns are aligned). This is why we get NaNs in the non-matching columns 'A' and 'E'.
combine_first: replaces NaNs in a dataframe with existing values from the other dataframe, where rows and columns are pooled (union of rows and cols from both dataframes). In your example, there are no missing values to begin with, but they emerge from the union operation because your indices have no common entries. The order of the rows results from the sorted combined index (df1.B and df2.B). So if there are no missing values in your dataframe, you wouldn't normally use combine_first.
merge: a database-style combination of two dataframes that offers more options on how to merge (left, right, specific columns) than concat. In your example, the data of the result are identical, but there's a difference in the index between concat and merge: when merging on columns, the dataframe indices will be ignored and a new index will be created.
join: merges df1 and df2 on the index of df1 and the given column (in the example 'B') of df2. In your example this is the same as pd.merge(df1, df2, left_on=df1.index, right_on='B', how='outer', suffixes=('_left', '_right')). As there's no match between the index of df1 and column 'B' of df2, there will be a lot of NaNs due to the outer join.
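The first three behaviours can be seen side by side in a small sketch. The frames below are hypothetical (the question's actual data is not reproduced here); they only share column 'B' and have disjoint indices, which is the situation described above.

```python
import pandas as pd

# Hypothetical frames: one shared column ('B'), disjoint indices.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[10, 11])
df2 = pd.DataFrame({"B": [5, 6], "E": [7, 8]}, index=[12, 13])

# concat: rows stacked, columns aligned, NaN where a column is missing.
stacked = pd.concat([df1, df2], join="outer")

# combine_first: union of rows and columns; df1's values take priority
# and gaps are filled from df2 where df2 has a value.
pooled = df1.combine_first(df2)

# merge on a column: the original indices are dropped and a new one is created.
merged = pd.merge(df1, df2, on="B", how="outer")

print(stacked)
print(pooled)
print(merged)
```

Note how stacked and pooled keep the original index labels, while merged gets a fresh integer index.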
What is the difference between join and merge in Pandas?
I always use join on indices:
import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')
     val_l  val_r
key
foo      1      4
bar      2      5
The same functionality can be had by using merge on the columns, as follows:
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on=('key'), suffixes=('_l', '_r'))
   key  val_l  val_r
0  foo      1      4
1  bar      2      5
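To check that the two approaches really carry the same data, one can run both on the frames from the answer and compare the results once the join output has its index moved back into a column. This is only a sketch for this small example; the variable names joined and merged are introduced here for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "val": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "val": [4, 5]})

# join works on the index, so 'key' is moved into the index first.
joined = left.set_index("key").join(
    right.set_index("key"), lsuffix="_l", rsuffix="_r"
)

# merge works on the column directly and returns a fresh integer index.
merged = left.merge(right, on="key", suffixes=("_l", "_r"))

# After reset_index the join result has the same columns and rows as the merge result.
print(joined.reset_index())
print(merged)
```

The main visible difference is the index of the result: join keeps 'key' as the index, merge keeps it as a regular column.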
Pandas Dataframe concat: is it correct to understand append as a simplified version of concat with few kwargs and can only operate on axis=0
Yes, DataFrame.append simply calls pd.concat with the default arguments axis=0, join='outer', which you can see in the return statement. It also has limited functionality, so you can't use it to construct a hierarchical index.
DataFrame.append source
from pandas.core.reshape.concat import concat
if isinstance(other, (list, tuple)):
    to_concat = [self] + other
else:
    to_concat = [self, other]
return concat(to_concat, ignore_index=ignore_index,
              verify_integrity=verify_integrity,
              sort=sort)
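As an illustration of what concat offers beyond the append behaviour, the sketch below builds a hierarchical index with the keys argument. The frames and key labels are hypothetical. Note also that DataFrame.append was removed in pandas 2.0, so on current versions concat is the way to express an append in any case.

```python
import pandas as pd

# Hypothetical frames for illustration.
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

# What append did: stack rows along axis=0, optionally renumbering the index.
plain = pd.concat([df1, df2], ignore_index=True)

# What append could not do: label each piece, giving a two-level index.
keyed = pd.concat([df1, df2], keys=["first", "second"])

print(plain)
print(keyed)
print(keyed.loc["second"])
```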
Pandas - concat two df along non-index axis, merge rows that have same value on non-index axis
It's just merge:
pd.merge(df_a.reset_index(),
         df_b.reset_index(),
         on='seconds_since_start',
         how='outer')
Output:
valid_a value_a seconds_since_start valid_b value_b
-- ------------------- --------- --------------------- ------------------- ---------
0 2000-02-15 14:47:00 12.3 0 NaT nan
1 2000-02-15 15:59:00 20.6 30 2019-12-24 15:54:00 18.7
2 2000-02-15 16:51:00 20.3 120 NaT nan
3 2000-02-15 17:52:00 22.6 200 NaT nan
4 NaT nan 20 2019-12-24 14:54:00 12.4
5 NaT nan 90 2019-12-24 16:54:00 19.2
6 NaT nan 250 2019-12-24 17:54:00 20.8
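A self-contained version of the same idea, for readers who want to run it: the frames below are hypothetical and only mirror a few rows of the output above (the original df_a and df_b are not shown in the answer). Row order in an outer merge can differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

# Hypothetical inputs mirroring a few rows of the output above.
df_a = pd.DataFrame(
    {"value_a": [12.3, 20.6], "seconds_since_start": [0, 30]},
    index=pd.DatetimeIndex(
        ["2000-02-15 14:47", "2000-02-15 15:59"], name="valid_a"
    ),
)
df_b = pd.DataFrame(
    {"value_b": [12.4, 18.7], "seconds_since_start": [20, 30]},
    index=pd.DatetimeIndex(
        ["2019-12-24 14:54", "2019-12-24 15:54"], name="valid_b"
    ),
)

# reset_index turns each DatetimeIndex into a regular column, so the
# timestamps survive the merge on the shared column.
merged = pd.merge(
    df_a.reset_index(),
    df_b.reset_index(),
    on="seconds_since_start",
    how="outer",
)

print(merged)
```

Rows whose seconds_since_start appears in only one frame get NaT or NaN in the other frame's columns.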