Difference(s) Between merge() and concat() in Pandas

Difference between pd.concat() and pd.merge() and why do I get wrong output?

I believe you need merge with left_index=True and right_index=True, because you want to match on the DatetimeIndex of both DataFrames:

#convert to DatetimeIndex
df2.index = pd.to_datetime(df2.index)
df = pd.merge(df1, df2, left_index=True, right_index=True)
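For reference, here is a minimal self-contained sketch of the same idea. The frames and column names ('a', 'b') are made up for illustration, since the original data is not shown; the only assumption carried over from the answer is that df2 has its dates stored as strings.

```python
import pandas as pd

# Hypothetical data: df1 already has a DatetimeIndex,
# df2 has the same kind of dates stored as plain strings.
df1 = pd.DataFrame(
    {"a": [1, 2, 3]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)
df2 = pd.DataFrame(
    {"b": [10.0, 20.0]},
    index=["2020-01-02", "2020-01-03"],
)

# Convert the string index so both frames share the same index type.
df2.index = pd.to_datetime(df2.index)

# Match rows by index on both sides (merge defaults to an inner join).
df = pd.merge(df1, df2, left_index=True, right_index=True)
print(df)
```

Only the dates present in both indexes survive the default inner join; pass how='outer' if you want to keep all rows.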

Pandas DataFrame concat vs append

What you are doing with append and concat is almost equivalent. The difference is the empty DataFrame, which for some reason causes a big slowdown; I am not sure exactly why and will have to look into it at some point. Below is a recreation of basically what you did.

I almost always use concat (though in this case they are equivalent, except for the empty frame);
if you don't use the empty frame they will be the same speed.

In [17]: df1 = pd.DataFrame(dict(A = range(10000)),index=pd.date_range('20130101',periods=10000,freq='s'))

In [18]: df1
Out[18]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2013-01-01 00:00:00 to 2013-01-01 02:46:39
Freq: S
Data columns (total 1 columns):
A 10000 non-null values
dtypes: int64(1)

In [19]: df4 = pd.DataFrame()

The concat

In [20]: %timeit pd.concat([df1,df2,df3])
1000 loops, best of 3: 270 us per loop

This is the equivalent of your append

In [21]: %timeit pd.concat([df4,df1,df2,df3])
10 loops, best of 3: 56.8 ms per loop
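One way to sidestep the empty frame altogether is to collect the pieces in a plain Python list and call concat once at the end. This is only a sketch: the small chunks below are hypothetical stand-ins for df1, df2 and df3, and no timings are claimed for it.

```python
import pandas as pd

# Hypothetical chunks standing in for df1, df2, df3 in the timings above.
chunks = [pd.DataFrame({"A": range(i * 3, i * 3 + 3)}) for i in range(3)]

# Collect the pieces in a list instead of growing an empty DataFrame.
frames = []
for chunk in chunks:
    frames.append(chunk)  # list.append, not DataFrame.append

# Concatenate once at the end; ignore_index=True renumbers the rows 0..n-1.
result = pd.concat(frames, ignore_index=True)
print(len(result))
```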

What is the difference between 'pd.concat([df1, df2], join='outer')', 'df1.combine_first(df2)', 'pd.merge(df1, df2)' and 'df1.join(df2, how='outer')'?

concat: appends one dataframe to another along the given axis (default axis=0, meaning concat along the index, i.e. put the other dataframe below the given dataframe). Data are aligned on the other axis (i.e. with the default setting, the columns are aligned). This is why we get NaNs in the non-matching columns 'A' and 'E'.

combine_first: replace NaNs in dataframe by existing values in other dataframe, where rows and columns are pooled (union of rows and cols from both dataframes). In your example, there are no missing values from the beginning but they emerge due to the union operation as your indices have no common entries. The order of the rows results from the sorted combined index (df1.B and df2.B).

So if there are no missing values in your dataframe you wouldn't normally use combine_first.

merge is a database-style combination of two dataframes that offers more options on how to merge (left, right, specific columns) than concat. In your example, the data of the result are identical, but there's a difference in the index between concat and merge: when merging on columns, the dataframe indices will be ignored and a new index will be created.

join merges df1 and df2 on the index of df1 and the given column (in the example 'B') of df2. In your example this is the same as pd.merge(df1, df2, left_on=df1.index, right_on='B', how='outer', suffixes=('_left', '_right')). As there's no match between the index of df1 and column 'B' of df2 there will be a lot of NaNs due to the outer join.
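To see the four operations side by side, here is a small sketch. The two frames are hypothetical (the data from the question is not reproduced here): they share only column 'B' and have no index labels in common, which roughly mirrors the situation described above. The suffixes passed to join are needed here because both frames have a column 'B'.

```python
import pandas as pd

# Hypothetical frames: 'B' is the only shared column, indexes do not overlap.
df1 = pd.DataFrame({"A": [1, 2], "B": [10, 20]}, index=["a", "b"])
df2 = pd.DataFrame({"B": [30, 40], "E": [5, 6]}, index=["c", "d"])

# concat: stack rows, align columns; non-matching columns get NaN.
concatenated = pd.concat([df1, df2], join="outer")

# combine_first: union of rows and columns, df1 values take priority.
combined = df1.combine_first(df2)

# merge: database-style join on the shared column 'B'; the original
# index labels are discarded and a fresh integer index is created.
merged = pd.merge(df1, df2, how="outer")

# join: aligns on the index; suffixes disambiguate the two 'B' columns.
joined = df1.join(df2, how="outer", lsuffix="_left", rsuffix="_right")

for name, frame in [("concat", concatenated), ("combine_first", combined),
                    ("merge", merged), ("join", joined)]:
    print(name, frame.shape, list(frame.index))
```

Note how only merge ends up with a new 0..3 index, while the other three keep the labels 'a' to 'd'.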

What is the difference between join and merge in Pandas?

I always use join on indices:

import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')

     val_l  val_r
key
foo      1      4
bar      2      5

The same functionality can be had by using merge on the columns, as follows:

left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on=('key'), suffixes=('_l', '_r'))

   key  val_l  val_r
0  foo      1      4
1  bar      2      5
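A quick way to convince yourself that the two approaches agree is to compare the results directly. This sketch reuses the frames from above; the variable names joined, merged and same are just for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "val": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "val": [4, 5]})

# Index-based join: move 'key' into the index on both sides first.
joined = left.set_index("key").join(
    right.set_index("key"), lsuffix="_l", rsuffix="_r"
)

# Column-based merge on 'key'.
merged = left.merge(right, on="key", suffixes=("_l", "_r"))

# After turning the index back into a column, the two results line up.
same = joined.reset_index().equals(merged)
print(same)
```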

Pandas Dataframe concat: is it correct to understand append as a simplified version of concat with few kwargs and can only operate on axis=0

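Yes, DataFrame.append simply calls pd.concat with the default arguments axis=0, join='outer', which you can see in the return statement. It also has limited functionality, so you can't use it to construct a hierarchical index. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions pd.concat is the only option.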

DataFrame.append source

from pandas.core.reshape.concat import concat
if isinstance(other, (list, tuple)):
    to_concat = [self] + other
else:
    to_concat = [self, other]
return concat(to_concat, ignore_index=ignore_index,
              verify_integrity=verify_integrity,
              sort=sort)
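As an example of something concat can do that append could not, the keys argument builds a hierarchical index from the pieces. The frames and key labels below are hypothetical.

```python
import pandas as pd

# Two hypothetical frames with the same column.
df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

# keys= adds an outer index level that records which frame each row came from.
stacked = pd.concat([df_a, df_b], keys=["a", "b"])

print(stacked.index.nlevels)
# Select all rows that originated from df_b via the outer level.
print(stacked.loc["b"])
```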

Pandas - concat two df along non-index axis, merge rows that have same value on non-index axis

It's just merge:

pd.merge(df_a.reset_index(),
         df_b.reset_index(),
         on='seconds_since_start',
         how='outer')

Output:

    valid_a              value_a  seconds_since_start  valid_b              value_b
--  -------------------  -------  -------------------  -------------------  -------
 0  2000-02-15 14:47:00     12.3                    0  NaT                      nan
 1  2000-02-15 15:59:00     20.6                   30  2019-12-24 15:54:00     18.7
 2  2000-02-15 16:51:00     20.3                  120  NaT                      nan
 3  2000-02-15 17:52:00     22.6                  200  NaT                      nan
 4  NaT                      nan                   20  2019-12-24 14:54:00     12.4
 5  NaT                      nan                   90  2019-12-24 16:54:00     19.2
 6  NaT                      nan                  250  2019-12-24 17:54:00     20.8

