Pandas | Merge Rows With Same Id

Use:

  • DataFrame.groupby - Group a DataFrame using a mapper or by a Series of columns.
  • GroupBy.last - Compute the last entry of each column within each group, skipping missing values by default.
  • DataFrame.replace - Replace values given in to_replace with value.

Example:

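# Treat empty strings as missing so that last() skips them
df = df.replace('', np.nan)
# For each id, keep the last non-missing value of every column
df1 = df.groupby('id', as_index=False, sort=False).last()
print(df1)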

   id firstname lastname              email  updatedate
0  A1     wendy    smith     smith@mail.com  2019-02-03
1  A2     harry     lynn  harylynn@mail.com  2019-03-12
2  A3     tinna   dickey     tinna@mail.com  2013-06-12
3  A4       Tom      Lee       Tom@mail.com  2012-06-12
4  A5      Ella      NaN      Ella@mail.com  2019-07-12
5  A6       Ben     Lang       Ben@mail.com  2019-03-12
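
For a version that can be run end to end, here is a minimal sketch. The input frame is hypothetical (the original question's data is not shown here); it only assumes that the same id appears on several rows, each holding part of the record, with empty strings for the missing parts.

```python
import numpy as np
import pandas as pd

# Hypothetical input: two partial records for A1, one complete record for A2.
df = pd.DataFrame({
    "id": ["A1", "A1", "A2"],
    "firstname": ["wendy", "", "harry"],
    "lastname": ["", "smith", "lynn"],
    "updatedate": ["2019-01-01", "2019-02-03", "2019-03-12"],
})

# Empty strings become NaN so that GroupBy.last() ignores them.
df = df.replace("", np.nan)

# One row per id, holding the last non-missing value of each column.
merged = df.groupby("id", as_index=False, sort=False).last()
print(merged)
```

Use first() instead of last() if the earliest non-missing value should win.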

Pandas Merge and Complete rows with same id

If each group has at most one non-empty value per column, use:

df = df.replace('',np.nan).groupby('ID', as_index=False).first().fillna('')

If a group can hold several values per column and you need the unique values in their original order, use a lambda function:

print (df)
ID LU MA ME JE VE SA DI
0 201 B C B
1 201 C C C B C


f = lambda x: ','.join(dict.fromkeys(x.dropna()).keys())
df = df.replace('',np.nan).groupby('ID', as_index=False).agg(f)
print (df)
ID LU MA ME JE VE SA DI
0 201 B,C C C B C
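
The two variants can be compared side by side on a small frame. This is only a sketch: the data below is made up (the column positions in the printed frames above are ambiguous), and join_unique is just a named version of the lambda f.

```python
import numpy as np
import pandas as pd

# Hypothetical data: LU holds two different values, MA has one empty cell.
df = pd.DataFrame({
    "ID": [201, 201],
    "LU": ["B", "C"],
    "MA": ["", "C"],
    "ME": ["C", "C"],
})
cleaned = df.replace("", np.nan)

# Variant 1: keep the first non-missing value per column.
first_out = cleaned.groupby("ID", as_index=False).first().fillna("")

# Variant 2: join the unique non-missing values, preserving their order.
def join_unique(values):
    # dict.fromkeys removes duplicates while keeping insertion order
    return ",".join(dict.fromkeys(values.dropna()))

joined = cleaned.groupby("ID", as_index=False).agg(join_unique)

print(first_out)
print(joined)
```

Variant 1 silently discards the second value in LU, which is why the join is needed when a group can contain more than one distinct value.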

Concatenate rows of pandas DataFrame with same id

You can use groupby with the agg method and the tolist method of a pandas Series:

In [762]: df.groupby('id').agg(lambda x: x.tolist())
Out[762]:
         A       B
id
0   [1, 2]  [1, 1]
1   [3, 0]  [2, 2]

The result of groupby followed by agg is a DataFrame, as you want:

In [763]: df1 = df.groupby('id').agg(lambda x: x.tolist())

In [764]: type(df1)
Out[764]: pandas.core.frame.DataFrame

To match your expected result exactly, you can additionally call reset_index or pass as_index=False to groupby:

In [768]: df.groupby('id', as_index=False).agg(lambda x: x.tolist())
Out[768]:
   id       A       B
0   0  [1, 2]  [1, 1]
1   1  [3, 0]  [2, 2]

In [771]: df1.reset_index()
Out[771]:
   id       A       B
0   0  [1, 2]  [1, 1]
1   1  [3, 0]  [2, 2]
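
A self-contained sketch of the same session follows. The input frame is not shown in the answer, so the one below is reconstructed from the printed output and should be read as an assumption.

```python
import pandas as pd

# Input reconstructed from the output above (assumed, not from the question).
df = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "A": [1, 2, 3, 0],
    "B": [1, 1, 2, 2],
})

# Collect every column's values into one list per id.
lists = df.groupby("id", as_index=False).agg(lambda x: x.tolist())
print(lists)
```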

Merge rows with same index and prioritize column values

If each id is guaranteed to have at most one value per column, then the data (or rather pd.DataFrame(data)) can easily be reshaped like this:

>>> ser = data.set_index('id').stack()
>>> ser
id
id3  Col_A    11.0
     Col_B     5.0
id6  Col_A     3.0
dtype: float64

As a side note, if you unstack it again, you get a denser version of your original data with a unique index:

>>> ser.unstack()
     Col_A  Col_B
id
id3   11.0    5.0
id6    3.0    NaN

We can select the first item with a groupby rather than .unstack(), for example:

>>> ser.groupby('id').first().rename('Col_Score')
id
id3    11.0
id6     3.0
Name: Col_Score, dtype: float64

You can then call .reset_index() on the result to get a DataFrame instead of a Series.
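
Put together, the steps could look like the sketch below. The input frame is hypothetical: it assumes id3 is spread over two rows and that Col_A should take priority over Col_B, which follows from the column order after stack().

```python
import numpy as np
import pandas as pd

# Hypothetical input: id3 is split across two rows, id6 has only Col_A.
data = pd.DataFrame({
    "id": ["id3", "id3", "id6"],
    "Col_A": [11.0, np.nan, 3.0],
    "Col_B": [np.nan, 5.0, np.nan],
})

# Long format: one entry per (id, column) pair.
ser = data.set_index("id").stack()

# first() skips missing values, so each id gets its first available score,
# with Col_A preferred over Col_B when both sit on the same row.
result = ser.groupby(level="id").first().rename("Col_Score").reset_index()
print(result)
```

Grouping with level="id" is equivalent here to passing 'id' directly; it just makes explicit that the key is an index level.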

How can I "join" rows with the same ID in pandas and add data as new columns

Let's unstack() by tracking position using groupby()+cumcount():

df['s']=df.groupby(['name','reference']).cumcount()+1
df=df.set_index(['s','name','reference']).unstack(0)
df.columns=[f"{x}{y}" for x,y in df.columns]
df=df.reset_index()

Output of df:

   name  reference  item1  item2  item3  item4  amount1  amount2  amount3  amount4
0  jane       9876  chair   pole    NaN    NaN     15.0     30.0      NaN      NaN
1  john       1234  chair  table  table   pole     40.0     10.0     20.0     10.0
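
Here is the same recipe as a runnable sketch. The long-format input is an assumption, reconstructed from the wide output above; the row order within each name is what determines the item1, item2, ... numbering.

```python
import pandas as pd

# Long-format input reconstructed from the output above (assumed).
df = pd.DataFrame({
    "name": ["john", "john", "jane", "john", "jane", "john"],
    "reference": [1234, 1234, 9876, 1234, 9876, 1234],
    "item": ["chair", "table", "chair", "table", "pole", "pole"],
    "amount": [40, 10, 15, 20, 30, 10],
})

# Position of each row within its (name, reference) group, starting at 1.
df["s"] = df.groupby(["name", "reference"]).cumcount() + 1

# Move the position into the columns, then flatten ('item', 1) -> 'item1'.
wide = df.set_index(["s", "name", "reference"]).unstack(0)
wide.columns = [f"{col}{pos}" for col, pos in wide.columns]
wide = wide.reset_index()
print(wide)
```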

Pandas DataFrame: Merge rows with same id

I was looking for a way to do it without the "apply" function, using pandas built-in functions for a better runtime.

Runtime comparison with and without the apply function, on this dataset:

data_temp1 = {
    'timestamp': np.concatenate([np.arange(0, 30000, 1)] * 2),
    'code': [6, 6, 5] * 20000,
    'code_2': [6, 6, 5] * 20000,
    'q1': [0.134555, 0.984554565478545, 54] * 20000,
    'q2': [9.7079931640624864, None, 43] * 20000,
    'q3': [10.25475688648455, None, 54] * 20000,
}
df = pd.DataFrame(data_temp1)

Solution using apply, similar to @Andrej Kesely's example:

  • 7.21 s ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My solution, without apply:

  • 98.4 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

My solution:
(It fills the empty cells only if there are any, so it is correct for both of your cases.)

  • Sort the rows by the number of empty cells
  • Fill each row in each group from the row below it (this is fine because the rows were sorted first)
  • Remove the rows that still have empty cells

columns_to_groupby = ["timestamp", "code"]
# Sort the rows in ascending order of None counts (most complete rows first)
df = df.iloc[df.isnull().sum(1).sort_values(ascending=True).index].set_index(columns_to_groupby)
# Group by the index (timestamp, code), backfill the None cells where possible,
# then drop the incomplete rows (the ones the values were taken from)
df.groupby(df.index).bfill().dropna()
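
The three steps above can be traced on a tiny frame. This is a sketch rather than the benchmarked code: the data is made up, it selects rows with .loc and a stable sort, and it groups by index level, which are small variations on the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each (timestamp, code) pair is split over two rows.
df = pd.DataFrame({
    "timestamp": [0, 0, 1, 1],
    "code": [6, 6, 5, 5],
    "q1": [0.1, np.nan, np.nan, 2.0],
    "q2": [np.nan, 9.7, 4.3, np.nan],
})
keys = ["timestamp", "code"]

# Step 1: order the rows by their number of missing cells (fewest first).
order = df.isnull().sum(axis=1).sort_values(kind="stable").index
ordered = df.loc[order].set_index(keys)

# Step 2: within each group, fill missing cells from the rows below.
filled = ordered.groupby(level=keys).bfill()

# Step 3: drop the rows that are still incomplete.
merged = filled.dropna().reset_index()
print(merged)
```

Only the top row of each group is guaranteed to collect every available value, so any group that still has a gap after the backfill disappears in step 3.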

Examples (the input and result images from the original post are not available):

  • Example 1
  • Example 2 (with a row that has no empty cells)

As you can see, the result is the same for both of them.

Merge multiple rows in pandas Dataframe based on multiple column values

Does this do what you want?

df.groupby(["id", "date", "freq", "year"]).first().reset_index()

Output:

          id      date freq  year         c1     c2  c3
0  C35600010  20080922    A  2004  d20040331  s2003  s3
1  C35600010  20080922    Q  2004       None   None  s3
2  C35600010  20080923    A  2004       None   None  s3
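
To try it out, here is a minimal sketch with an input frame that would produce the output above. The split of values across the duplicated rows is an assumption, since the question's data is not reproduced here.

```python
import pandas as pd

# Hypothetical input: the (20080922, "A") record is spread over two rows.
df = pd.DataFrame({
    "id": ["C35600010"] * 4,
    "date": [20080922, 20080922, 20080922, 20080923],
    "freq": ["A", "A", "Q", "A"],
    "year": [2004, 2004, 2004, 2004],
    "c1": ["d20040331", None, None, None],
    "c2": [None, "s2003", None, None],
    "c3": [None, "s3", "s3", "s3"],
})

# Rows sharing all four key columns collapse into one; first() keeps the
# first non-missing value of c1, c2 and c3 within each group.
merged = df.groupby(["id", "date", "freq", "year"]).first().reset_index()
print(merged)
```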

