Pandas | Merge Rows With Same Id

Pandas | merge rows with same id

Use

DataFrame.groupby - Group DataFrame or Series using a mapper or by a Series of columns.
.groupby.GroupBy.last - Compute last of group values.
DataFrame.replace - Replace values given in to_replace with value.

Ex.

df = df.replace('',np.nan, regex=True)
df1 = df.groupby('id',as_index=False,sort=False).last()
print(df1)

   id firstname lastname              email  updatedate
0  A1     wendy    smith     smith@mail.com  2019-02-03
1  A2     harry     lynn  harylynn@mail.com  2019-03-12
2  A3     tinna   dickey     tinna@mail.com  2013-06-12
3  A4       Tom      Lee       Tom@mail.com  2012-06-12
4  A5      Ella      NaN      Ella@mail.com  2019-07-12
5  A6       Ben     Lang       Ben@mail.com  2019-03-12

Pandas Merge and Complete rows with same id

If there is only one non empty value per groups use:

df = df.replace('',np.nan).groupby('ID', as_index=False).first().fillna('')

If possible multiple values and need unique values in original order use lambda function:

print (df)
    ID LU MA ME JE VE SA DI
0  201  B     C  B         
1  201  C  C  C  B  C    


f = lambda x: ','.join(dict.fromkeys(x.dropna()).keys())
df = df.replace('',np.nan).groupby('ID', as_index=False).agg(f)
print (df)
    ID   LU MA ME JE VE SA DI
0  201  B,C  C  C  B  C

Concatenate rows of pandas DataFrame with same id

You could use groupby for that with groupby agg method and tolist method of Pandas Series:

In [762]: df.groupby('id').agg(lambda x: x.tolist())
Out[762]: 
         A       B
id                
0   [1, 2]  [1, 1]
1   [3, 0]  [2, 2]

groupby return an Dataframe as you want:

In [763]: df1 = df.groupby('id').agg(lambda x: x.tolist())

In [764]: type(df1)
Out[764]: pandas.core.frame.DataFrame

To exactly match your expected result you could additionally do reset_index or use as_index=False in groupby:

In [768]: df.groupby('id', as_index=False).agg(lambda x: x.tolist())
Out[768]: 
   id       A       B
0   0  [1, 2]  [1, 1]
1   1  [3, 0]  [2, 2]

In [771]: df1.reset_index()
Out[771]: 
   id       A       B
0   0  [1, 2]  [1, 1]
1   1  [3, 0]  [2, 2]

Merge rows with same index and prioritize column values

If you’re guaranteed to not have duplicate columns per id, then the data (or rather pd.DataFrame(data)) can easily be reformatted as such:

>>> ser = data.set_index('id').stack()
>>> ser
id        
id3  Col_A    11.0
     Col_B     5.0
id6  Col_A     3.0
dtype: float64

As a side note, if you unstack it again, you get a more dense version o your original data with a unique index:

>>> ser.unstack()
     Col_A  Col_B
id               
id3   11.0    5.0
id6    3.0    NaN

We can select the first item with a groupby rather than .unstack(), for example:

>>> ser.groupby('id').first().rename('Col_score')
id
id3    11.0
id6     3.0
Name: Col_Score, dtype: float64

You can then .reset_index() onto that to get a dataframe instead of a series.

How can I "join" rows with the same ID in pandas and add data as new columns

Let's unstack() by tracking position using groupby()+cumcount():

df['s']=df.groupby(['name','reference']).cumcount()+1
df=df.set_index(['s','name','reference']).unstack(0)
df.columns=[f"{x}{y}" for x,y in df.columns]
df=df.reset_index()

output of df:

   name reference   item1   item2   item3   item4   amount1     amount2     amount3     amount4
0   jane    9876    chair   pole    NaN     NaN     15.0        30.0        NaN         NaN
1   john    1234    chair   table   table   pole    40.0        10.0        20.0        10.0

Pandas DataFrame: Merge rows with same id

I was looking for a way to do it without the "apply" function, for better runtime by using pandas build-in functions.

Compare runtimes with and without apply function:
dataset:

data_temp1 = {'timestamp':np.concatenate([np.arange(0,30000,1)]*2), 'code':[6,6, 5]*20000, 'code_2':[6,6, 5]*20000, 'q1':[0.134555,0.984554565478545, 54]*20000, 'q2':[9.7079931640624864,None, 43]*20000, 'q3':[10.25475688648455,None, 54]*20000} 
df = pd.DataFrame(data_temp1)

Solution by the use of apply similar to @Andrej Kesely example:

7.21 s ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution without apply by my solution:

98.4 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

My solution:
(Will fill the empty cells only if exist. So, it's right according to both of your cases).

Sort the rows by the number of empty cells
Fill each row in each group by below row (Its ok because with sort them first)
Remove rows with empty cells

columns_to_groupby = ["timestamp", "code"]
# Sort rows of a dataframe in descending order of None counts
df = df.iloc[df.isnull().sum(1).sort_values(ascending=True).index].set_index(columns_to_groupby)
# group by timestamp column, fill the None cells if exists, delete the incomplete rows (from which we filled in the others)
df.groupby(df.index).bfill().dropna()

Examples:

Example 1:

Input:
Sample Image

Result:
Sample Image

Example 2 (with row without empty cell):

Input:
Sample Image

Result:
Sample Image

As you can see, same result for both of them.

Merge multiple rows in pandas Dataframe based on multiple column values

Does this do what you want?

df.groupby(["id", "date", "freq", "year"]).first().reset_index()

Output:

          id      date freq  year         c1     c2  c3
0  C35600010  20080922    A  2004  d20040331  s2003  s3
1  C35600010  20080922    Q  2004       None   None  s3
2  C35600010  20080923    A  2004       None   None  s3

Pandas | Merge Rows With Same Id