Pandas | merge rows with same id
Use
DataFrame.groupby
- Group DataFrame or Series using a mapper or by a Series of columns..groupby.GroupBy.last
- Compute last of group values.DataFrame.replace
- Replace values given in to_replace with value.
Ex.
df = df.replace('',np.nan, regex=True)
df1 = df.groupby('id',as_index=False,sort=False).last()
print(df1)
id firstname lastname email updatedate
0 A1 wendy smith smith@mail.com 2019-02-03
1 A2 harry lynn harylynn@mail.com 2019-03-12
2 A3 tinna dickey tinna@mail.com 2013-06-12
3 A4 Tom Lee Tom@mail.com 2012-06-12
4 A5 Ella NaN Ella@mail.com 2019-07-12
5 A6 Ben Lang Ben@mail.com 2019-03-12
Pandas Merge and Complete rows with same id
If there is only one non empty value per groups use:
df = df.replace('',np.nan).groupby('ID', as_index=False).first().fillna('')
If possible multiple values and need unique values in original order use lambda function:
print (df)
ID LU MA ME JE VE SA DI
0 201 B C B
1 201 C C C B C
f = lambda x: ','.join(dict.fromkeys(x.dropna()).keys())
df = df.replace('',np.nan).groupby('ID', as_index=False).agg(f)
print (df)
ID LU MA ME JE VE SA DI
0 201 B,C C C B C
Concatenate rows of pandas DataFrame with same id
You could use groupby
for that with groupby agg
method and tolist
method of Pandas Series:
In [762]: df.groupby('id').agg(lambda x: x.tolist())
Out[762]:
A B
id
0 [1, 2] [1, 1]
1 [3, 0] [2, 2]
groupby
return an Dataframe as you want:
In [763]: df1 = df.groupby('id').agg(lambda x: x.tolist())
In [764]: type(df1)
Out[764]: pandas.core.frame.DataFrame
To exactly match your expected result you could additionally do reset_index
or use as_index=False
in groupby
:
In [768]: df.groupby('id', as_index=False).agg(lambda x: x.tolist())
Out[768]:
id A B
0 0 [1, 2] [1, 1]
1 1 [3, 0] [2, 2]
In [771]: df1.reset_index()
Out[771]:
id A B
0 0 [1, 2] [1, 1]
1 1 [3, 0] [2, 2]
Merge rows with same index and prioritize column values
If you’re guaranteed to not have duplicate columns per id, then the data
(or rather pd.DataFrame(data)
) can easily be reformatted as such:
>>> ser = data.set_index('id').stack()
>>> ser
id
id3 Col_A 11.0
Col_B 5.0
id6 Col_A 3.0
dtype: float64
As a side note, if you unstack it again, you get a more dense version o your original data with a unique index:
>>> ser.unstack()
Col_A Col_B
id
id3 11.0 5.0
id6 3.0 NaN
We can select the first item with a groupby rather than .unstack()
, for example:
>>> ser.groupby('id').first().rename('Col_score')
id
id3 11.0
id6 3.0
Name: Col_Score, dtype: float64
You can then .reset_index()
onto that to get a dataframe instead of a series.
How can I "join" rows with the same ID in pandas and add data as new columns
Let's unstack()
by tracking position using groupby()
+cumcount()
:
df['s']=df.groupby(['name','reference']).cumcount()+1
df=df.set_index(['s','name','reference']).unstack(0)
df.columns=[f"{x}{y}" for x,y in df.columns]
df=df.reset_index()
output of df
:
name reference item1 item2 item3 item4 amount1 amount2 amount3 amount4
0 jane 9876 chair pole NaN NaN 15.0 30.0 NaN NaN
1 john 1234 chair table table pole 40.0 10.0 20.0 10.0
Pandas DataFrame: Merge rows with same id
I was looking for a way to do it without the "apply" function, for better runtime by using pandas build-in functions.
Compare runtimes with and without apply function:
dataset:
data_temp1 = {'timestamp':np.concatenate([np.arange(0,30000,1)]*2), 'code':[6,6, 5]*20000, 'code_2':[6,6, 5]*20000, 'q1':[0.134555,0.984554565478545, 54]*20000, 'q2':[9.7079931640624864,None, 43]*20000, 'q3':[10.25475688648455,None, 54]*20000}
df = pd.DataFrame(data_temp1)
Solution by the use of apply similar to @Andrej Kesely example:
- 7.21 s ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution without apply by my solution:
- 98.4 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
My solution:
(Will fill the empty cells only if exist. So, it's right according to both of your cases).
- Sort the rows by the number of empty cells
- Fill each row in each group by below row (Its ok because with sort them first)
- Remove rows with empty cells
columns_to_groupby = ["timestamp", "code"]
# Sort rows of a dataframe in descending order of None counts
df = df.iloc[df.isnull().sum(1).sort_values(ascending=True).index].set_index(columns_to_groupby)
# group by timestamp column, fill the None cells if exists, delete the incomplete rows (from which we filled in the others)
df.groupby(df.index).bfill().dropna()
Examples:
Example 1:
Input:
Result:
Example 2 (with row without empty cell):
Input:
Result:
As you can see, same result for both of them.
Merge multiple rows in pandas Dataframe based on multiple column values
Does this do what you want?
df.groupby(["id", "date", "freq", "year"]).first().reset_index()
Output:
id date freq year c1 c2 c3
0 C35600010 20080922 A 2004 d20040331 s2003 s3
1 C35600010 20080922 Q 2004 None None s3
2 C35600010 20080923 A 2004 None None s3
Related Topics
Regex to Append Some Characters in a Certain Position
Converting Two Lists into a Matrix
Finding the Maximum Number of Columns in a File or CSV Using Python
What Is the Correct Format to Write Float Value to File in Python
Strip White Spaces from CSV File
How to Remove a Pandas Dataframe from Another Dataframe
Python: Element Is Not Attached to the Page Document
Only Reading First N Rows of CSV File With CSV Reader in Python
Regex That Matches a Number With Commas for Every Three Digits
Splitting Dictionary Items into Smaller Dictionaries Based on Condition
How to Track the Number of Times a Function Is Called
Robot Framework Using Python, Key Press Without Selecting Any Button or Element in the Page
Replacing Pandas or Numpy Nan With a None to Use With Mysqldb
How to Read a List of Parquet Files from S3 as a Pandas Dataframe Using Pyarrow
Pandas Dataframe Check If Column Value Exists in a Group of Columns
Check If Values of Multiple Columns Are the Same (Python)
Pandas Update and Add Rows One Dataframe With Key Column in Another Dataframe
How to Disable the Security Certificate Check in Python Requests