Coalesce values from 2 columns into a single column in a pandas dataframe
Use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0
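As a minimal, self-contained sketch (with made-up data), combine_first fills the NaNs in one Series from another; for the two-column case, fillna gives the same result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 5.0, np.nan, 2.0], "b": [0, 5, 8, 8]})

# combine_first fills NaNs in 'a' with the corresponding values from 'b'
df["c"] = df["a"].combine_first(df["b"])

# fillna gives the same result for this two-column case
df["d"] = df["a"].fillna(df["b"])

print(df)
```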
How to convert multiple set of column to single column in pandas?
You are essentially asking how to coalesce the values of certain DataFrame columns into one column. You can do it like this:
from random import choice
import pandas as pd
# all azimuth names
azi_names = [f"Azi_{i}" for i in range(5)]
# all distance names
dist_names = [f"Dist_{i}" for i in range(5)]
df = pd.DataFrame(columns=azi_names + dist_names)
# put some values in; DataFrame.append was removed in pandas 2.0,
# so collect the rows first and concat once
rows = []
for i in range(20):
    k = choice(range(5))
    rows.append({f"Azi_{k}": i, f"Dist_{k}": i})
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
print(df)
which randomly creates:
Azi_0 Azi_1 Azi_2 Azi_3 Azi_4 Dist_0 Dist_1 Dist_2 Dist_3 Dist_4
0 NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN
1 NaN 1.0 NaN NaN NaN NaN 1.0 NaN NaN NaN
2 2.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
3 NaN NaN 3.0 NaN NaN NaN NaN 3.0 NaN NaN
4 NaN 4.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
5 NaN NaN NaN NaN 5.0 NaN NaN NaN NaN 5.0
6 6.0 NaN NaN NaN NaN 6.0 NaN NaN NaN NaN
7 NaN 7.0 NaN NaN NaN NaN 7.0 NaN NaN NaN
8 NaN 8.0 NaN NaN NaN NaN 8.0 NaN NaN NaN
9 9.0 NaN NaN NaN NaN 9.0 NaN NaN NaN NaN
10 NaN NaN 10.0 NaN NaN NaN NaN 10.0 NaN NaN
11 11.0 NaN NaN NaN NaN 11.0 NaN NaN NaN NaN
12 12.0 NaN NaN NaN NaN 12.0 NaN NaN NaN NaN
13 NaN NaN 13.0 NaN NaN NaN NaN 13.0 NaN NaN
14 NaN 14.0 NaN NaN NaN NaN 14.0 NaN NaN NaN
15 NaN NaN NaN 15.0 NaN NaN NaN NaN 15.0 NaN
16 NaN NaN NaN NaN 16.0 NaN NaN NaN NaN 16.0
17 NaN NaN 17.0 NaN NaN NaN NaN 17.0 NaN NaN
18 NaN NaN NaN NaN 18.0 NaN NaN NaN NaN 18.0
19 NaN NaN NaN 19.0 NaN NaN NaN NaN 19.0 NaN
To coalesce this and keep only the filled values, use:
df2 = pd.DataFrame()
# propagates values and chooses first
df2["AZI"] = df[azi_names].bfill(axis=1).iloc[:, 0]
df2["DIS"] = df[dist_names].bfill(axis=1).iloc[:, 0]
print(df2)
to get a coalesced new df:
AZI DIS
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
4 4.0 4.0
5 5.0 5.0
6 6.0 6.0
7 7.0 7.0
8 8.0 8.0
9 9.0 9.0
10 10.0 10.0
11 11.0 11.0
12 12.0 12.0
13 13.0 13.0
14 14.0 14.0
15 15.0 15.0
16 16.0 16.0
17 17.0 17.0
18 18.0 18.0
19 19.0 19.0
Attribution: inspired by Erfan's answer to Coalesce values from 2 columns into a single column in a pandas dataframe.
For data like yours, you may first need to replace blank values (white space) with NaN.
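A compact, deterministic sketch of the same bfill trick, using a few hypothetical hard-coded rows instead of random ones:

```python
import pandas as pd

azi_names = [f"Azi_{i}" for i in range(3)]
dist_names = [f"Dist_{i}" for i in range(3)]

# each row fills exactly one Azi/Dist pair, as in the random example above
df = pd.DataFrame([
    {"Azi_1": 0, "Dist_1": 0},
    {"Azi_0": 1, "Dist_0": 1},
    {"Azi_2": 2, "Dist_2": 2},
], columns=azi_names + dist_names)

df2 = pd.DataFrame()
# bfill(axis=1) moves each row's first non-NaN value into the leftmost column
df2["AZI"] = df[azi_names].bfill(axis=1).iloc[:, 0]
df2["DIS"] = df[dist_names].bfill(axis=1).iloc[:, 0]
print(df2)
```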
Pandas combine/coalesce multiple columns into 1
Assuming there is always only one value per row across those three columns, as in your example, you could use df.sum(axis=1), which skips NaN by default:
desired_dataframe = pd.DataFrame(base_dataframe['Name'])
desired_dataframe['Mark'] = base_dataframe.iloc[:, 1:4].sum(axis=1)
In case of potentially more values per row, it would perhaps be safer to use e.g. df.max(axis=1) instead, which skips NaN in the same way.
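Since base_dataframe is not shown in the question, here is a hypothetical stand-in (column names invented) illustrating the sum(axis=1) approach:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for base_dataframe: one mark per row across three columns
base_dataframe = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Math": [90, np.nan, np.nan],
    "Physics": [np.nan, 85, np.nan],
    "Chemistry": [np.nan, np.nan, 78],
})

desired_dataframe = pd.DataFrame(base_dataframe["Name"])
# sum(axis=1) skips NaN, so the single filled value per row survives
desired_dataframe["Mark"] = base_dataframe.iloc[:, 1:4].sum(axis=1)
print(desired_dataframe)
```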
How to Coalesce datetime values from 3 columns into a single column in a pandas dataframe?
You are right, a mix of bfill and ffill along the columns axis should do it:
df.assign(ACTUAL_START_DATE=df.filter(like='DATE')
                              .bfill(axis=1)
                              .ffill(axis=1)
                              .min(axis=1))
CLIENT_ID DATE_BEGIN DATE_START DATE_REGISTERED ACTUAL_START_DATE
0 1 2020-01-01 2020-01-01 2020-01-01 2020-01-01
1 2 2020-01-02 2020-02-01 2020-01-01 2020-01-01
2 3 NaN 2020-05-01 2020-04-01 2020-04-01
3 4 2020-01-01 2020-01-01 NaN 2020-01-01
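A self-contained sketch of the table above; note that min(axis=1) skips missing dates (NaT) by itself, so it can do the coalescing on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "CLIENT_ID": [1, 2, 3, 4],
    "DATE_BEGIN": pd.to_datetime(["2020-01-01", "2020-01-02", None, "2020-01-01"]),
    "DATE_START": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-05-01", "2020-01-01"]),
    "DATE_REGISTERED": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-04-01", None]),
})

# min over the date columns picks the earliest non-missing date per row
out = df.assign(ACTUAL_START_DATE=df.filter(like="DATE").min(axis=1))
print(out)
```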
Creating another column in pandas df based on partially empty columns
Backfill values from id2 to id1. Extract the numbers. Convert to int, then str.
Given:
id1 id2
0 ID01 ID01
1 NaN ID03
2 ID07 NaN
3 ID08 ID08
Doing:
df['college_name'] = 'College' + (df.bfill(axis=1)['id1']
                                    .str.extract(r'(\d+)')
                                    .astype(int)
                                    .astype(str))
Output:
id1 id2 college_name
0 ID01 ID01 College1
1 NaN ID03 College3
2 ID07 NaN College7
3 ID08 ID08 College8
To check for rows where the ids are different:
Given:
id1 id2
0 ID01 ID01
1 NaN ID03
2 ID07 NaN
3 ID08 ID98
Doing:
print(df[df.id1.ne(df.id2) & df.id1.notna() & df.id2.notna()])
Output:
id1 id2
3 ID08 ID98
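Putting the first case together as a runnable sketch, with the data reconstructed by hand and expand=False so str.extract returns a Series rather than a one-column DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id1": ["ID01", np.nan, "ID07", "ID08"],
                   "id2": ["ID01", "ID03", np.nan, "ID08"]})

# bfill(axis=1) copies id2 into id1 where id1 is missing; the digits are then
# extracted and the leading zeros dropped by the round-trip through int
df["college_name"] = "College" + (df.bfill(axis=1)["id1"]
                                    .str.extract(r"(\d+)", expand=False)
                                    .astype(int)
                                    .astype(str))
print(df)
```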
Is there a better more readable way to coalese columns in pandas
You could use pd.isnull to find the null -- in this case None -- values:
In [169]: pd.isnull(df)
Out[169]:
first second third
0 False False False
1 True False False
2 True True False
3 True True True
4 False True False
and then use np.argmin to find the index of the first non-null value. If all the values are null, np.argmin returns 0:
In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])
Then you could select the desired values from df using NumPy integer indexing:
In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]
order = np.array([1,2,0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]
yields
first second third combo1 combo2
0 A C B A C
1 None C B C C
2 None None B B B
3 None None None None None
4 A None B A B
Using argmin instead of df3.apply(coalesce, ...) is significantly quicker if the DataFrame has a lot of rows:
df2 = pd.concat([df]*1000)
In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop
In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop
Apply Coalesce after grouping on two columns in pandas
It looks like you want to group by consecutive blocks of ID. If so:
blocks = df['ID'].ne(df['ID'].shift()).cumsum()
agg_dict = {k: 'first' if k != 'end-time' else 'last'
            for k in df.columns}
df.groupby(blocks).agg(agg_dict)
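The question's data is not shown, so here is a runnable sketch with hypothetical start-time/end-time columns:

```python
import pandas as pd

# hypothetical data: consecutive rows with the same ID form one block
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "A"],
    "start-time": [1, 2, 5, 6, 9],
    "end-time": [2, 3, 6, 7, 10],
})

# a new block starts whenever ID differs from the previous row
blocks = df["ID"].ne(df["ID"].shift()).cumsum()

# keep the first value of every column, except end-time which takes the last
agg_dict = {k: "first" if k != "end-time" else "last" for k in df.columns}
out = df.groupby(blocks).agg(agg_dict)
print(out)
```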
Pandas Coalesce Multiple Columns, NaN
The last chained fillna for cusip is too complicated. You may change it to bfill:
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
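final is not shown in the question, so here is a hypothetical frame (invented identifiers) demonstrating the bfill fallback chain: cusip, then isin, then Deal, then Id:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for `final`: cusip may be missing and falls back
# to the first filled value among isin, Deal, Id
final = pd.DataFrame({
    "book": [1, 2],
    "bdr": ["X", "Y"],
    "cusip": ["037833100", np.nan],
    "isin": [np.nan, "US0378331005"],
    "Deal": [np.nan, np.nan],
    "Id": ["d1", "d2"],
})

final["join_key"] = (final["book"].astype("str") +
                     final["bdr"] +
                     final[["cusip", "isin", "Deal", "Id"]]
                         .bfill(axis=1)["cusip"].astype(str))
print(final["join_key"])
```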
Coalesce Pandas DataFrame DOWN Columns
Try:
print(df.bfill().head(1))
Prints:
Col A Col B Col C Col D Col E
0 Row 1 20.0 4.0 1.0 text
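A hypothetical reconstruction of such a frame, where the default bfill works down each column and pulls its single value up into the first row:

```python
import numpy as np
import pandas as pd

# hypothetical frame where each column's single value sits on a different row
df = pd.DataFrame({
    "Col A": ["Row 1", np.nan, np.nan],
    "Col B": [np.nan, 20.0, np.nan],
    "Col C": [np.nan, np.nan, 4.0],
})

# bfill (down the columns by default) backfills row 0 from the rows below
out = df.bfill().head(1)
print(out)
```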