Coalesce Values from 2 Columns into a Single Column in a Pandas Dataframe


Use combine_first():

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))

In [17]: df.loc[::2, 'a'] = np.nan

In [18]: df
Out[18]:
     a  b
0  NaN  0
1  5.0  5
2  NaN  8
3  2.0  8
4  NaN  3
5  9.0  4
6  NaN  7
7  2.0  0
8  NaN  6
9  2.0  5

In [19]: df['c'] = df.a.combine_first(df.b)

In [20]: df
Out[20]:
     a  b    c
0  NaN  0  0.0
1  5.0  5  5.0
2  NaN  8  8.0
3  2.0  8  2.0
4  NaN  3  3.0
5  9.0  4  9.0
6  NaN  7  7.0
7  2.0  0  2.0
8  NaN  6  6.0
9  2.0  5  2.0
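For this two-column case, a fillna on the first column is an equivalent, arguably more explicit spelling (a small sketch with made-up values):

```python
import numpy as np
import pandas as pd

# same shape of data as above: 'a' has NaNs on some rows
df = pd.DataFrame({'a': [np.nan, 5.0, np.nan, 2.0],
                   'b': [0, 5, 8, 8]})

# take 'a' where it is not NaN, otherwise fall back to 'b'
df['c'] = df['a'].fillna(df['b'])
print(df)
```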

How to convert multiple sets of columns to a single column in pandas?

You are essentially asking how to coalesce the values of certain DataFrame columns into one column. You can do it like this:

from random import choice
import pandas as pd

# all azimuth names
azi_names = [f"Azi_{i}" for i in range(5)]

# all distance names
dist_names = [f"Dist_{i}" for i in range(5)]

# put some values in; DataFrame.append was removed in pandas 2.0,
# so collect the rows first and build the frame in one go
rows = []
for i in range(20):
    k = choice(range(5))
    rows.append({f"Azi_{k}": i, f"Dist_{k}": i})

df = pd.DataFrame(rows, columns=azi_names + dist_names)

print(df)

which randomly creates:

    Azi_0  Azi_1  Azi_2  Azi_3  Azi_4  Dist_0  Dist_1  Dist_2  Dist_3  Dist_4
0     NaN    NaN    NaN    0.0    NaN     NaN     NaN     NaN     0.0     NaN
1     NaN    1.0    NaN    NaN    NaN     NaN     1.0     NaN     NaN     NaN
2     2.0    NaN    NaN    NaN    NaN     2.0     NaN     NaN     NaN     NaN
3     NaN    NaN    3.0    NaN    NaN     NaN     NaN     3.0     NaN     NaN
4     NaN    4.0    NaN    NaN    NaN     NaN     4.0     NaN     NaN     NaN
5     NaN    NaN    NaN    NaN    5.0     NaN     NaN     NaN     NaN     5.0
6     6.0    NaN    NaN    NaN    NaN     6.0     NaN     NaN     NaN     NaN
7     NaN    7.0    NaN    NaN    NaN     NaN     7.0     NaN     NaN     NaN
8     NaN    8.0    NaN    NaN    NaN     NaN     8.0     NaN     NaN     NaN
9     9.0    NaN    NaN    NaN    NaN     9.0     NaN     NaN     NaN     NaN
10    NaN    NaN   10.0    NaN    NaN     NaN     NaN    10.0     NaN     NaN
11   11.0    NaN    NaN    NaN    NaN    11.0     NaN     NaN     NaN     NaN
12   12.0    NaN    NaN    NaN    NaN    12.0     NaN     NaN     NaN     NaN
13    NaN    NaN   13.0    NaN    NaN     NaN     NaN    13.0     NaN     NaN
14    NaN   14.0    NaN    NaN    NaN     NaN    14.0     NaN     NaN     NaN
15    NaN    NaN    NaN   15.0    NaN     NaN     NaN     NaN    15.0     NaN
16    NaN    NaN    NaN    NaN   16.0     NaN     NaN     NaN     NaN    16.0
17    NaN    NaN   17.0    NaN    NaN     NaN     NaN    17.0     NaN     NaN
18    NaN    NaN    NaN    NaN   18.0     NaN     NaN     NaN     NaN    18.0
19    NaN    NaN    NaN   19.0    NaN     NaN     NaN     NaN    19.0     NaN

To coalesce this and keep only the filled values, use:

df2 = pd.DataFrame()

# propagates values and chooses first
df2["AZI"] = df[azi_names].bfill(axis=1).iloc[:, 0]
df2["DIS"] = df[dist_names].bfill(axis=1).iloc[:, 0]

print(df2)

to get a coalesced new df:

     AZI   DIS
0    0.0   0.0
1    1.0   1.0
2    2.0   2.0
3    3.0   3.0
4    4.0   4.0
5    5.0   5.0
6    6.0   6.0
7    7.0   7.0
8    8.0   8.0
9    9.0   9.0
10  10.0  10.0
11  11.0  11.0
12  12.0  12.0
13  13.0  13.0
14  14.0  14.0
15  15.0  15.0
16  16.0  16.0
17  17.0  17.0
18  18.0  18.0
19  19.0  19.0

Attribution: inspired by Erfan's answer to "Coalesce values from 2 columns into a single column in a pandas dataframe".

You may need to replace blank values (white space) with NaN in pandas for the data you have shown.
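A minimal sketch of that cleanup step, with made-up column names: whitespace-only strings become real NaN so that bfill can skip them:

```python
import numpy as np
import pandas as pd

# made-up data where "missing" cells are empty or whitespace strings
df = pd.DataFrame({'Azi_0': ['1', ' ', ''],
                   'Azi_1': ['', '2', '3']})

# turn empty/whitespace-only cells into proper NaN
df = df.replace(r'^\s*$', np.nan, regex=True)
print(df)
```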

Pandas combine/coalesce multiple columns into 1

Assuming there is always only one value per row across those three columns, as in your example, you could use df.sum(), which skips any NaN by default:

desired_dataframe = pd.DataFrame(base_dataframe['Name'])
desired_dataframe['Mark'] = base_dataframe.iloc[:, 1:4].sum(axis=1)

In case of potentially more values per row, it would perhaps be safer to use e.g. df.max() instead, which works in the same way.
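A small sketch with hypothetical mark columns m1/m2 illustrates the difference: on rows holding a single value the two agree, but where a row holds two values, sum adds them up while max keeps just one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['x', 'y', 'z'],
                   'm1': [7.0, np.nan, 3.0],
                   'm2': [np.nan, 5.0, 3.0]})

# rows 'x' and 'y' hold one value each: sum and max agree;
# row 'z' holds two, where sum double-counts (6.0) but max keeps 3.0
print(df.iloc[:, 1:3].sum(axis=1).tolist())
print(df.iloc[:, 1:3].max(axis=1).tolist())
```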

How to Coalesce datetime values from 3 columns into a single column in a pandas dataframe?

You are right, a mix of bfill and ffill along the columns axis should do it:

df.assign(ACTUAL_START_DATE=df.filter(like='DATE')
                              .bfill(axis=1)
                              .ffill(axis=1)
                              .min(axis=1))

   CLIENT_ID DATE_BEGIN DATE_START DATE_REGISTERED ACTUAL_START_DATE
0          1 2020-01-01 2020-01-01      2020-01-01        2020-01-01
1          2 2020-01-02 2020-02-01      2020-01-01        2020-01-01
2          3        NaN 2020-05-01      2020-04-01        2020-04-01
3          4 2020-01-01 2020-01-01             NaN        2020-01-01
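For completeness, a reproducible sketch of the example above (the input frame is reconstructed here from the printed output):

```python
import pandas as pd

df = pd.DataFrame({
    'CLIENT_ID': [1, 2, 3, 4],
    'DATE_BEGIN': pd.to_datetime(['2020-01-01', '2020-01-02', None, '2020-01-01']),
    'DATE_START': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-05-01', '2020-01-01']),
    'DATE_REGISTERED': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-04-01', None]),
})

# fill gaps across the date columns, then take the earliest per row
out = df.assign(ACTUAL_START_DATE=df.filter(like='DATE')
                                    .bfill(axis=1)
                                    .ffill(axis=1)
                                    .min(axis=1))
print(out)
```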

Creating another column in pandas df based on partially empty columns

Backfill values from id2 to id1. Extract the numbers. Convert to int then str.

Given:

    id1   id2
0  ID01  ID01
1   NaN  ID03
2  ID07   NaN
3  ID08  ID08

Doing:

df['college_name'] = 'College' + (df.bfill(axis=1)['id1']
                                    .str.extract(r'(\d+)', expand=False)
                                    .astype(int)
                                    .astype(str))

Output:

    id1   id2 college_name
0  ID01  ID01     College1
1   NaN  ID03     College3
2  ID07   NaN     College7
3  ID08  ID08     College8

To check for rows where the ids are different:

Given:

    id1   id2
0  ID01  ID01
1   NaN  ID03
2  ID07   NaN
3  ID08  ID98

Doing:

print(df[df.id1.ne(df.id2) & df.id1.notna() & df.id2.notna()])

Output:

    id1   id2
3  ID08  ID98

Is there a better, more readable way to coalesce columns in pandas

You could use pd.isnull to find the null -- in this case None -- values:

In [169]: pd.isnull(df)
Out[169]:
   first second  third
0  False  False  False
1   True  False  False
2   True   True  False
3   True   True   True
4  False   True  False

and then use np.argmin to find the index of the first non-null value. If all the values are null, np.argmin returns 0:

In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])

Then you could select the desired values from df using NumPy integer-indexing:

In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}],
                  columns=['first', 'second', 'third'])

mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]
order = np.array([1,2,0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]

yields

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

Using argmin instead of a row-wise apply(coalesce, ...) (where coalesce is the Python function from the question) is significantly quicker if the DataFrame has a lot of rows:

df2 = pd.concat([df]*1000)

In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop

In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop

Apply Coalesce after grouping on two columns in pandas

It looks like you want to groupby consecutive blocks of ID. If so:

blocks = df['ID'].ne(df['ID'].shift()).cumsum()

agg_dict = {k: 'first' if k != 'end-time' else 'last'
            for k in df.columns}
df.groupby(blocks).agg(agg_dict)
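A small worked example with made-up data: consecutive runs of the same ID collapse into one row, taking the first value of every column except end-time, which takes the last:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'a', 'b', 'a'],
                   'start-time': [1, 2, 5, 9],
                   'end-time': [2, 3, 6, 10]})

# a new block starts whenever ID differs from the previous row
blocks = df['ID'].ne(df['ID'].shift()).cumsum()

agg_dict = {k: 'first' if k != 'end-time' else 'last'
            for k in df.columns}
out = df.groupby(blocks).agg(agg_dict)
print(out)
```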

Pandas Coalesce Multiple Columns, NaN

The chained fillna calls for cusip are overly complicated. You can replace them with a single bfill:

final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
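The effect of that bfill can be seen in a minimal sketch with made-up identifiers: each row ends up with its first non-null id in the cusip position:

```python
import numpy as np
import pandas as pd

# hypothetical identifier columns: each row has a value in at most one
final = pd.DataFrame({'cusip': ['037833100', np.nan, np.nan],
                      'isin':  [np.nan, 'US0378331005', np.nan],
                      'Deal':  [np.nan, np.nan, 'D-42']})

# back-fill across the row, then read the 'cusip' column, which now
# holds the first non-null value of each row
key = final[['cusip', 'isin', 'Deal']].bfill(axis=1)['cusip']
print(key.tolist())
```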

Coalesce Pandas DataFrame DOWN Columns

Try:

print(df.bfill().head(1))

Prints:

   Col A  Col B  Col C  Col D Col E
0  Row 1   20.0    4.0    1.0  text

