How to Implement SQL Coalesce in Pandas

Coalesce values from 2 columns into a single column in a pandas dataframe

Use combine_first():

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))

In [17]: df.loc[::2, 'a'] = np.nan

In [18]: df
Out[18]:
     a  b
0  NaN  0
1  5.0  5
2  NaN  8
3  2.0  8
4  NaN  3
5  9.0  4
6  NaN  7
7  2.0  0
8  NaN  6
9  2.0  5

In [19]: df['c'] = df.a.combine_first(df.b)

In [20]: df
Out[20]:
     a  b    c
0  NaN  0  0.0
1  5.0  5  5.0
2  NaN  8  8.0
3  2.0  8  2.0
4  NaN  3  3.0
5  9.0  4  9.0
6  NaN  7  7.0
7  2.0  0  2.0
8  NaN  6  6.0
9  2.0  5  2.0
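
combine_first works pairwise, so SQL's three-argument COALESCE(a, b, c) maps to chained calls. A minimal sketch (s1, s2, s3 are hypothetical Series, not from the example above):

import numpy as np
import pandas as pd

s1 = pd.Series([np.nan, 1.0, np.nan])
s2 = pd.Series([np.nan, 8.0, 3.0])
s3 = pd.Series([7.0, 9.0, np.nan])

# COALESCE(s1, s2, s3): each combine_first fills the NaNs still left
coalesced = s1.combine_first(s2).combine_first(s3)  # 7.0, 1.0, 3.0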

Coalesce (SQL) functionality for Python Pandas

You're trying to cascade from the back, so I reverse the order of the columns with iloc. I follow that up with pd.DataFrame.notnull() to identify which cells are non-null. When I then run pd.DataFrame.idxmax, I get, for each row, the name of the first non-null column counting from the back. Finally, I use pd.DataFrame.lookup to fetch the values at those row/column pairs.

df.assign(
    category_id=df.iloc[:, ::-1].notnull().idxmax(1).pipe(
        lambda d: df.lookup(d.index, d.values)
    )
)

   category_id1  category_id2  category_id3  category_id4  category_id5  category_id6  category_id7  category_id
0         32991            22         33058           NaN           NaN           NaN           NaN        33058
1         32991            22            51           NaN           NaN           NaN           NaN           51
2         32991            22           121           NaN           NaN           NaN           NaN          121
3         32991            22           120           NaN           NaN           NaN           NaN          120
4         32991            22         32438           NaN           NaN           NaN           NaN        32438
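
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On newer versions, a minimal equivalent sketch using NumPy integer indexing (same df as above):

import numpy as np

# name of the first non-null column per row, scanning from the back
last_col = df.iloc[:, ::-1].notnull().idxmax(axis=1)
# translate those column names into positions, then pick one cell per row
col_pos = df.columns.get_indexer(last_col)
df = df.assign(category_id=df.to_numpy()[np.arange(len(df)), col_pos])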

Pandas combine/coalesce multiple columns into 1

Assuming there is always only one value per row across those three columns, as in your example, you could use df.sum(), which skips any NaN by default:

desired_dataframe = pd.DataFrame(base_dataframe['Name'])
desired_dataframe['Mark'] = base_dataframe.iloc[:, 1:4].sum(axis=1)

If rows could potentially contain more than one value, it would be safer to use e.g. df.max() instead, which works the same way.
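
A minimal sketch of this approach (the data and mark column names here are made up for illustration):

import numpy as np
import pandas as pd

base_dataframe = pd.DataFrame({
    'Name': ['Ann', 'Bob'],
    'Mark_A': [7.0, np.nan],
    'Mark_B': [np.nan, 5.0],
    'Mark_C': [np.nan, np.nan],
})

desired_dataframe = pd.DataFrame(base_dataframe['Name'])
# sum(axis=1) skips NaN; swap in .max(axis=1) if a row might hold several values
desired_dataframe['Mark'] = base_dataframe.iloc[:, 1:4].sum(axis=1)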

How can we use Coalesce in Python for multiple data frames using pandas

You can use pandas.DataFrame.combine. This method does what you need: it builds a dataframe taking elements from two dataframes according to a custom function.

You can then write a custom function which picks the element from dataframe one unless that is null, in which case the element is taken from dataframe two.

Consider the following two dataframes. I built them according to your examples, but with a small difference to emphasize that only null values are replaced:

import numpy as np
import pandas as pd

columnlist = ["EmpID", "Emp_Name", "Dept_id", "DeptName"]

df1 = pd.DataFrame([[1, None, 1, np.nan], [2, np.nan, 2, None]], columns=columnlist)
df2 = pd.DataFrame([[1, "XXX", 2, "Science"], [2, "YYY", 3, "Math"]], columns=columnlist)

They are:

df1
   EmpID Emp_Name  Dept_id DeptName
0      1      NaN        1      NaN
1      2      NaN        2      NaN

df2
   EmpID Emp_Name  Dept_id DeptName
0      1      XXX        2  Science
1      2      YYY        3     Math

What you need to do is:

ddf = df1.combine(df2, lambda ss, rep_ss: pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))

to get ddf:

ddf
   EmpID Emp_Name  Dept_id DeptName
0      1      XXX        1  Science
1      2      YYY        2     Math

As you can see, only null values in df1 have been replaced with the corresponding values from df2.

EDIT: A deeper explanation

Since I've been asked in the comments, let me explain the solution a bit more:

ddf = df1.combine(df2, lambda ss, rep_ss: pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))

This is a bit compact, but it involves nothing more than basic Python techniques like list comprehensions, plus pandas.DataFrame.combine, which is detailed in the pandas docs. That method processes the two dataframes column by column: each pair of columns is passed to the custom function, which must return a pandas.Series, and that Series becomes a column in the returned dataframe.

In this case, the custom function is a lambda, which uses a list comprehension to loop over the pairs of elements (one from each column) and picks one element from each pair: the first if it is not null, otherwise the second.
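
For this specific null-coalescing case, note that combine_first gives the same result more tersely, filling null cells of df1 from the matching cells of df2:

ddf = df1.combine_first(df2)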

Pandas, fillna/bfill to concat and coalesce fields

We can approach your problem in a general way as follows:

  1. First we create a temporary column called temp holding, per row, the first non-empty value across cusip, isin, Deal and Id (a backfill across columns).
  2. We insert that column after your bdr column.
  3. We convert your endOfDay column to datetime.
  4. We '|'.join the first four columns to create join_key.

Note: I added step 3 to keep the code general, so we don't hardcode 20191031.

# backfill across the four candidate columns, then keep the first (non-empty) one
s = df[['cusip', 'isin', 'Deal', 'Id']].replace('', np.nan).bfill(axis=1).iloc[:, 0]
df.insert(3, 'temp', s)

df['endOfDay'] = pd.to_datetime(df['endOfDay']).dt.strftime('%Y%m%d')

# join the first four columns (endOfDay, book, bdr, temp) with '|'
df['join_key'] = df.iloc[:, :4].apply(lambda x: '|'.join(x.astype(str).to_numpy()), axis=1)
df = df.drop(columns='temp')
   endOfDay  book   bdr      cusip          isin  Deal          Id                     join_key
0  20191031    15  ITOR  371494AM7  US371494AM77   161  8013210731   20191031|15|ITOR|371494AM7
1  20191031    15  ITOR                                 8011898573  20191031|15|ITOR|8011898573
2  20191031    15  ITOR                                 8011898742  20191031|15|ITOR|8011898742
3  20191031    15  ITOR                                 8011899418  20191031|15|ITOR|8011899418
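
An equivalent sketch that skips inserting and dropping the temporary column (same df and imports as above):

key_part = df[['cusip', 'isin', 'Deal', 'Id']].replace('', np.nan).bfill(axis=1).iloc[:, 0]
df['join_key'] = (df['endOfDay'].astype(str) + '|'
                  + df['book'].astype(str) + '|'
                  + df['bdr'].astype(str) + '|'
                  + key_part.astype(str))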

Apply SQL functions from within a DataFrame

You can use Spark's coalesce function:

import org.apache.spark.sql.functions.{coalesce, lit}

case class Foobar(foo: Option[Int], bar: Option[Int])

val df = sc.parallelize(Seq(
  Foobar(Some(1), None), Foobar(None, Some(2)),
  Foobar(Some(3), Some(4)), Foobar(None, None))).toDF

df.select(coalesce($"foo", $"bar", lit("--"))).show

// +--------------------+
// |coalesce(foo,bar,--)|
// +--------------------+
// |                   1|
// |                   2|
// |                   3|
// |                  --|
// +--------------------+
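
For completeness, since the rest of this page is Python: a minimal PySpark sketch of the same call (it assumes a running SparkSession named spark):

from pyspark.sql.functions import coalesce, col, lit

df = spark.createDataFrame(
    [(1, None), (None, 2), (3, 4), (None, None)],
    ["foo", "bar"],
)

# coalesce takes the first non-null column per row, falling back to "--"
df.select(coalesce(col("foo"), col("bar"), lit("--"))).show()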

