How to Implement SQL Coalesce in Pandas

Coalesce values from 2 columns into a single column in a pandas dataframe

Use combine_first():

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))

In [17]: df.loc[::2, 'a'] = np.nan

In [18]: df
Out[18]:
     a  b
0  NaN  0
1  5.0  5
2  NaN  8
3  2.0  8
4  NaN  3
5  9.0  4
6  NaN  7
7  2.0  0
8  NaN  6
9  2.0  5

In [19]: df['c'] = df.a.combine_first(df.b)

In [20]: df
Out[20]:
     a  b    c
0  NaN  0  0.0
1  5.0  5  5.0
2  NaN  8  8.0
3  2.0  8  2.0
4  NaN  3  3.0
5  9.0  4  9.0
6  NaN  7  7.0
7  2.0  0  2.0
8  NaN  6  6.0
9  2.0  5  2.0
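
combine_first works pairwise, so SQL's three-argument COALESCE(a, b, c) maps to chained calls. A minimal sketch (s1, s2, s3 are hypothetical Series, not from the example above):

import numpy as np
import pandas as pd

s1 = pd.Series([np.nan, 1.0, np.nan])
s2 = pd.Series([np.nan, 8.0, 3.0])
s3 = pd.Series([7.0, 9.0, np.nan])

# COALESCE(s1, s2, s3): each combine_first fills the NaNs still left
coalesced = s1.combine_first(s2).combine_first(s3)  # 7.0, 1.0, 3.0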

Coalesce (SQL) functionality for Python Pandas

You're trying to cascade from the back, so I reverse the order of the columns with iloc. I follow that up with pd.DataFrame.notnull() to identify which cells are non-null. When I then run pd.DataFrame.idxmax, I get, for each row, the name of the first non-null column counting from the back. Finally, I use pd.DataFrame.lookup to fetch the values at those row/column pairs.

df.assign(
    category_id=df.iloc[:, ::-1].notnull().idxmax(1).pipe(
        lambda d: df.lookup(d.index, d.values)
    )
)

   category_id1  category_id2  category_id3  category_id4  category_id5  category_id6  category_id7  category_id
0         32991            22         33058           NaN           NaN           NaN           NaN        33058
1         32991            22            51           NaN           NaN           NaN           NaN           51
2         32991            22           121           NaN           NaN           NaN           NaN          121
3         32991            22           120           NaN           NaN           NaN           NaN          120
4         32991            22         32438           NaN           NaN           NaN           NaN        32438
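
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On newer versions, a minimal equivalent sketch using NumPy integer indexing (same df as above):

import numpy as np

# name of the first non-null column per row, scanning from the back
last_col = df.iloc[:, ::-1].notnull().idxmax(axis=1)
# translate those column names into positions, then pick one cell per row
col_pos = df.columns.get_indexer(last_col)
df = df.assign(category_id=df.to_numpy()[np.arange(len(df)), col_pos])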

Pandas combine/coalesce multiple columns into 1

Assuming there is always only one value per row across those three columns, as in your example, you could use df.sum(), which skips any NaN by default:

desired_dataframe = pd.DataFrame(base_dataframe['Name'])
desired_dataframe['Mark'] = base_dataframe.iloc[:, 1:4].sum(axis=1)

If rows could potentially contain more than one value, it would be safer to use e.g. df.max() instead, which works the same way.
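
A minimal sketch of this approach (the data and mark column names here are made up for illustration):

import numpy as np
import pandas as pd

base_dataframe = pd.DataFrame({
    'Name': ['Ann', 'Bob'],
    'Mark_A': [7.0, np.nan],
    'Mark_B': [np.nan, 5.0],
    'Mark_C': [np.nan, np.nan],
})

desired_dataframe = pd.DataFrame(base_dataframe['Name'])
# sum(axis=1) skips NaN; swap in .max(axis=1) if a row might hold several values
desired_dataframe['Mark'] = base_dataframe.iloc[:, 1:4].sum(axis=1)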

How can we use Coalesce in Python for multiple data frames using pandas

You can use pandas.DataFrame.combine. This method does what you need: it builds a dataframe taking elements from two dataframes according to a custom function.

You can then write a custom function which picks the element from dataframe one unless that is null, in which case the element is taken from dataframe two.

Consider the following two dataframes. I built them according to your examples, but with a small difference to emphasize that only null values are replaced:

import numpy as np
import pandas as pd

columnlist = ["EmpID", "Emp_Name", "Dept_id", "DeptName"]

df1 = pd.DataFrame([[1, None, 1, np.nan], [2, np.nan, 2, None]], columns=columnlist)
df2 = pd.DataFrame([[1, "XXX", 2, "Science"], [2, "YYY", 3, "Math"]], columns=columnlist)

They are:

df1
   EmpID Emp_Name  Dept_id DeptName
0      1      NaN        1      NaN
1      2      NaN        2      NaN

df2
   EmpID Emp_Name  Dept_id DeptName
0      1      XXX        2  Science
1      2      YYY        3     Math

What you need to do is:

ddf = df1.combine(df2, lambda ss, rep_ss: pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))

to get ddf:

ddf
   EmpID Emp_Name  Dept_id DeptName
0      1      XXX        1  Science
1      2      YYY        2     Math

As you can see, only null values in df1 have been replaced with the corresponding values from df2.

EDIT: A deeper explanation

Since I've been asked in the comments, let me explain the solution a bit more:

ddf = df1.combine(df2, lambda ss, rep_ss: pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))

This is a bit compact, but it involves nothing more than basic Python techniques like list comprehensions, plus pandas.DataFrame.combine, which is detailed in the pandas docs. That method processes the two dataframes column by column: each pair of columns is passed to the custom function, which must return a pandas.Series, and that Series becomes a column in the returned dataframe.

In this case, the custom function is a lambda, which uses a list comprehension to loop over the pairs of elements (one from each column) and picks one element from each pair: the first if it is not null, otherwise the second.
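
For this specific null-coalescing case, note that combine_first gives the same result more tersely, filling null cells of df1 from the matching cells of df2:

ddf = df1.combine_first(df2)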

Pandas, fillna/bfill to concat and coalesce fields

We can approach your problem in a general way as follows:

  1. First we create a temporary column called temp holding, per row, the first non-empty value across cusip, isin, Deal and Id (a backfill across columns).
  2. We insert that column after your bdr column.
  3. We convert your endOfDay column to datetime.
  4. We '|'.join the first four columns to create join_key.

Note: I added step 3 to keep the code general, so we don't hardcode 20191031.

# backfill across the four candidate columns, then keep the first (non-empty) one
s = df[['cusip', 'isin', 'Deal', 'Id']].replace('', np.nan).bfill(axis=1).iloc[:, 0]
df.insert(3, 'temp', s)

df['endOfDay'] = pd.to_datetime(df['endOfDay']).dt.strftime('%Y%m%d')

# join the first four columns (endOfDay, book, bdr, temp) with '|'
df['join_key'] = df.iloc[:, :4].apply(lambda x: '|'.join(x.astype(str).to_numpy()), axis=1)
df = df.drop(columns='temp')
   endOfDay  book   bdr      cusip          isin  Deal          Id                     join_key
0  20191031    15  ITOR  371494AM7  US371494AM77   161  8013210731   20191031|15|ITOR|371494AM7
1  20191031    15  ITOR                                 8011898573  20191031|15|ITOR|8011898573
2  20191031    15  ITOR                                 8011898742  20191031|15|ITOR|8011898742
3  20191031    15  ITOR                                 8011899418  20191031|15|ITOR|8011899418
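
An equivalent sketch that skips inserting and dropping the temporary column (same df and imports as above):

key_part = df[['cusip', 'isin', 'Deal', 'Id']].replace('', np.nan).bfill(axis=1).iloc[:, 0]
df['join_key'] = (df['endOfDay'].astype(str) + '|'
                  + df['book'].astype(str) + '|'
                  + df['bdr'].astype(str) + '|'
                  + key_part.astype(str))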

Apply SQL functions from within a DataFrame

You can use Spark's coalesce function:

import org.apache.spark.sql.functions.{coalesce, lit}

case class Foobar(foo: Option[Int], bar: Option[Int])

val df = sc.parallelize(Seq(
  Foobar(Some(1), None), Foobar(None, Some(2)),
  Foobar(Some(3), Some(4)), Foobar(None, None))).toDF

df.select(coalesce($"foo", $"bar", lit("--"))).show

// +--------------------+
// |coalesce(foo,bar,--)|
// +--------------------+
// |                   1|
// |                   2|
// |                   3|
// |                  --|
// +--------------------+
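
For completeness, since the rest of this page is Python: a minimal PySpark sketch of the same call (it assumes a running SparkSession named spark):

from pyspark.sql.functions import coalesce, col, lit

df = spark.createDataFrame(
    [(1, None), (None, 2), (3, 4), (None, None)],
    ["foo", "bar"],
)

# coalesce takes the first non-null column per row, falling back to "--"
df.select(coalesce(col("foo"), col("bar"), lit("--"))).show()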

