First Non-Null Value Per Row from a List of Pandas Columns

First non-null value per row from a list of Pandas columns

This is a really messy way to do it: first use first_valid_index to get the first valid column for each row, convert the returned Series to a DataFrame so we can call apply row-wise, and use the result to index back into the original df:

In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]

pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)

Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64

EDIT

A slightly cleaner way:

In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

df.apply(func, axis=1)

Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
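The original df is never shown in this answer; the following is an illustrative guess at a minimal frame consistent with the Out[] blocks above (first non-null values 1, 3, 4, and an all-NaN last row), run through the cleaner version of func:

```python
import numpy as np
import pandas as pd

# Hypothetical input: one frame consistent with the outputs above
df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan]})

def func(x):
    # Return None for all-NaN rows, else the value at the first valid column
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

result = df.apply(func, axis=1)
print(result)
```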

Get first non-null value per row

Back fill the NaNs along the rows first, then select the first column with iloc:

df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')

Or:

df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')

print (df)
ID c1 c2 c3 c4 result
0 1 a b a NaN a
1 2 NaN cc dd cc cc
2 3 NaN ee ff ee ee
3 4 NaN NaN gg gg gg
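The input frame is not constructed in the answer; a sketch reconstructed from the printed table above (reading the column values off the rows) is:

```python
import numpy as np
import pandas as pd

# Reconstructed from the printed table above (not shown in the original answer)
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'c1': ['a', np.nan, np.nan, np.nan],
                   'c2': ['b', 'cc', 'ee', np.nan],
                   'c3': ['a', 'dd', 'ff', 'gg'],
                   'c4': [np.nan, 'cc', 'ee', 'gg']})

# Back fill along rows, take the first column, default to 'unknown'
df['result'] = df[['c1', 'c2', 'c3', 'c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
print(df)
```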

Performance:

df = pd.concat([df] * 1000, ignore_index=True)

In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.78 ms per loop

In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.7 ms per loop

#jpp solution
In [222]: %%timeit
...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
...:
...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
...:
1 loop, best of 3: 180 ms per loop

# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
1 loop, best of 3: 606 ms per loop

First column name with non null value by row pandas

You can apply first_valid_index to each row in the dataframe using a lambda expression with axis=1 to specify rows.

>>> df.apply(lambda row: row.first_valid_index(), axis=1)
ID
0 Y2
1 Y3
2 None
3 Y1
dtype: object

To apply it to your dataframe:

df = df.assign(first=df.apply(lambda row: row.first_valid_index(), axis=1))

>>> df
Y1 Y2 Y3 first
ID
0 NaN 8 4 Y2
1 NaN NaN 1 Y3
2 NaN NaN NaN None
3 5 3 NaN Y1
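For larger frames, a vectorized alternative (a different technique, not from the original answer; the frame is reconstructed from the output above) uses notna plus argmax, guarding the all-NaN row, which argmax would otherwise map to the first column:

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the output above
df = pd.DataFrame({'Y1': [np.nan, np.nan, np.nan, 5],
                   'Y2': [8, np.nan, np.nan, 3],
                   'Y3': [4, 1, np.nan, np.nan]})

mask = df.notna().to_numpy()
# argmax returns the position of the first True per row, but also 0 for
# all-False rows, so all-NaN rows are replaced with None explicitly
first_col = np.where(mask.any(axis=1), df.columns[mask.argmax(axis=1)], None)
print(pd.Series(first_col, index=df.index))
```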

Pandas - find first non-null value in column

You can use first_valid_index and then select with loc:

s = pd.Series([np.nan,2,np.nan])
print (s)
0 NaN
1 2.0
2 NaN
dtype: float64

print (s.first_valid_index())
1

print (s.loc[s.first_valid_index()])
2.0

# If your Series contains ALL NaNs, you'll need to check as follows:

s = pd.Series([np.nan, np.nan, np.nan])
idx = s.first_valid_index() # Will return None
first_valid_value = s.loc[idx] if idx is not None else None
print(first_valid_value)
None
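An equivalent sketch that sidesteps the index lookup entirely is to drop the NaNs and take the first remaining element (same guard needed for all-NaN input):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, np.nan])

# dropna() removes the NaNs; iloc[0] is then the first non-null value.
# Check for emptiness first so an all-NaN Series yields None, not an error.
valid = s.dropna()
first_value = valid.iloc[0] if not valid.empty else None
print(first_value)  # 2.0
```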

How to take the first non null element, row-wise, from a column that consists of lists?

You can do this directly, without needing fifth_column. Just stack the data frame: since you want the first non-null element per row, your group is the first index level (level=0), so take the first value within each group.

x['sixth_col'] = x.stack().groupby(level=0).first()

col_1 col_2 col_3 col_4 sixth_col
0 NaN 15.0 12.0 NaN 15.0
1 35.0 12.0 15.0 NaN 35.0
2 27.0 NaN 40.0 NaN 27.0
3 50.0 NaN NaN 5.0 50.0
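The frame x is not constructed in the answer; a sketch reconstructed from the table above verifies the stack/groupby approach end to end:

```python
import numpy as np
import pandas as pd

# Reconstructed from the printed table above
x = pd.DataFrame({'col_1': [np.nan, 35.0, 27.0, 50.0],
                  'col_2': [15.0, 12.0, np.nan, np.nan],
                  'col_3': [12.0, 15.0, 40.0, np.nan],
                  'col_4': [np.nan, np.nan, np.nan, 5.0]})

# stack() drops NaNs; grouping by level=0 (the original row label)
# and taking first() yields each row's first non-null value
x['sixth_col'] = x.stack().groupby(level=0).first()
print(x)
```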

Keep only the 1st non-null value in each row (and replace others with NaN)

One way to go would be:

import pandas as pd

data = {'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: 9.0, 2: 8.0}}

df = pd.DataFrame(data)

def keep_first_valid(x):
    first_valid = x.first_valid_index()
    return x.mask(x.index != first_valid)

df = df.apply(lambda x: keep_first_valid(x), axis=1)
df

a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
  • So, the first x passed to the function is pd.Series([3.0, 10.0], index=['a','b']).
  • Inside the function, first_valid = x.first_valid_index() stores 'a'; see Series.first_valid_index.
  • Finally, x.mask(x.index != first_valid) yields pd.Series([3.0, None], index=['a','b']), which apply assigns back to the df.
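For larger frames, the same result can be sketched without apply (a different, vectorized technique, not from the original answer): keep only the cells where the row-wise cumulative count of non-nulls equals 1, i.e. the first non-null in each row.

```python
import pandas as pd

df = pd.DataFrame({'a': [3.0, 2.0, None], 'b': [10.0, 9.0, 8.0]})

# notna().cumsum(axis=1) counts non-nulls left to right in each row;
# the first non-null cell is exactly where that count equals 1,
# and where() replaces every other cell with NaN
out = df.where(df.notna().cumsum(axis=1).eq(1))
print(out)
```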

pandas group by and find first non null value for all columns

Use GroupBy.first:

df1 = df.groupby('id', as_index=False).first()
print (df1)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019

If column sales_year is not sorted:

df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
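The input frame is not shown; the following is one possible input consistent with the grouped output above (an illustrative guess), demonstrating that GroupBy.first skips NaNs per column within each group:

```python
import numpy as np
import pandas as pd

# Hypothetical input consistent with the grouped output above
df = pd.DataFrame({'id': [1, 1, 2, 3, 3, 4],
                   'age': [np.nan, 20.0, 23.0, 30.0, np.nan, 36.0],
                   'gender': ['M', np.nan, 'F', np.nan, 'M', np.nan],
                   'country': ['India'] * 6,
                   'sales_year': [2016, 2016, 2016, 2019, 2019, 2019]})

# first() returns the first non-null value per column within each group,
# so id=1 picks age from its second row but gender from its first
df1 = df.groupby('id', as_index=False).first()
print(df1)
```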

