First non-null value per row from a list of Pandas columns
This is a rather messy way to do it: first use first_valid_index to get the first valid column label for each row, convert the returned Series to a DataFrame so that apply can be called row-wise, and then use that label to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
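For a version you can run as-is, here is a minimal sketch of the cleaner variant. The frame is made-up sample data (the original df is not shown above), and the helper name first_valid is hypothetical. It returns np.nan instead of None for rows with no valid value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the original question's frame is not shown.
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],
    "B": [np.nan, 3.0, np.nan, np.nan],
    "C": [2.0, 5.0, 4.0, np.nan],
})

def first_valid(row):
    # Label of the first non-null entry in this row, or None if all are null.
    idx = row.first_valid_index()
    return np.nan if idx is None else row[idx]

# One value per row: 1.0, 3.0, 4.0 and NaN for the all-null last row.
result = df.apply(first_valid, axis=1)
print(result)
```

Calling first_valid_index only once per row also avoids the repeated lookup in the version above.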
Get first non-null value per row
Back-fill the NaNs along each row first, then select the first column with iloc:
df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
Or:
df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
print (df)
ID c1 c2 c3 c4 result
0 1 a b a NaN a
1 2 NaN cc dd cc cc
2 3 NaN ee ff ee ee
3 4 NaN NaN gg gg gg
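To see the fillna('unknown') fallback actually kick in, here is a small sketch on a made-up frame that includes an all-NaN row (the data and the cols list are assumptions for illustration, not the frame from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row has no valid value at all.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "c1": ["a", np.nan, np.nan],
    "c2": ["b", "cc", np.nan],
    "c3": [np.nan, "dd", np.nan],
})
cols = ["c1", "c2", "c3"]

# bfill(axis=1) pulls the first non-null value of each row into the first
# column; rows with no valid value stay NaN and get the placeholder.
df["result"] = df[cols].bfill(axis=1).iloc[:, 0].fillna("unknown")
print(df)
```

Selecting the columns by name, as here, keeps the ID column out of the back-fill.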
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.78 ms per loop
In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.7 ms per loop
#jpp solution
In [222]: %%timeit
...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
...:
...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
...:
1 loop, best of 3: 180 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ's solution
In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
1 loop, best of 3: 606 ms per loop
First column name with non null value by row pandas
You can apply first_valid_index to each row in the dataframe using a lambda expression with axis=1 to specify rows.
>>> df.apply(lambda row: row.first_valid_index(), axis=1)
ID
0 Y2
1 Y3
2 None
3 Y1
dtype: object
To apply it to your dataframe:
df = df.assign(first = df.apply(lambda row: row.first_valid_index(), axis=1))
>>> df
Y1 Y2 Y3 first
ID
0 NaN 8 4 Y2
1 NaN NaN 1 Y3
2 NaN NaN NaN None
3 5 3 NaN Y1
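If the row-wise apply is too slow on a large frame, one possible alternative (not part of the answer above, so treat it as a sketch) is notna() combined with idxmax. The frame is rebuilt from the output shown above. Note that all-NaN rows come out as NaN here rather than None.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the printed output above.
df = pd.DataFrame(
    {
        "Y1": [np.nan, np.nan, np.nan, 5],
        "Y2": [8, np.nan, np.nan, 3],
        "Y3": [4, 1, np.nan, np.nan],
    },
    index=pd.Index([0, 1, 2, 3], name="ID"),
)

valid = df.notna()
# idxmax on a boolean frame gives the label of the first True in each row ...
first = valid.idxmax(axis=1)
# ... but it also returns the first label for all-False rows, so mask those.
first = first.where(valid.any(axis=1))
print(first)
```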
Pandas - find first non-null value in column
You can use first_valid_index and then select the value with loc:
s = pd.Series([np.nan,2,np.nan])
print (s)
0 NaN
1 2.0
2 NaN
dtype: float64
print (s.first_valid_index())
1
print (s.loc[s.first_valid_index()])
2.0
# If your Series contains ALL NaNs, you'll need to check as follows:
s = pd.Series([np.nan, np.nan, np.nan])
idx = s.first_valid_index() # Will return None
first_valid_value = s.loc[idx] if idx is not None else None
print(first_valid_value)
None
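The two cases above can be wrapped in a small helper. This is only a sketch; the function name first_valid_value and its default parameter are hypothetical, not a pandas API, and it assumes the Series index has unique labels (with duplicate labels, loc may return more than one value).

```python
import numpy as np
import pandas as pd

def first_valid_value(s, default=None):
    """Return the first non-null value of `s`, or `default` if there is none."""
    idx = s.first_valid_index()  # None when every entry is null
    return default if idx is None else s.loc[idx]

print(first_valid_value(pd.Series([np.nan, 2, np.nan])))
print(first_valid_value(pd.Series([np.nan, np.nan, np.nan])))
print(first_valid_value(pd.Series([np.nan, np.nan]), default=0))
```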
How to take the first non null element, row-wise, from a column that consists of lists?
You can do this directly, without the need for fifth_colum. Just stack the data frame. Since you want the first non-null element per row, your group is the first index level (level=0). So just take the first value in each group.
x['sixth_col'] = x.stack().groupby(level=0).first()
col_1 col_2 col_3 col_4 sixth_col
0 NaN 15.0 12.0 NaN 15.0
1 35.0 12.0 15.0 NaN 35.0
2 27.0 NaN 40.0 NaN 27.0
3 50.0 NaN NaN 5.0 50.0
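Here is a runnable sketch of the same idea, with the frame rebuilt from the output above. For comparison it also computes the result with a row-wise bfill, which is an alternative technique rather than part of this answer.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the printed output above.
x = pd.DataFrame({
    "col_1": [np.nan, 35.0, 27.0, 50.0],
    "col_2": [15.0, 12.0, np.nan, np.nan],
    "col_3": [12.0, 15.0, 40.0, np.nan],
    "col_4": [np.nan, np.nan, np.nan, 5.0],
})

# Alternative: back-fill along each row and take the first column.
via_bfill = x.bfill(axis=1).iloc[:, 0]

# Approach from the answer: stack to long form, then take the first
# non-null value within each original row label (index level 0).
x["sixth_col"] = x.stack().groupby(level=0).first()
print(x)
```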
Keep only the 1st non-null value in each row (and replace others with NaN)
One way to do it:
import pandas as pd
data = {'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: 9.0, 2: 8.0}}
df = pd.DataFrame(data)
def keep_first_valid(x):
    first_valid = x.first_valid_index()
    return x.mask(x.index != first_valid)
df = df.apply(keep_first_valid, axis=1)
df
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
- So, the first x passed to the function would consist of pd.Series([3.0, 10.0], index=['a','b']).
- Inside the function, first_valid = x.first_valid_index() will store 'a'; see df.first_valid_index.
- Finally, we apply s.mask to get pd.Series([3.0, None], index=['a','b']), which we assign back to the df.
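The steps above work row by row. A possible vectorized alternative (a sketch using a different technique, not the method described above) builds a boolean mask of "first non-null cell per row" with a running count and passes it to where:

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({"a": [3.0, 2.0, None], "b": [10.0, 9.0, 8.0]})

valid = df.notna()
# Running count of non-null cells along each row; the first valid cell is
# the one that is non-null and brings the count to exactly 1.
is_first = valid & valid.astype(int).cumsum(axis=1).eq(1)

# Keep only those cells; everything else becomes NaN.
kept = df.where(is_first)
print(kept)
```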
pandas group by and find first non null value for all columns
Use GroupBy.first:
df1 = df.groupby('id', as_index=False).first()
print (df1)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
If the sales_year column is not sorted:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
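A small self-contained sketch of GroupBy.first follows. The data is made up for illustration, since the original frame is not shown. One thing it demonstrates: first picks the first non-null value per column independently, so a result row can combine values from different source rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps spread over different rows per id.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "age": [np.nan, 20.0, 23.0, np.nan],
    "gender": ["M", None, None, "F"],
    "sales_year": [2016, 2017, 2016, 2015],
})

# For id 1, age comes from the second row and gender from the first.
df1 = df.groupby("id", as_index=False).first()
print(df1)
```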