Pandas Group by and Find First Non Null Value for All Columns


Use GroupBy.first:

df1 = df.groupby('id', as_index=False).first()
print (df1)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019

If column sales_year is not sorted:

df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
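
The question's input frame isn't shown above, so here is a minimal hypothetical df (invented values) on which GroupBy.first behaves as described, picking the first non-null value per column within each id group:

import numpy as np
import pandas as pd

# hypothetical input: several rows per id, with missing values scattered across columns
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'age': [np.nan, 20, 23, np.nan],
    'gender': ['M', np.nan, np.nan, 'F'],
    'country': ['India', np.nan, 'India', np.nan],
    'sales_year': [2016, 2016, 2016, 2016],
})

# first() returns, per column, the first non-null value found in each group
print (df.groupby('id', as_index=False).first())
#    id   age gender country  sales_year
# 0   1  20.0      M   India        2016
# 1   2  23.0      F   India        2016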

Pandas - find first non-null value in column

You can use first_valid_index and then select the value with loc:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, np.nan])
print (s)
0    NaN
1    2.0
2    NaN
dtype: float64

print (s.first_valid_index())
1

print (s.loc[s.first_valid_index()])
2.0

# If your Series contains ALL NaNs, you'll need to check as follows:

s = pd.Series([np.nan, np.nan, np.nan])
idx = s.first_valid_index() # Will return None
first_valid_value = s.loc[idx] if idx is not None else None
print(first_valid_value)
None

Pandas Grouping by Id and getting non-NaN values

This should do what you want:

df.groupby('salesforce_id').first().reset_index(drop=True)

That will merge all rows sharing a salesforce_id into one, keeping the first non-NaN value for each column (unless a column has no non-NaN values at all for that id, in which case the value in the merged row stays NaN).
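
A quick sketch with made-up data (the asker's frame isn't shown) illustrates that behaviour, including the all-NaN case:

import numpy as np
import pandas as pd

# hypothetical frame: two rows per salesforce_id, each holding part of the information
df = pd.DataFrame({
    'salesforce_id': ['a1', 'a1', 'b2', 'b2'],
    'email': ['x@y.com', np.nan, np.nan, np.nan],
    'phone': [np.nan, '555-1234', '555-9999', np.nan],
})

print (df.groupby('salesforce_id').first().reset_index(drop=True))
#      email     phone
# 0  x@y.com  555-1234
# 1      NaN  555-9999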

First non-null value per row from a list of Pandas columns

This is a really messy way to do it: first use first_valid_index to get the first valid column for each row, convert the returned Series to a DataFrame so we can call apply row-wise, and use that result to index back into the original df:

In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]

pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)

Out[160]:
0      1
1      3
2      4
3    NaN
dtype: float64

EDIT

A slightly cleaner way:

In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

df.apply(func, axis=1)

Out[12]:
0      1
1      3
2      4
3    NaN
dtype: float64
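
The df used above isn't included in the answer; a hypothetical frame like the following reproduces the same output:

import numpy as np
import pandas as pd

# hypothetical columns where the row-wise first non-null values are 1, 3, 4 and nothing
df = pd.DataFrame({
    'a': [1, np.nan, np.nan, np.nan],
    'b': [np.nan, 3, np.nan, np.nan],
    'c': [np.nan, np.nan, 4, np.nan],
})

def func(x):
    if x.first_valid_index() is None:
        return None
    return x[x.first_valid_index()]

print (df.apply(func, axis=1))
# 0    1.0
# 1    3.0
# 2    4.0
# 3    NaN
# dtype: float64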

Getting first non null value after group by function

So for each anonymous_id I would return the first (chronological,
sorted by the ts column) utm_source associated with the anon_id

IIUC you can first drop the null values and then groupby first:

df.sort_values('ts').dropna(subset=['utm_source']).groupby('anonymous_id')['utm_source'].first()

Output for your example data:

anonymous_id
00015d49-2cd8-41b1-bbe7-6aedbefdb098    facebook
0002226e-26a4-4f55-9578-2eff2999de7e    facebook
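
The source frame isn't shown; in the sketch below the ids come from the output above and everything else is invented, just to make the one-liner runnable end to end:

import numpy as np
import pandas as pd

# hypothetical events: per anonymous_id, the earliest row with a utm_source wins
df = pd.DataFrame({
    'anonymous_id': ['00015d49-2cd8-41b1-bbe7-6aedbefdb098'] * 2
                    + ['0002226e-26a4-4f55-9578-2eff2999de7e'] * 2,
    'ts': pd.to_datetime(['2021-01-02', '2021-01-01', '2021-01-01', '2021-01-03']),
    'utm_source': ['google', 'facebook', np.nan, 'facebook'],
})

out = (df.sort_values('ts')
         .dropna(subset=['utm_source'])
         .groupby('anonymous_id')['utm_source']
         .first())
print (out)
# anonymous_id
# 00015d49-2cd8-41b1-bbe7-6aedbefdb098    facebook
# 0002226e-26a4-4f55-9578-2eff2999de7e    facebook
# Name: utm_source, dtype: object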

Groupby and take first value without losing any value in a column

I came up with a solution. I'm using groupby and then checking, for each group, whether I actually want to deduplicate it or not. It seems highly inefficient, but it does the trick. If someone comes up with a better solution, I will gladly accept it.

First, we add columns counting the number of unique IDs per group, boolean columns indicating whether each row has a value in each ID column, and lastly a count of the number of IDs per row (useful for sorting the dataframe later).

df["first_id_count"] = df.groupby(["name", "city"])["my_first_id"].transform('nunique')
df["second_id_count"] = df.groupby(["name", "city"])["my_second_id"].transform('nunique')

def check_if_id_is_present(x):
    try:
        if not(pd.isnull(x)):
            return True
        else:
            return False
    except:
        return False

df["my_first_id_present"] = df["my_first_id"].apply(check_if_id_is_present)
df["my_second_id_present"] = df["my_second_id"].apply(check_if_id_is_present)

def create_count_ids_per_row(x):
    try:
        count = 0
        if not(pd.isnull(x[0])):
            count += 1
        if not(pd.isnull(x[1])):
            count += 1
        return count
    except:
        return 0

df["ids_count"] = df[["my_first_id", "my_second_id"]].apply(create_count_ids_per_row, axis=1)

Then, we can create the groupby object and iterate over each group.

df_final = pd.DataFrame()
# the groupby object iterated over below (implied by the group keys used throughout)
grouped = df.groupby(["name", "city"])
ids_to_deduplicate = ["first_id_count", "second_id_count"]
ids_present = ["my_first_id_present", "my_second_id_present"]

for name, group in grouped:
    if group["first_id_count"].iloc[0] < 2 and group["second_id_count"].iloc[0] < 2:
        # if there are strictly less than 2 unique ids of my_first_id and my_second_id
        # then we can safely deduplicate and add to the final dataframe
        df_final = pd.concat([df_final, group.groupby(["name", "city"]).first().reset_index()])
    else:
        # if not, we have to separate the dataframe into 2:
        # one part we want to deduplicate
        # one part we shouldn't touch
        df_duplicate = group.copy()
        # first, we sort by the number of ids per row
        df_duplicate = df_duplicate.sort_values(by=["ids_count"], ascending=False)
        # and reset the index, for ease of use
        df_duplicate.reset_index(inplace=True)
        df_duplicate = df_duplicate.drop("index", axis=1)

        # rows we want to deduplicate
        rows_to_deduplicate = []
        # rows we want to keep
        rows_to_keep = []
        # create a list with flags for each id column
        flags_list = [False] * len(ids_to_deduplicate)
        for idx in df_duplicate.index:
            flag = False
            # first, we check if one of our flags is set to True at the same time as our row in df_duplicate
            # in this case, it means that we already have a row with an id, and we have another row with another id
            # so we want to keep this row to not lose any information
            for idx_id, my_id in enumerate(ids_to_deduplicate):
                if flags_list[idx_id] and df_duplicate.loc[idx, ids_present[idx_id]]:
                    # we add it to the rows to keep
                    rows_to_keep.append(idx)
                    flag = True
                    break
            if flag:
                continue
            # now, we know that we want to deduplicate this row, otherwise we would have flagged it
            for idx_id, my_id in enumerate(ids_to_deduplicate):
                if not(flags_list[idx_id]) and df_duplicate.loc[idx, ids_present[idx_id]]:
                    # we add it to the rows to deduplicate
                    rows_to_deduplicate.append(idx)
                    # we have to set all the corresponding flags in flags_list
                    # there can be several ids on a row, so we loop over every id column
                    for idx_id_temp, id_to_check in enumerate(ids_to_deduplicate):
                        if df_duplicate.loc[idx, ids_present[idx_id_temp]]:
                            flags_list[idx_id_temp] = True
                    break

        # now we have our 2 separate dataframes
        df_duplicate_keep = df_duplicate.loc[rows_to_keep].copy()
        df_duplicate_deduplicate = df_duplicate.loc[rows_to_deduplicate].copy()

        # we can keep one, and deduplicate the other, and concatenate the result
        df_final_duplicate = pd.concat([df_duplicate_keep, df_duplicate_deduplicate.groupby(["name", "city"]).first().reset_index()])

        # and add the result to our final dataframe
        df_final = pd.concat([df_final, df_final_duplicate])

And to clean our mess:

df_final = df_final.drop("ids_count", axis=1)
for col in ids_to_deduplicate:
df_final = df_final.drop(col, axis=1)
for col in ids_present:
df_final = df_final.drop(col, axis=1)

And we have the desired output.

Again, this seems really ugly, so if anyone has a better solution, feel free to share.

How do I get the first non-null value from multiple columns based on another datetime column order and grouped by ID?

(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
   .agg(lambda x: x.bfill().head(1)).reset_index())

    id site  type
0  111    A   1.0
1  222    C   1.0

Note that if you are sure there is at least one non-NaN value per id, then you can do:

(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
   .first().reset_index())

    id site  type
0  111    A   1.0
1  222    C   1.0
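
For reference, a hypothetical frame consistent with that output (ids taken from the output, everything else invented) makes the simpler variant runnable:

import numpy as np
import pandas as pd

# hypothetical data: per id, site/type may be missing on the most recent rows
df = pd.DataFrame({
    'id': [111, 111, 222, 222],
    'ts': pd.to_datetime(['2021-05-02', '2021-05-01', '2021-05-04', '2021-05-03']),
    'site': [np.nan, 'A', 'C', np.nan],
    'type': [np.nan, 1.0, np.nan, 1.0],
})

out = (df.sort_values('ts', ascending=False)
         .groupby('id')[['site', 'type']]
         .first()            # first() skips NaNs, taking the first non-null per column
         .reset_index())
print (out)
#     id site  type
# 0  111    A   1.0
# 1  222    C   1.0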

Fill all values in a group with the first non-null value in that group

Edit

The following seems better:

nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)

print(df)

Original

I can't think of a better way to do this than iterating over all the rows, but one might exist. First I built your DataFrame:

import pandas as pd
import math

# Build your DataFrame
# (a plain dict constructor is used here; pd.DataFrame.from_items no longer exists in current pandas)
df = pd.DataFrame({
    'cluster': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5],
    'Value': [float('nan') for _ in range(20)],
})
df['Value'] = df['Value'].astype(object)
df.at[ 0,'Value'] = 'A'
df.at[ 7,'Value'] = 'B'
df.at[11,'Value'] = 'C'
df.at[14,'Value'] = 'S'
df.at[17,'Value'] = 'A'

Now here's an approach that first creates a nan_map dict, then sets the values in Value as specified in the dict.

# Create a dict to map clusters to unique values
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
# nan_map: {1: 'A', 2: 'B', 3: 'C', 4: 'S', 5: 'A'}

# Apply
for i, row in df.iterrows():
    df.at[i,'Value'] = nan_map[row['cluster']]

print(df)

Output:


    cluster Value
0         1     A
1         1     A
2         1     A
3         1     A
4         1     A
5         2     B
6         2     B
7         2     B
8         2     B
9         3     C
10        3     C
11        3     C
12        3     C
13        4     S
14        4     S
15        4     S
16        5     A
17        5     A
18        5     A
19        5     A

Note: This sets all values based on the cluster and doesn't check for NaN-ness. You may want to experiment with something like:

# Apply
for i, row in df.iterrows():
    if isinstance(df.at[i,'Value'], float) and math.isnan(df.at[i,'Value']):
        df.at[i,'Value'] = nan_map[row['cluster']]

to see which is more efficient (my guess is the former, without the checks).


