pandas group by and find first non null value for all columns
Use GroupBy.first:
df1 = df.groupby('id', as_index=False).first()
print (df1)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
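A runnable sketch of the idea, using hypothetical data that matches the output above (the exact rows are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data: each id has several partial rows with NaNs
# scattered across the columns.
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3, 4, 4],
    'age': [20.0, np.nan, np.nan, 23.0, 30.0, np.nan, 36.0, np.nan],
    'gender': ['M', np.nan, 'F', np.nan, np.nan, 'M', np.nan, np.nan],
    'country': [np.nan, 'India', 'India', np.nan, 'India', np.nan, np.nan, 'India'],
    'sales_year': [2016, 2016, 2016, 2016, 2019, 2019, 2019, 2019],
})

# GroupBy.first returns the first non-null value per column within each group
df1 = df.groupby('id', as_index=False).first()
print(df1)
```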
If the column sales_year is not sorted:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
Pandas - find first non-null value in column
You can use first_valid_index with selection by loc:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, np.nan])
print (s)
0    NaN
1    2.0
2    NaN
dtype: float64
print (s.first_valid_index())
1
print (s.loc[s.first_valid_index()])
2.0
# If your Series contains ALL NaNs, you'll need to check as follows:
s = pd.Series([np.nan, np.nan, np.nan])
idx = s.first_valid_index() # Will return None
first_valid_value = s.loc[idx] if idx is not None else None
print(first_valid_value)
None
Pandas Grouping by Id and getting non-NaN values
This should do what you want:
df.groupby('salesforce_id').first().reset_index(drop=True)
That will merge all the columns into one, keeping only the non-NaN value for each run (unless there are no non-NaN values in all the columns for that row; then the value in the final merged column will be NaN).
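A minimal sketch of that behavior, with made-up column names and values (the original question's data isn't shown here):

```python
import numpy as np
import pandas as pd

# Hypothetical example: two partial rows per salesforce_id that together
# hold one complete record.
df = pd.DataFrame({
    'salesforce_id': ['a1', 'a1', 'b2', 'b2'],
    'email': ['x@example.com', np.nan, np.nan, np.nan],
    'phone': [np.nan, '555-0100', '555-0199', np.nan],
})

# first() takes the first non-NaN value per column within each id group
merged = df.groupby('salesforce_id').first().reset_index(drop=True)
print(merged)
```

Note that `b2` has no email at all, so its merged email stays NaN, as described above.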
First non-null value per row from a list of Pandas columns
This is a really messy way to do this: first use first_valid_index to get the first valid column for each row, convert the returned series to a dataframe so we can call apply row-wise, and use this to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]

pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

df.apply(func, axis=1)
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
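A more vectorized idiom for the same per-row task (not from the answer above, and it assumes the DataFrame contains only the candidate columns): back-fill across columns, then take the first column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan, np.nan],
    'b': [np.nan, 3, np.nan, np.nan],
    'c': [np.nan, np.nan, 4, np.nan],
})

# bfill along axis=1 pulls each row's first non-null value into the
# leftmost column; rows that are entirely NaN stay NaN.
first_non_null = df.bfill(axis=1).iloc[:, 0]
print(first_non_null)
```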
Getting first non null value after group by function
So for each anonymous_id I would return the first (chronological,
sorted by the ts column) utm_source associated with the anon_id
IIUC you can first drop the null values and then groupby first:
df.sort_values('ts').dropna(subset=['utm_source']).groupby('anonymous_id')['utm_source'].first()
Output for your example data:
anonymous_id
00015d49-2cd8-41b1-bbe7-6aedbefdb098 facebook
0002226e-26a4-4f55-9578-2eff2999de7e facebook
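A runnable sketch of that pipeline, with shortened hypothetical ids and dates standing in for the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical events: per anonymous_id, several timestamps, some with
# a null utm_source.
df = pd.DataFrame({
    'anonymous_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'ts': pd.to_datetime([
        '2021-01-03', '2021-01-01', '2021-01-02',
        '2021-01-02', '2021-01-01',
    ]),
    'utm_source': [np.nan, np.nan, 'facebook', 'google', np.nan],
})

# Sort chronologically, drop rows with no utm_source, then take the
# first remaining value per anonymous_id.
out = (df.sort_values('ts')
         .dropna(subset=['utm_source'])
         .groupby('anonymous_id')['utm_source']
         .first())
print(out)
```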
Groupby and take first value without losing any value in a column
I came up with a solution. I'm using groupby and then checking each group to decide whether to deduplicate it or not. It seems highly inefficient but it does the trick. If someone can come up with a better solution, I will gladly accept it.
First, we add columns counting the number of unique IDs per group, and boolean columns indicating, for each row, whether an ID is present.
Lastly, we add a count of the number of IDs per row (useful for sorting the dataframe later).
df["first_id_count"] = df.groupby(["name", "city"])["my_first_id"].transform('nunique')
df["second_id_count"] = df.groupby(["name", "city"])["my_second_id"].transform('nunique')
def check_if_id_is_present(x):
    try:
        if not pd.isnull(x):
            return True
        else:
            return False
    except:
        return False
df["my_first_id_present"] = df["my_first_id"].apply(check_if_id_is_present)
df["my_second_id_present"] = df["my_second_id"].apply(check_if_id_is_present)
def create_count_ids_per_row(x):
    try:
        count = 0
        if not pd.isnull(x[0]):
            count += 1
        if not pd.isnull(x[1]):
            count += 1
        return count
    except:
        return 0
df["ids_count"] = df[["my_first_id", "my_second_id"]].apply(create_count_ids_per_row, axis=1)
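As an aside, the two apply-based helpers above can be replaced with vectorized calls. A sketch, using the same (hypothetical) column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'b'],
    'city': ['x', 'x', 'y'],
    'my_first_id': [1, np.nan, 2],
    'my_second_id': [np.nan, 3, np.nan],
})

# notna() gives the per-cell presence boolean directly...
df['my_first_id_present'] = df['my_first_id'].notna()
df['my_second_id_present'] = df['my_second_id'].notna()
# ...and summing notna() across columns gives the per-row id count.
df['ids_count'] = df[['my_first_id', 'my_second_id']].notna().sum(axis=1)
print(df)
```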
Then, we can start the groupby and iterate over each group.
df_final = pd.DataFrame()
ids_to_deduplicate = ["first_id_count", "second_id_count"]
ids_present = ["my_first_id_present", "my_second_id_present"]
grouped = df.groupby(["name", "city"])

for name, group in grouped:
    if group["first_id_count"].iloc[0] < 2 and group["second_id_count"].iloc[0] < 2:
        # if there are strictly fewer than 2 unique ids of my_first_id and my_second_id
        # then we can safely deduplicate and add to the final dataframe
        df_final = pd.concat([df_final, group.groupby(["name", "city"]).first().reset_index()])
    else:
        # if not, we have to separate the dataframe into 2:
        # one part we want to deduplicate, one part we shouldn't touch
        df_duplicate = group.copy()
        # first, we sort by the number of ids per row
        df_duplicate = df_duplicate.sort_values(by=["ids_count"], ascending=False)
        # and reset the index, for ease of use
        df_duplicate.reset_index(inplace=True)
        df_duplicate = df_duplicate.drop("index", axis=1)
        # rows we want to deduplicate
        rows_to_deduplicate = []
        # rows we want to keep
        rows_to_keep = []
        # create a list with flags for each id column
        flags_list = [False] * len(ids_to_deduplicate)
        for idx in df_duplicate.index:
            flag = False
            # first, we check if one of our flags is already True for an id
            # present on this row; in that case we already kept a row with
            # that id, and this row carries another id, so we keep it to
            # avoid losing any information
            for idx_id, my_id in enumerate(ids_to_deduplicate):
                if flags_list[idx_id] and df_duplicate.loc[idx, ids_present[idx_id]]:
                    # we add it to the rows to keep
                    rows_to_keep.append(idx)
                    flag = True
                    break
            if flag:
                continue
            # now we know we want to deduplicate this row, otherwise we would have flagged it
            for idx_id, my_id in enumerate(ids_to_deduplicate):
                if not flags_list[idx_id] and df_duplicate.loc[idx, ids_present[idx_id]]:
                    # we add it to the rows to deduplicate
                    rows_to_deduplicate.append(idx)
                    # we have to set all the corresponding flags;
                    # there can be several ids on a row, so we loop over them
                    for idx_id_temp, id_to_check in enumerate(ids_to_deduplicate):
                        if df_duplicate.loc[idx, ids_present[idx_id_temp]]:
                            flags_list[idx_id_temp] = True
                    break
        # now we have our 2 separate dataframes
        df_duplicate_keep = df_duplicate.loc[rows_to_keep].copy()
        df_duplicate_deduplicate = df_duplicate.loc[rows_to_deduplicate].copy()
        # we can keep one, deduplicate the other, and concatenate the result
        df_final_duplicate = pd.concat([df_duplicate_keep, df_duplicate_deduplicate.groupby(["name", "city"]).first().reset_index()])
        # and add the result to our final dataframe
        df_final = pd.concat([df_final, df_final_duplicate])
And to clean our mess:
df_final = df_final.drop("ids_count", axis=1)
for col in ids_to_deduplicate:
df_final = df_final.drop(col, axis=1)
for col in ids_present:
df_final = df_final.drop(col, axis=1)
And we have the desired output.
Again, this seems really ugly, so if anyone has a better solution, feel free to share.
How do I get the first non-null value from multiple columns based on another datetime column order and grouped by ID?
(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
.agg(lambda x:x.bfill().head(1)).reset_index())
id site type
0 111 A 1.0
1 222 C 1.0
Note that if you are sure there is at least one non-NaN value per id, then you can do:
(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
.first().reset_index())
id site type
0 111 A 1.0
1 222 C 1.0
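A runnable sketch of the second variant, with hypothetical data shaped to match the output shown. The intermediate .bfill() is left out here because GroupBy.first already skips NaNs within each group:

```python
import numpy as np
import pandas as pd

# Hypothetical data: per id, rows ordered by ts, with NaNs in site/type.
df = pd.DataFrame({
    'id': [111, 111, 222, 222],
    'ts': pd.to_datetime(['2021-01-01', '2021-01-02',
                          '2021-01-01', '2021-01-02']),
    'site': ['A', np.nan, np.nan, 'C'],
    'type': [np.nan, 1.0, 1.0, np.nan],
})

# Sort newest-first, then take the first non-null site/type per id
out = (df.sort_values('ts', ascending=False)
         .groupby('id')[['site', 'type']]
         .first()
         .reset_index())
print(out)
```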
Fill all values in a group with the first non-null value in that group
Edit
The following seems better:
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
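A self-contained run of that dict-and-map approach, on a smaller hypothetical frame (one non-null Value per cluster, as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cluster': [1, 1, 2, 2, 2],
    'Value': ['A', np.nan, np.nan, 'B', np.nan],
})

# Build a cluster -> value dict from the non-null rows, then map it back
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
```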
Original
I can't think of a better way to do this than to iterate over all the rows, but one might exist. First I built your DataFrame:
import pandas as pd
import math
# Build your DataFrame (pd.DataFrame.from_items was removed in pandas 1.0,
# so a plain dict is used here)
df = pd.DataFrame({
    'cluster': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5],
    'Value': [float('nan') for _ in range(20)],
})
df['Value'] = df['Value'].astype(object)
df.at[ 0,'Value'] = 'A'
df.at[ 7,'Value'] = 'B'
df.at[11,'Value'] = 'C'
df.at[14,'Value'] = 'S'
df.at[17,'Value'] = 'A'
Now here's an approach that first creates a nan_map
dict, then sets the values in Value
as specified in the dict.
# Create a dict to map clusters to unique values
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
# nan_map: {1: 'A', 2: 'B', 3: 'C', 4: 'S', 5: 'A'}
# Apply
for i, row in df.iterrows():
    df.at[i, 'Value'] = nan_map[row['cluster']]
print(df)
Output:
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Note: This sets all values based on the cluster and doesn't check for NaN-ness. You may want to experiment with something like:
# Apply
for i, row in df.iterrows():
    if isinstance(df.at[i, 'Value'], float) and math.isnan(df.at[i, 'Value']):
        df.at[i, 'Value'] = nan_map[row['cluster']]
to see which is more efficient (my guess is the former, without the checks).