pandas group by and find first non null value for all columns
Use GroupBy.first:
df1 = df.groupby('id', as_index=False).first()
print (df1)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
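A runnable sketch of the idea, using hypothetical data that matches the output above (the exact rows are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data: each id has several partial rows with NaNs
# scattered across the columns.
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3, 4, 4],
    'age': [20.0, np.nan, np.nan, 23.0, 30.0, np.nan, 36.0, np.nan],
    'gender': ['M', np.nan, 'F', np.nan, np.nan, 'M', np.nan, np.nan],
    'country': [np.nan, 'India', 'India', np.nan, 'India', np.nan, np.nan, 'India'],
    'sales_year': [2016, 2016, 2016, 2016, 2019, 2019, 2019, 2019],
})

# GroupBy.first returns the first non-null value per column within each group
df1 = df.groupby('id', as_index=False).first()
print(df1)
```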
If the column sales_year is not sorted:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
Pandas - find first non-null value in column
You can use first_valid_index with selection by loc:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, np.nan])
print (s)
0    NaN
1    2.0
2    NaN
dtype: float64
print (s.first_valid_index())
1
print (s.loc[s.first_valid_index()])
2.0
# If your Series contains ALL NaNs, you'll need to check as follows:
s = pd.Series([np.nan, np.nan, np.nan])
idx = s.first_valid_index() # Will return None
first_valid_value = s.loc[idx] if idx is not None else None
print(first_valid_value)
None
Pandas Grouping by Id and getting non-NaN values
This should do what you want:
df.groupby('salesforce_id').first().reset_index(drop=True)
That will merge all the columns into one, keeping only the non-NaN value for each run (unless there are no non-NaN values in all the columns for that row; then the value in the final merged column will be NaN).
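A minimal sketch of that behavior, with made-up column names and values (the original question's data isn't shown here):

```python
import numpy as np
import pandas as pd

# Hypothetical example: two partial rows per salesforce_id that together
# hold one complete record.
df = pd.DataFrame({
    'salesforce_id': ['a1', 'a1', 'b2', 'b2'],
    'email': ['x@example.com', np.nan, np.nan, np.nan],
    'phone': [np.nan, '555-0100', '555-0199', np.nan],
})

# first() takes the first non-NaN value per column within each id group
merged = df.groupby('salesforce_id').first().reset_index(drop=True)
print(merged)
```

Note that `b2` has no email at all, so its merged email stays NaN, as described above.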
First non-null value per row from a list of Pandas columns
This is a really messy way to do this: first use first_valid_index to get the first valid column for each row, convert the returned series to a dataframe so we can call apply row-wise, and use this to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]

pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

df.apply(func, axis=1)
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
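A more vectorized idiom for the same per-row task (not from the answer above, and it assumes the DataFrame contains only the candidate columns): back-fill across columns, then take the first column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan, np.nan],
    'b': [np.nan, 3, np.nan, np.nan],
    'c': [np.nan, np.nan, 4, np.nan],
})

# bfill along axis=1 pulls each row's first non-null value into the
# leftmost column; rows that are entirely NaN stay NaN.
first_non_null = df.bfill(axis=1).iloc[:, 0]
print(first_non_null)
```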
Getting first non null value after group by function
So for each anonymous_id I would return the first (chronological,
sorted by the ts column) utm_source associated with the anon_id
IIUC you can first drop the null values and then groupby first:
df.sort_values('ts').dropna(subset=['utm_source']).groupby('anonymous_id')['utm_source'].first()
Output for your example data:
anonymous_id
00015d49-2cd8-41b1-bbe7-6aedbefdb098 facebook
0002226e-26a4-4f55-9578-2eff2999de7e facebook
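A runnable sketch of that pipeline, with shortened hypothetical ids and dates standing in for the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical events: per anonymous_id, several timestamps, some with
# a null utm_source.
df = pd.DataFrame({
    'anonymous_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'ts': pd.to_datetime([
        '2021-01-03', '2021-01-01', '2021-01-02',
        '2021-01-02', '2021-01-01',
    ]),
    'utm_source': [np.nan, np.nan, 'facebook', 'google', np.nan],
})

# Sort chronologically, drop rows with no utm_source, then take the
# first remaining value per anonymous_id.
out = (df.sort_values('ts')
         .dropna(subset=['utm_source'])
         .groupby('anonymous_id')['utm_source']
         .first())
print(out)
```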
Groupby and take first value without losing any value in a column
I came up with a solution. I'm using groupby and then checking each group to decide whether to deduplicate it or not. It seems highly inefficient but it does the trick. If someone can come up with a better solution, I will gladly accept it.
First, we add columns counting the number of unique IDs per group, and boolean columns indicating, for each row, whether an ID is present.
Lastly, we add a count of the number of IDs per row (useful for sorting the dataframe later).
df["first_id_count"] = df.groupby(["name", "city"])["my_first_id"].transform('nunique')
df["second_id_count"] = df.groupby(["name", "city"])["my_second_id"].transform('nunique')
def check_if_id_is_present(x):
    try:
        if not pd.isnull(x):
            return True
        else:
            return False
    except:
        return False
df["my_first_id_present"] = df["my_first_id"].apply(check_if_id_is_present)
df["my_second_id_present"] = df["my_second_id"].apply(check_if_id_is_present)
def create_count_ids_per_row(x):
    try:
        count = 0
        if not pd.isnull(x[0]):
            count += 1
        if not pd.isnull(x[1]):
            count += 1
        return count
    except:
        return 0
df["ids_count"] = df[["my_first_id", "my_second_id"]].apply(create_count_ids_per_row, axis=1)
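As an aside, the two apply-based helpers above can be replaced with vectorized calls. A sketch, using the same (hypothetical) column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'b'],
    'city': ['x', 'x', 'y'],
    'my_first_id': [1, np.nan, 2],
    'my_second_id': [np.nan, 3, np.nan],
})

# notna() gives the per-cell presence boolean directly...
df['my_first_id_present'] = df['my_first_id'].notna()
df['my_second_id_present'] = df['my_second_id'].notna()
# ...and summing notna() across columns gives the per-row id count.
df['ids_count'] = df[['my_first_id', 'my_second_id']].notna().sum(axis=1)
print(df)
```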
Then, we can start the groupby and iterate over each group.
df_final = pd.DataFrame()
ids_to_deduplicate = ["first_id_count", "second_id_count"]
ids_present = ["my_first_id_present", "my_second_id_present"]
grouped = df.groupby(["name", "city"])

for name, group in grouped:
    if group["first_id_count"].iloc[0] < 2 and group["second_id_count"].iloc[0] < 2:
        # if there are strictly fewer than 2 unique ids of my_first_id and my_second_id
        # then we can safely deduplicate and add to the final dataframe
        df_final = pd.concat([df_final, group.groupby(["name", "city"]).first().reset_index()])
    else:
        # if not, we have to separate the dataframe into 2:
        # one part we want to deduplicate, one part we shouldn't touch
        df_duplicate = group.copy()
        # first, we sort by the number of ids per row
        df_duplicate = df_duplicate.sort_values(by=["ids_count"], ascending=False)
        # and reset the index, for ease of use
        df_duplicate.reset_index(inplace=True)
        df_duplicate = df_duplicate.drop("index", axis=1)
        # rows we want to deduplicate
        rows_to_deduplicate = []
        # rows we want to keep
        rows_to_keep = []
        # create a list with flags for each id column
        flags_list = [False] * len(ids_to_deduplicate)
        for idx in df_duplicate.index:
            flag = False
            # first, we check if one of our flags is already True for an id
            # present on this row; in that case we already kept a row with
            # that id, and this row carries another id, so we keep it to
            # avoid losing any information
            for idx_id, my_id in enumerate(ids_to_deduplicate):
                if flags_list[idx_id] and df_duplicate.loc[idx, ids_present[idx_id]]:
                    # we add it to the rows to keep
                    rows_to_keep.append(idx)
                    flag = True
                    break
            if flag:
                continue
            # now we know we want to deduplicate this row, otherwise we would have flagged it
            for idx_id, my_id in enumerate(ids_to_deduplicate):
                if not flags_list[idx_id] and df_duplicate.loc[idx, ids_present[idx_id]]:
                    # we add it to the rows to deduplicate
                    rows_to_deduplicate.append(idx)
                    # we have to set all the corresponding flags;
                    # there can be several ids on a row, so we loop over them
                    for idx_id_temp, id_to_check in enumerate(ids_to_deduplicate):
                        if df_duplicate.loc[idx, ids_present[idx_id_temp]]:
                            flags_list[idx_id_temp] = True
                    break
        # now we have our 2 separate dataframes
        df_duplicate_keep = df_duplicate.loc[rows_to_keep].copy()
        df_duplicate_deduplicate = df_duplicate.loc[rows_to_deduplicate].copy()
        # we can keep one, deduplicate the other, and concatenate the result
        df_final_duplicate = pd.concat([df_duplicate_keep, df_duplicate_deduplicate.groupby(["name", "city"]).first().reset_index()])
        # and add the result to our final dataframe
        df_final = pd.concat([df_final, df_final_duplicate])
And to clean our mess:
df_final = df_final.drop("ids_count", axis=1)
for col in ids_to_deduplicate:
df_final = df_final.drop(col, axis=1)
for col in ids_present:
df_final = df_final.drop(col, axis=1)
And we have the desired output.
Again, this seems really ugly, so if anyone has a better solution, feel free to share.
How do I get the first non-null value from multiple columns based on another datetime column order and grouped by ID?
(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
.agg(lambda x:x.bfill().head(1)).reset_index())
id site type
0 111 A 1.0
1 222 C 1.0
Note that if you are sure there is at least one non-NaN value per id, then you can do:
(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
.first().reset_index())
id site type
0 111 A 1.0
1 222 C 1.0
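A runnable sketch of the second variant, with hypothetical data shaped to match the output shown. The intermediate .bfill() is left out here because GroupBy.first already skips NaNs within each group:

```python
import numpy as np
import pandas as pd

# Hypothetical data: per id, rows ordered by ts, with NaNs in site/type.
df = pd.DataFrame({
    'id': [111, 111, 222, 222],
    'ts': pd.to_datetime(['2021-01-01', '2021-01-02',
                          '2021-01-01', '2021-01-02']),
    'site': ['A', np.nan, np.nan, 'C'],
    'type': [np.nan, 1.0, 1.0, np.nan],
})

# Sort newest-first, then take the first non-null site/type per id
out = (df.sort_values('ts', ascending=False)
         .groupby('id')[['site', 'type']]
         .first()
         .reset_index())
print(out)
```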
Fill all values in a group with the first non-null value in that group
Edit
The following seems better:
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
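A self-contained run of that dict-and-map approach, on a smaller hypothetical frame (one non-null Value per cluster, as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cluster': [1, 1, 2, 2, 2],
    'Value': ['A', np.nan, np.nan, 'B', np.nan],
})

# Build a cluster -> value dict from the non-null rows, then map it back
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
```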
Original
I can't think of a better way to do this than to iterate over all the rows, but one might exist. First I built your DataFrame:
import pandas as pd
import math
# Build your DataFrame (pd.DataFrame.from_items was removed in pandas 1.0,
# so a plain dict is used here)
df = pd.DataFrame({
    'cluster': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5],
    'Value': [float('nan') for _ in range(20)],
})
df['Value'] = df['Value'].astype(object)
df.at[ 0,'Value'] = 'A'
df.at[ 7,'Value'] = 'B'
df.at[11,'Value'] = 'C'
df.at[14,'Value'] = 'S'
df.at[17,'Value'] = 'A'
Now here's an approach that first creates a nan_map
dict, then sets the values in Value
as specified in the dict.
# Create a dict to map clusters to unique values
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
# nan_map: {1: 'A', 2: 'B', 3: 'C', 4: 'S', 5: 'A'}
# Apply
for i, row in df.iterrows():
    df.at[i, 'Value'] = nan_map[row['cluster']]
print(df)
Output:
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Note: This sets all values based on the cluster and doesn't check for NaN-ness. You may want to experiment with something like:
# Apply
for i, row in df.iterrows():
    if isinstance(df.at[i, 'Value'], float) and math.isnan(df.at[i, 'Value']):
        df.at[i, 'Value'] = nan_map[row['cluster']]
to see which is more efficient (my guess is the former, without the checks).