Dataframe Processing

Processing data in one column using split and an index based on the value in another column (pandas)

Let us try groupby with factorize, then map:

# factorize numbers the distinct labels within each group; map turns that
# position into the corresponding piece of the underscore-split key
df['new'] = (
    df.groupby('Split_key', as_index=False)
      .apply(lambda x: pd.Series(x['label'].factorize()[0])
                         .map(dict(enumerate(x['Split_key'].iloc[0].split('_')))))
      .values
)
df
Out[869]:
   Split_key  label  sub_label new
0      A_B_C      7        NaN   A
1      A_B_C      7        NaN   A
2      A_B_C      8        NaN   B
3      A_B_C      8        NaN   B
4      A_B_C     10        NaN   C
5      A_B_C     10        NaN   C
6      D_E_F      2        NaN   D
7      D_E_F      7        NaN   E
8      D_E_F     15        NaN   F
9      G_H_I      1        NaN   G
10     G_H_I      2        NaN   H
11     G_H_I      3        NaN   I
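If the one-liner is hard to parse, the same operation can be spelled out with a named helper. This is a minimal sketch assuming the sample frame shown above; the map_labels name is mine, not from the original answer, and on newer pandas the grouping column appearing inside apply may raise a deprecation warning.

import pandas as pd

df = pd.DataFrame({
    "Split_key": ["A_B_C"] * 6 + ["D_E_F"] * 3 + ["G_H_I"] * 3,
    "label": [7, 7, 8, 8, 10, 10, 2, 7, 15, 1, 2, 3],
})

def map_labels(group):
    # factorize numbers the distinct labels 0, 1, 2, ... in order of appearance
    codes = group["label"].factorize()[0]
    # the n-th distinct label is mapped to the n-th piece of the underscore-split key
    parts = group["Split_key"].iloc[0].split("_")
    return pd.Series(codes, index=group.index).map(dict(enumerate(parts)))

df["new"] = df.groupby("Split_key", group_keys=False).apply(map_labels)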

Processing a dataframe in parallel

Iterating over rows isn't good practice, and there are usually alternatives with groupby/transform aggregations etc., but if in the worst case you really do need to iterate, this answer shows how. You also might not need to reimplement everything yourself: you can use libraries like Dask, which is built on top of pandas.

But just to give the idea: you can use multiprocessing (Pool.map) in combination with chunking. Read the CSV in chunks (or build chunks yourself, as shown at the end of the answer) and map them to the pool; while processing each chunk, add new rows (or collect them in a list and build a new chunk) and return it from the function.

At the end, once all the workers have finished, combine the returned dataframes.

import pandas as pd
import numpy as np
import multiprocessing

def process_chunk(df_chunk):
    for index, row in df_chunk.reset_index(drop=True).iterrows():
        # your logic for updating this chunk or building a new chunk goes here
        print(row)
        print("index is " + str(index))
    # if you modified df_chunk in place, return it; if instead you appended
    # rows to a list_of_rows, build a new dataframe with them and return
    # pd.DataFrame(list_of_rows)
    return df_chunk

if __name__ == '__main__':
    # use all available cores, or specify the number you want as an argument;
    # for example if you have 12 cores, leave 1 or 2 free for other things
    pool = multiprocessing.Pool(processes=10)

    results = pool.map(process_chunk,
                       [c for c in pd.read_csv("your_csv.csv", chunksize=7150)])
    pool.close()
    pool.join()

    # make a new df by concatenating the processed chunks
    concatdf = pd.concat(results, axis=0, ignore_index=True)

Note: Instead of reading the CSV directly, you can build the chunks yourself with the same logic. To pick a chunk size you might use something like round((length of df) / (number of available cores - 2)), e.g. 100000 / 14 ≈ 7143, rounded up to 7150 rows per chunk:

results = pool.map(process_chunk,
                   [df[c:c + chunk_size] for c in range(0, len(df), chunk_size)])
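As a rough sketch of that chunk-size arithmetic (multiprocessing.cpu_count is standard library; the n_workers and chunk_size names and the example file are just illustrative):

import math
import multiprocessing

import pandas as pd

df = pd.read_csv("your_csv.csv")

# leave a couple of cores free for other work
n_workers = max(multiprocessing.cpu_count() - 2, 1)
# e.g. 100000 rows / 14 workers -> 7143 rows per chunk
chunk_size = math.ceil(len(df) / n_workers)

chunks = [df[c:c + chunk_size] for c in range(0, len(df), chunk_size)]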

Process columns in a pandas DataFrame

new_col = []
for idx, row in df.iterrows():
    val1 = row["Col1"]
    val2 = row["Col2"]
    val3 = row["Col3"]

    new_val2 = f",[{val2}]" if pd.notna(val2) else ""
    new_val3 = f",[{val3}]" if pd.notna(val3) else ""

    val4 = f"{val1}{new_val2}{new_val3}"
    new_col.append(val4)

df["Col4"] = new_col

Maybe my answer is not the most "computationally efficient", but if your dataset is 20k rows it will be fast enough!
I also think it is very easy to read and easy to adapt to different scenarios!
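If the frame ever grows well past 20k rows, the same concatenation can be written without iterrows. This is a sketch assuming the same Col1/Col2/Col3 columns as above:

import pandas as pd

df["Col4"] = (
    df["Col1"].astype(str)
    + df["Col2"].map(lambda v: f",[{v}]" if pd.notna(v) else "")
    + df["Col3"].map(lambda v: f",[{v}]" if pd.notna(v) else "")
)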

Line-by-line processing of pandas DataFrame

There are vectorized functions for just this purpose that will be much faster:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, np.nan, np.nan], b=[0, 1, 0, np.nan]))
df.ffill()

# df
     a    b
0  1.0  0.0
1  1.0  1.0
2  NaN  0.0
3  NaN  NaN

# output
     a    b
0  1.0  0.0
1  1.0  1.0
2  1.0  0.0
3  1.0  0.0
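The same idea extends to a single column, a fill limit, or a backward fill; ffill, bfill, and their limit argument are standard pandas methods:

df["a"].ffill()      # fill only column "a"
df.ffill(limit=1)    # carry each value forward at most one row
df.bfill()           # or fill backwards instead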

