Dataframe Processing

Processing data in one column using split and an index based on the value in another column (pandas)

Let us try groupby with factorize, then map:

# factorize numbers the distinct labels within each group; map turns that
# position into the corresponding piece of the underscore-split key
df['new'] = (
    df.groupby('Split_key', as_index=False)
      .apply(lambda x: pd.Series(x['label'].factorize()[0])
                         .map(dict(enumerate(x['Split_key'].iloc[0].split('_')))))
      .values
)
df
Out[869]:
   Split_key  label  sub_label new
0      A_B_C      7        NaN   A
1      A_B_C      7        NaN   A
2      A_B_C      8        NaN   B
3      A_B_C      8        NaN   B
4      A_B_C     10        NaN   C
5      A_B_C     10        NaN   C
6      D_E_F      2        NaN   D
7      D_E_F      7        NaN   E
8      D_E_F     15        NaN   F
9      G_H_I      1        NaN   G
10     G_H_I      2        NaN   H
11     G_H_I      3        NaN   I
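If the one-liner is hard to parse, the same operation can be spelled out with a named helper. This is a minimal sketch assuming the sample frame shown above; the map_labels name is mine, not from the original answer, and on newer pandas the grouping column appearing inside apply may raise a deprecation warning.

import pandas as pd

df = pd.DataFrame({
    "Split_key": ["A_B_C"] * 6 + ["D_E_F"] * 3 + ["G_H_I"] * 3,
    "label": [7, 7, 8, 8, 10, 10, 2, 7, 15, 1, 2, 3],
})

def map_labels(group):
    # factorize numbers the distinct labels 0, 1, 2, ... in order of appearance
    codes = group["label"].factorize()[0]
    # the n-th distinct label is mapped to the n-th piece of the underscore-split key
    parts = group["Split_key"].iloc[0].split("_")
    return pd.Series(codes, index=group.index).map(dict(enumerate(parts)))

df["new"] = df.groupby("Split_key", group_keys=False).apply(map_labels)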

Processing a dataframe in parallel

Iterating over rows isn't good practice, and there are usually alternatives with groupby/transform aggregations etc., but if in the worst case you really do need to iterate, this answer shows how. You also might not need to reimplement everything yourself: you can use libraries like Dask, which is built on top of pandas.

But just to give the idea: you can use multiprocessing (Pool.map) in combination with chunking. Read the CSV in chunks (or build chunks yourself, as shown at the end of the answer) and map them to the pool; while processing each chunk, add new rows (or collect them in a list and build a new chunk) and return it from the function.

At the end, once all the workers have finished, combine the returned dataframes.

import pandas as pd
import numpy as np
import multiprocessing

def process_chunk(df_chunk):
    for index, row in df_chunk.reset_index(drop=True).iterrows():
        # your logic for updating this chunk or building a new chunk goes here
        print(row)
        print("index is " + str(index))
    # if you modified df_chunk in place, return it; if instead you appended
    # rows to a list_of_rows, build a new dataframe with them and return
    # pd.DataFrame(list_of_rows)
    return df_chunk

if __name__ == '__main__':
    # use all available cores, or specify the number you want as an argument;
    # for example if you have 12 cores, leave 1 or 2 free for other things
    pool = multiprocessing.Pool(processes=10)

    results = pool.map(process_chunk,
                       [c for c in pd.read_csv("your_csv.csv", chunksize=7150)])
    pool.close()
    pool.join()

    # make a new df by concatenating the processed chunks
    concatdf = pd.concat(results, axis=0, ignore_index=True)

Note: Instead of reading the CSV directly, you can build the chunks yourself with the same logic. To pick a chunk size you might use something like round((length of df) / (number of available cores - 2)), e.g. 100000 / 14 ≈ 7143, rounded up to 7150 rows per chunk:

results = pool.map(process_chunk,
                   [df[c:c + chunk_size] for c in range(0, len(df), chunk_size)])
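As a rough sketch of that chunk-size arithmetic (multiprocessing.cpu_count is standard library; the n_workers and chunk_size names and the example file are just illustrative):

import math
import multiprocessing

import pandas as pd

df = pd.read_csv("your_csv.csv")

# leave a couple of cores free for other work
n_workers = max(multiprocessing.cpu_count() - 2, 1)
# e.g. 100000 rows / 14 workers -> 7143 rows per chunk
chunk_size = math.ceil(len(df) / n_workers)

chunks = [df[c:c + chunk_size] for c in range(0, len(df), chunk_size)]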

Process columns in a pandas DataFrame

new_col = []
for idx, row in df.iterrows():
    val1 = row["Col1"]
    val2 = row["Col2"]
    val3 = row["Col3"]

    new_val2 = f",[{val2}]" if pd.notna(val2) else ""
    new_val3 = f",[{val3}]" if pd.notna(val3) else ""

    val4 = f"{val1}{new_val2}{new_val3}"
    new_col.append(val4)

df["Col4"] = new_col

Maybe my answer is not the most "computationally efficient", but if your dataset is 20k rows it will be fast enough!
I also think it is very easy to read and easy to adapt to different scenarios!
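If the frame ever grows well past 20k rows, the same concatenation can be written without iterrows. This is a sketch assuming the same Col1/Col2/Col3 columns as above:

import pandas as pd

df["Col4"] = (
    df["Col1"].astype(str)
    + df["Col2"].map(lambda v: f",[{v}]" if pd.notna(v) else "")
    + df["Col3"].map(lambda v: f",[{v}]" if pd.notna(v) else "")
)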

Line-by-line processing of pandas DataFrame

There are vectorized functions for just this purpose that will be much faster:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, np.nan, np.nan], b=[0, 1, 0, np.nan]))
df.ffill()

# df
     a    b
0  1.0  0.0
1  1.0  1.0
2  NaN  0.0
3  NaN  NaN

# output
     a    b
0  1.0  0.0
1  1.0  1.0
2  1.0  0.0
3  1.0  0.0
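The same idea extends to a single column, a fill limit, or a backward fill; ffill, bfill, and their limit argument are standard pandas methods:

df["a"].ffill()      # fill only column "a"
df.ffill(limit=1)    # carry each value forward at most one row
df.bfill()           # or fill backwards instead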

