Progress Indicator During Pandas Operations

Progress indicator during pandas operations

Due to popular demand, I've added pandas support in tqdm (pip install "tqdm>=4.9.0"). Unlike the other answers, this will not noticeably slow pandas down -- here's an example for DataFrameGroupBy.progress_apply:

import pandas as pd
import numpy as np
from tqdm import tqdm
# from tqdm.auto import tqdm # for notebooks

# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)

In case you're interested in how this works (and how to modify it for your own callbacks), see the examples on GitHub, the full documentation on PyPI, or import the module and run help(tqdm). Other supported functions include map, applymap, aggregate, and transform.
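For instance, a quick sketch (reusing the df and the tqdm.pandas() call from the snippet above) of the progress versions of map and applymap:

# assumes tqdm.pandas() has already been called, as above
df[0].progress_map(lambda v: v % 7)       # Series.map with a progress bar
df.progress_applymap(lambda v: v % 7)     # element-wise applymap with a progress bar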

EDIT


To directly answer the original question, replace:

df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)

with:

from tqdm import tqdm
tqdm.pandas()
df_users.groupby(['userID', 'requestDate']).progress_apply(feature_rollup)

Note: tqdm <= v4.8:
For tqdm versions before 4.9 (i.e. <= 4.8), instead of tqdm.pandas() you had to do:

from tqdm import tqdm, tqdm_pandas
tqdm_pandas(tqdm())

Progress bar for pandas DataFrame multi operations with .agg()

This will print the progress as you go, where progress is measured by the fraction of the groups for which statistics are computed. But I'm not sure how much the loop will slow down your computations.

# note: uses pd/np and the example `df` defined in the second snippet below
agger = {
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: x.mode()}

gcols = ['B']  # columns defining the groups
groupby = df.groupby(gcols)

ngroups = len(groupby)
gfrac = 0.3  # print progress after every `gfrac` fraction of the groups
gfrac_size = max((1, int(ngroups*gfrac)))
groups = []
rows = []
for i, g in enumerate(groupby):

    if (i+1) % gfrac_size == 0:
        print('{:.0f}% complete'.format(100*(i+1)/ngroups))

    gstats = g[1].agg(agger)
    if i == 0:
        if gstats.ndim == 2:
            newcols = gstats.columns.tolist()
        else:
            newcols = gstats.index.tolist()

    groups.append(g[0])
    rows.append(gstats.values.flat)

df3 = pd.DataFrame(np.vstack(rows), columns=newcols)
if len(gcols) == 1:
    df3.index = groups
else:
    df3.index = pd.MultiIndex.from_tuples(groups, names=gcols)
df3 = df3.astype(df[newcols].dtypes)
df3
       C     D  E
1.0  1.5  10.0  a
2.0  3.0  12.0  b
3.0  7.0   8.0  a

An alternative (though somewhat hacky) way is to take advantage of the fact that you supply your own function, lambda x: x.mode(). Since you're already paying a speed penalty for that custom function, you can write a class that stores information about progress. For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
"B":[1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
"C":[1.0, 1.5, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0],
"D":[2.0, 5.0, 3.0, 6.0, 4.0, 2.0, 5.0, 1.0, 2.0],
"E":['a', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'a']})

df2 = df.groupby('B').agg({
'C': 'mean',
'D': 'sum',
'E': lambda x: x.mode()
})
print(df2)

class ModeHack:

    def __init__(self, size=5, N=10):
        self.ix = 0      # rows processed so far
        self.K = 1       # next progress milestone
        self.size = size
        self.N = N

    def mode(self, x):
        # count the rows in this group and print progress at each milestone
        self.ix = self.ix + x.shape[0]
        if self.K * self.size <= self.ix:
            print('{:.0f}% complete'.format(100 * self.ix / self.N))
            self.K += 1

        return x.mode()

    def reset(self):
        self.ix = 0
        self.K = 1

mymode = ModeHack(size=int(.1*df.shape[0]), N=df.shape[0])
mymode.reset()

agger = {
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: mymode.mode(x)}

df3 = df.groupby('B').agg(agger)
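Note that ModeHack keeps a cumulative row count, so call mymode.reset() again before reusing the same instance on another groupby; otherwise the printed percentages will be wrong.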

TQDM on pandas df.describe()

You can use it like this:

tqdm.pandas(desc="my bar!")
df.progress_apply(lambda x: x.describe())

It doesn't seem especially useful in this particular case, though.
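The bar advances once per column here, so it only becomes informative when the per-column work is genuinely expensive. As a small sketch of that (my own variation, assuming df has numeric columns):

tqdm.pandas(desc="per-column quantiles")
# one bar tick per numeric column
quantiles = df.select_dtypes("number").progress_apply(
    lambda col: col.quantile([0.01, 0.25, 0.5, 0.75, 0.99])
)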

A progress bar for my function (Python, pandas)

You can easily do this with Dask. For example:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

ddf = dd.read_csv(path, blocksize=1e+6)

with ProgressBar():
    df = ddf.compute()

[########################################] | 100% Completed | 37.0s

You will then see a progress bar for the file-reading process.
The blocksize parameter controls the size of the chunks the file is read in; tuning it can give you better performance. In addition, Dask uses several threads for reading by default, which speeds up the reading itself.
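As a further sketch (the path and column names here are placeholders, not from the original answer), the same ProgressBar also covers any computation you trigger afterwards, and blocksize can be given as a human-readable string:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# "data.csv", "key" and "value" are hypothetical names; adjust to your data
ddf = dd.read_csv("data.csv", blocksize="16MB")  # larger blocks -> fewer, bigger tasks

with ProgressBar():
    result = ddf.groupby("key")["value"].mean().compute()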

How can you show progress bar while iterating over a pandas dataframe

I would have left this as a comment, but the reason you probably want a progress bar in the first place is that the loop takes a long time, and iterrows() is a slow way to do operations in pandas.

I would suggest you use apply and avoid iterrows().

If you want to keep using iterrows(), just include a counter that counts up to the number of rows, df.shape[0].
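Here is a minimal sketch (not from the original answer) of both options: a manual counter, and tqdm wrapped around iterrows() with an explicit total (needed because iterrows() is a generator with no length):

import pandas as pd
from tqdm import tqdm

df = pd.DataFrame({"x": range(10_000)})  # placeholder data

# Option 1: manual counter, printed every 1000 rows
for i, (idx, row) in enumerate(df.iterrows(), start=1):
    # ... your per-row work here ...
    if i % 1000 == 0:
        print(f"{i}/{df.shape[0]} rows processed")

# Option 2: let tqdm draw the bar
for idx, row in tqdm(df.iterrows(), total=df.shape[0]):
    pass  # ... your per-row work here ...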

How to use tqdm with pandas in a jupyter notebook?

You can use:

tqdm_notebook().pandas(*args, **kwargs)

This is because tqdm_notebook has a delayed adapter, so it's necessary to instantiate it before accessing its methods (including class methods).

In the future (>v5.1), you should be able to use a more uniform API:

tqdm_pandas(tqdm_notebook, *args, **kwargs)
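With current tqdm releases (a note beyond the original answer), the simplest notebook-friendly route is the tqdm.auto module, which picks the widget-based bar when one is available:

from tqdm.auto import tqdm  # falls back to the console bar outside notebooks

tqdm.pandas(desc="my bar!")
df.progress_apply(lambda x: x**2)  # assumes a numeric DataFrame df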

Is it possible to use tqdm for pandas merge operation?

tqdm supports pandas and various operations within it. For merging two large dataframes and showing the progress, you could do it this way:

import numpy as np
import pandas as pd
from tqdm import tqdm

df1 = pd.DataFrame({'lkey': 1000*['a', 'b', 'c', 'd'],
                    'lvalue': np.random.randint(0, int(1e8), 4000)})
df2 = pd.DataFrame({'rkey': 1000*['a', 'b', 'c', 'd'],
                    'rvalue': np.random.randint(0, int(1e8), 4000)})

# this is how you activate the pandas features in tqdm
tqdm.pandas()
# call the progress_apply feature with a dummy lambda
df1.merge(df2, left_on='lkey', right_on='rkey').progress_apply(lambda x: x)
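Note that the dummy lambda above only shows progress for the apply that runs after the merge has finished. If you want a bar over the merge work itself, one workaround (my own sketch, not a tqdm feature) is to split one frame into row chunks and merge chunk by chunk:

def merge_with_progress(left, right, chunks=20, **merge_kwargs):
    # merge `left` against `right` in row chunks so tqdm can track the work
    pieces = np.array_split(left, chunks)
    return pd.concat(
        (piece.merge(right, **merge_kwargs) for piece in tqdm(pieces, desc="merge")),
        ignore_index=True,
    )

merged = merge_with_progress(df1, df2, left_on='lkey', right_on='rkey')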

More details are available on this thread:
Progress indicator during pandas operations (python)


