Progress indicator during pandas operations
Due to popular demand, I've added pandas support in tqdm (pip install "tqdm>=4.9.0"). Unlike the other answers, this will not noticeably slow pandas down. Here's an example for DataFrameGroupBy.progress_apply:
import pandas as pd
import numpy as np
from tqdm import tqdm
# from tqdm.auto import tqdm # for notebooks
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)
In case you're interested in how this works (and how to modify it for your own callbacks), see the examples on GitHub, the full documentation on PyPI, or import the module and run help(tqdm). Other supported functions include map, applymap, aggregate, and transform.
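As an illustration of one of the other supported functions, here is a minimal sketch using progress_map on a Series. The sample data and the desc label are made up for the example.

```python
import pandas as pd
from tqdm import tqdm

# Register the progress_* methods on pandas objects.
tqdm.pandas(desc="squaring")  # desc is an optional label for the bar

s = pd.Series(range(5))  # toy data, assumed for illustration

# Same result as s.map(...), but a progress bar is drawn while it runs.
squared = s.progress_map(lambda v: v * v)
```

The other progress_* methods should follow the same pattern: call tqdm.pandas() once, then swap the pandas method for its progress_ counterpart.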
EDIT
To directly answer the original question, replace:
df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)
with:
from tqdm import tqdm
tqdm.pandas()
df_users.groupby(['userID', 'requestDate']).progress_apply(feature_rollup)
Note: tqdm <= v4.8:
For versions of tqdm below 4.8, instead of tqdm.pandas() you had to do:
from tqdm import tqdm, tqdm_pandas
tqdm_pandas(tqdm())
How do I use tqdm to show progress bars when using read_csv in a Jupyter notebook using jupyterlab
Yes. You could abuse any of the arguments that accept a callable, so that it gets called for each row:
import pandas as pd
from tqdm.auto import tqdm
with tqdm() as bar:
    # do not skip any of the rows, but update the progress bar instead
    pd.read_csv('data.csv', skiprows=lambda x: bar.update(1) and False)
If you use Linux, you can get the total number of lines to get a more meaningful progress bar:
from tqdm.auto import tqdm
lines_number = !cat 'data.csv' | wc -l
with tqdm(total=int(lines_number[0])) as bar:
    pd.read_csv('data.csv', skiprows=lambda x: bar.update(1) and False)
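If the shell is not available (for example on Windows, or outside IPython), the line count could also be obtained in plain Python. This is only a sketch: count_lines is a hypothetical helper, not part of tqdm or pandas, and it counts newline characters the way wc -l does.

```python
import os
import tempfile

def count_lines(path, chunk_size=1 << 20):
    """Count newline characters by reading the file in binary chunks."""
    count = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

# Demonstration on a small temporary file: one header line plus three rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n5,6\n")
    tmp_path = tmp.name

n_lines = count_lines(tmp_path)
os.remove(tmp_path)
```

The result could then be passed as the total, e.g. tqdm(total=count_lines('data.csv')). Note that a file whose last line has no trailing newline is counted one short, just as with wc -l.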
But if you do not like for-loops, you may also dislike context managers. You could get away with:
def none_but_please_show_progress_bar(*args, **kwargs):
    bar = tqdm(*args, **kwargs)
    def checker(x):
        bar.update(1)
        return False
    return checker
pd.read_csv('data.csv', skiprows=none_but_please_show_progress_bar())
But I find it less stable, so I recommend using the context-manager-based approach.
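A different route, not part of the answer above, is to read the file in chunks and update the bar once per chunk. The sketch below assumes an in-memory CSV and a chunk size of 100 rows purely for illustration; with a real file you would pass the path instead of the StringIO object.

```python
import io

import pandas as pd
from tqdm import tqdm

# Toy CSV with 1000 data rows, built in memory so the example is self-contained.
csv_text = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

chunks = []
with tqdm(total=1000, unit="rows") as bar:
    # With chunksize set, read_csv yields DataFrames of up to 100 rows each.
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
        chunks.append(chunk)
        bar.update(len(chunk))

# Stitch the pieces back into a single DataFrame.
df = pd.concat(chunks, ignore_index=True)
```

This avoids calling a Python function for every row, at the cost of a coarser bar.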
How to use tqdm with pandas in a jupyter notebook?
You can use:
tqdm_notebook().pandas(*args, **kwargs)
This is because tqdm_notebook has a delayer adapter, so it's necessary to instantiate it before accessing its methods (including class methods).
In the future (>v5.1), you should be able to use a more uniform API:
tqdm_pandas(tqdm_notebook, *args, **kwargs)
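In more recent tqdm releases, tqdm_notebook is deprecated in favour of tqdm.notebook.tqdm, and tqdm.auto picks the notebook widget when one is available. Assuming such a release is installed, a minimal sketch looks like this (the column names and data are made up):

```python
import pandas as pd
from tqdm.auto import tqdm  # falls back to the console bar outside notebooks

# Register progress_* methods using whichever bar tqdm.auto selected.
tqdm.pandas(desc="lengths")

df = pd.DataFrame({"text": ["a", "bb", "ccc"]})  # toy data for illustration
df["length"] = df["text"].progress_apply(len)
```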
Is it possible to use tqdm for pandas merge operation?
tqdm supports pandas and various operations within it. For merging two large dataframes and showing the progress, you could do it this way:
import numpy as np
import pandas as pd
from tqdm import tqdm
df1 = pd.DataFrame({'lkey': 1000*['a', 'b', 'c', 'd'], 'lvalue': np.random.randint(0, int(1e8), 4000)})
df2 = pd.DataFrame({'rkey': 1000*['a', 'b', 'c', 'd'], 'rvalue': np.random.randint(0, int(1e8), 4000)})
# this is how you activate the pandas features in tqdm
tqdm.pandas()
# call progress_apply with a dummy lambda on the merged result
# (the bar tracks this apply step, which runs after the merge has finished)
df1.merge(df2, left_on='lkey', right_on='rkey').progress_apply(lambda x: x)
More details are available on this thread:
Progress indicator during pandas operations (python)
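If the goal is a bar that advances while the merge itself runs, one possible workaround is to merge the left frame in slices and wrap the loop in tqdm. This is a sketch under assumptions: the chunk size of 500 and the toy frames are made up, and slicing only the left side like this is meant for an inner (or left) join.

```python
import numpy as np
import pandas as pd
from tqdm import tqdm

df1 = pd.DataFrame({"lkey": 1000 * ["a", "b", "c", "d"],
                    "lvalue": np.arange(4000)})
df2 = pd.DataFrame({"rkey": ["a", "b", "c", "d"],
                    "rvalue": [1, 2, 3, 4]})

chunk_size = 500  # assumed value; tune it to the size of the data
parts = []
# Each iteration merges one slice of df1, so the bar moves as the work proceeds.
for start in tqdm(range(0, len(df1), chunk_size), desc="merging"):
    part = df1.iloc[start:start + chunk_size]
    parts.append(part.merge(df2, left_on="lkey", right_on="rkey"))

merged = pd.concat(parts, ignore_index=True)
```

The rows may come out in a different order than a single merge call would produce, so sort afterwards if the order matters.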
How to use tqdm progress bar in dask_cudf and cudf
Until progress_apply is available, you would have to implement an equivalent yourself (e.g. using apply_chunks). Just a sketch of the code:
from tqdm import tqdm
full_size = 100
t = tqdm(total=full_size)
def chunks_generator():
    chunk_size = 5
    for s in range(0, full_size, chunk_size):
        yield s
        # advance the bar by one chunk once the next offset is requested
        t.update(chunk_size)
df.apply_chunks(..., chunks=chunks_generator())
TQDM on pandas df.describe()
You can use it like this:
tqdm.pandas(desc="my bar!")
df.progress_apply(lambda x: x.describe())
Although it doesn't seem to be useful.