Why does concatenation of DataFrames get exponentially slower?
Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.
`pd.concat` returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames has to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each `x` has size 1):
super_x = pd.concat([super_x, x], axis=0)
| iteration | size of old super_x | size of x | copying required |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 2 | 2 | 1 | 3 |
| ... | ... | ... | ... |
| N-1 | N-1 | 1 | N |
1 + 2 + 3 + ... + N = N(N+1)/2, so O(N**2) copies are required to complete the loop.
Now consider:
super_x = []
for i, df_chunk in enumerate(df_list):
    x, y = preprocess_data(df_chunk)
    super_x.append(x)
super_x = pd.concat(super_x, axis=0)
Appending to a list is an O(1) operation and does not require copying. Then there is a single call to `pd.concat` after the loop is done. This call to `pd.concat` requires N copies to be made, since `super_x` contains N DataFrames of size 1. So when constructed this way, `super_x` requires only O(N) copies.
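To see the difference empirically, here is a minimal timing sketch (the chunk count of 2000 is an arbitrary demo size):
import time
import pandas as pd

N = 2000  # arbitrary number of one-row chunks for the demo
chunks = [pd.DataFrame({'a': [i]}) for i in range(N)]

# Quadratic: concatenating inside the loop copies the growing frame every time
t0 = time.time()
quad = chunks[0]
for chunk in chunks[1:]:
    quad = pd.concat([quad, chunk], axis=0)
print('concat inside loop:', time.time() - t0)

# Linear: collect the chunks in a list and concatenate once
t0 = time.time()
parts = []
for chunk in chunks:
    parts.append(chunk)
lin = pd.concat(parts, axis=0)
print('concat once after loop:', time.time() - t0)
The gap between the two timings widens as N grows, matching the O(N**2) versus O(N) analysis above.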
Concatenating dataframes in a loop is very slow
You need to append the DataFrames to your list, not the raw data as a list. Inside the loop, try:
interimdf.append(pd.DataFrame(listdf))
Then, outside of your loop:
pd.concat(interimdf)
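Put together, a minimal sketch of that pattern (interimdf and listdf come from the question; read_chunks is a hypothetical stand-in for however each chunk of rows is produced):
import pandas as pd

interimdf = []  # collects one DataFrame per chunk, not raw lists
for listdf in read_chunks():  # hypothetical generator yielding row data per iteration
    interimdf.append(pd.DataFrame(listdf))

result = pd.concat(interimdf)  # single concatenation after the loop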
Merging 1300 data frames into a single frame becomes really slow
The reason your loop slows down is that at each `.append()`, the DataFrame has to create a copy in order to allocate more memory, as described above. If it all fits in memory, you can first fill a fixed-size list (1300 entries) with the individual DataFrames, and then call `df = pd.concat(list_of_dataframes)` once, which should avoid the issue you are having right now. Your code could be adjusted as follows:
import os
import pandas as pd

lst = [None for _ in range(1300)]  # pre-allocate one slot per file
for i, filename in enumerate(os.listdir(filepath)):  # filepath is the directory holding the 1300 files
    file_path = os.path.join(filepath, filename)
    df = pd.read_csv(file_path, index_col=0)
    # reshape each file into two columns: the values under 'Data', plus a 'Source' label per original column
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'})
                    for col in df])
    lst[i] = df
frame = pd.concat(lst)
How to concat thousands of pandas dataframes generated by a for loop efficiently?
You can create a list of DataFrames and then call `concat` only once:
dfs = []
for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    dfs.append(df)
combined = pd.concat(dfs)
Merging 1300 data frames into a single frame column-wise becomes really slow
The reason that your code is slowing down is the same issue that is in your linked question: quadratic copy. In each loop, you are copying the entire existing dataframe plus some new data. The solution is to store all of the individual dataframes in a list, and then concatenate once all files have been read.
frame = []
for filename in os.listdir(filepath):  # filepath has the rest of the files
    file_path = os.path.join(filepath, filename)
    df = (pd.read_csv(file_path, index_col=0)
          .groupby(['Date']).first()
          .add_prefix(f"{filename}-"))
    frame.append(df)
frame = pd.concat(frame, axis=1)
To illustrate the concept with some sample data:
df1 = pd.DataFrame(
    {'Date': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03'],
     'A': [4, 5, 55, 6],
     'B': [7, 8, 85, 9]}
).set_index('Date')

df2 = pd.DataFrame(
    {'Date': ['2020-01-02', '2020-01-03', '2020-01-04'],
     'A': [40, 50, 60],
     'C': [70, 80, 90]}
).set_index('Date')

frame = []
for n, df in enumerate([df1, df2]):
    df = (df.groupby(level=['Date']).first()
            .add_prefix(f"{n}-"))
    frame.append(df)
frame = pd.concat(frame, axis=1)
>>> frame
0-A 0-B 1-A 1-C
2020-01-01 4.0 7.0 NaN NaN
2020-01-02 5.0 8.0 40.0 70.0
2020-01-03 6.0 9.0 50.0 80.0
2020-01-04 NaN NaN 60.0 90.0
Fast concatenation of a large number of homogeneous dataframes
Here is an example of using `pd.HDFStore` to append many tables together.
import pandas as pd
import numpy as np
from time import time
# your tables
# =========================================
columns = ['col{}'.format(i) for i in range(100)]
data = np.random.randn(100000).reshape(1000, 100)
df = pd.DataFrame(data, columns=columns)
# many tables, generator
def get_generator(df, n=1000):
    for x in range(n):
        yield df

table_reader = get_generator(df, n=1000)
# processing
# =========================================
# create an HDF5 store, compression level 5 (1-9, 9 is extreme)
h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5', complevel=5, complib='blosc')
h5_file
# <class 'pandas.io.pytables.HDFStore'>
# File path: /home/Jian/Downloads/my_hdf5_file.h5
# Empty
t0 = time()
# loop over your df
counter = 1
for frame in table_reader:
    print('Appending Table {}'.format(counter))
    h5_file.append('big_table', frame, complevel=5, complib='blosc')
    counter += 1
t1 = time()
# Appending Table 1
# Appending Table 2
# ...
# Appending Table 999
# Appending Table 1000
print(t1 - t0)
# 41.6630880833
# check our hdf5_file
h5_file
# <class 'pandas.io.pytables.HDFStore'>
# File path: /home/Jian/Downloads/my_hdf5_file.h5
# /big_table frame_table (typ->appendable,nrows->1000000,ncols->100,indexers->[index])
# close hdf5
h5_file.close()
# very fast to retrieve your data in any future IPython session
h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5')
%time my_big_table = h5_file['big_table']
# CPU times: user 217 ms, sys: 1.11 s, total: 1.33 s
# Wall time: 1.89 s
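As a side note, `pd.read_hdf` gives a one-line way to load the stored table back without managing the store object yourself (a small sketch using the same path and key as above):
import pandas as pd

# one-shot load; the store is opened and closed internally
my_big_table = pd.read_hdf('/home/Jian/Downloads/my_hdf5_file.h5', 'big_table')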
Speed up concatenation of excel files with Pandas
I don't see anything wrong with your pandas code; a 300 MB Excel file might just be a lot for Pandas to handle! Here are some approaches I'd take:
Tactic 1. Investigate
If I were you, my next step in debugging this would be to throw some `print(datetime.now())` statements into the loop, to see whether it's the reading, the concatenating, or the `.to_excel` call that's taking the time; a small sketch of this follows below. That way you may be able to narrow down the problem. Also take a look at your memory usage using appropriate tools for whatever OS you're on.
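A minimal sketch of that instrumentation, assuming the usual read-concat-write loop from the question (excel_files and the output path are placeholders):
from datetime import datetime
import pandas as pd

frames = []
for path in excel_files:  # hypothetical list of input workbook paths
    print(datetime.now(), 'reading', path)
    frames.append(pd.read_excel(path))

print(datetime.now(), 'concatenating')
combined = pd.concat(frames)

print(datetime.now(), 'writing')
combined.to_excel('combined.xlsx')  # placeholder output path
print(datetime.now(), 'done')
Whichever gap between consecutive timestamps is largest tells you which stage to dig into.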
Tactic 2. Try a different tool
Pandas is optimized for scientific computing, and it probably spends quite a bit of time organizing the data for querying and such; ETL isn't its primary purpose. If you only need to concatenate a few sheets, (as much as it pains me to suggest doing something manually!) manual work in Excel itself will likely be the quickest way to go; highly-paid engineers at Microsoft have been tasked with optimizing that. If you need a programmatic approach, it could be worth trying out petl or a similar ETL tool that takes a simpler, more streaming-oriented approach than pandas.
Some example petl code that might do the trick:
import petl

petl.cat(*(
    petl.io.fromxlsx(file)
    for file in ['your.xlsx', 'excel.xlsx', 'files.xlsx']
)).progress().toxlsx('combined.xlsx')  # output filename added; the name is a placeholder