Appending pandas dataframes generated in a for loop
Use pd.concat to merge a list of DataFrames into a single large DataFrame.
import glob
import pandas as pd

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    # store DataFrame in list
    appended_data.append(data)
# see pd.concat documentation for more info
appended_data = pd.concat(appended_data)
# write DataFrame to an excel sheet
# (on a rerun, 'appended.xlsx' itself matches the "*.xlsx" pattern above)
appended_data.to_excel('appended.xlsx')
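To try the pattern without any Excel files on disk, here is a minimal sketch that uses small in-memory frames; the frame contents are made up for illustration. It also passes ignore_index=True, which the snippet above does not, so that the combined frame gets a fresh 0..n-1 index instead of repeating each file's own row numbers.

```python
import pandas as pd

# Stand-ins for the frames that pd.read_excel would return (made-up data).
frames = [
    pd.DataFrame({"name": ["a", "b"], "value": [1, 2]}),
    pd.DataFrame({"name": ["c"], "value": [3]}),
    pd.DataFrame({"name": ["d", "e"], "value": [4, 5]}),
]

# Collect the frames in a list, then concatenate once.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Whether to reset the index depends on whether the original row labels carry meaning; if they do, leave ignore_index at its default.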
Using pandas .append within for loop
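Every time you call append, pandas returns a copy of the original DataFrame plus your new row. In a loop, every iteration copies all the rows accumulated so far, so the total work is O(N^2) and quickly becomes very slow (especially since you have lots of data). Note also that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the call fails outright.
In your case, I would recommend using lists, appending to them, and then calling the DataFrame constructor once at the end.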
a_list = []
b_list = []
for data in my_data:
    a, b = process_data(data)
    a_list.append(a)
    b_list.append(b)
df = pd.DataFrame({'A': a_list, 'B': b_list})
del a_list, b_list
Timings
%%timeit
# DataFrame.append requires pandas < 2.0
data = pd.DataFrame([])
for i in np.arange(0, 10000):
    if i % 2 == 0:
        data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)
1 loops, best of 3: 6.8 s per loop
%%timeit
a_list = []
b_list = []
for i in np.arange(0, 10000):
    if i % 2 == 0:
        a_list.append(i)
        b_list.append(i + 1)
    else:
        a_list.append(i)
        b_list.append(None)
data = pd.DataFrame({'A': a_list, 'B': b_list})
100 loops, best of 3: 8.54 ms per loop
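The timings above rely on the IPython %%timeit magic and on DataFrame.append. As a rough sketch of how the same comparison could be run as a plain script, the following uses the standard timeit module, with pd.concat standing in for append. That substitution is an assumption of this sketch and not part of the original benchmark, and the data is simplified so that every row has both columns. The function names are made up, and no particular timings are implied.

```python
import timeit

import pandas as pd


def grow_in_loop(n):
    # Re-concatenate on every iteration: each step copies all earlier rows.
    data = pd.DataFrame({"A": [0], "B": [1]})
    for i in range(1, n):
        row = pd.DataFrame({"A": [i], "B": [i + 1]})
        data = pd.concat([data, row], ignore_index=True)
    return data


def build_once(n):
    # Accumulate plain Python lists and call the constructor a single time.
    a_list = []
    b_list = []
    for i in range(n):
        a_list.append(i)
        b_list.append(i + 1)
    return pd.DataFrame({"A": a_list, "B": b_list})


n = 500  # kept small so the slow variant finishes quickly
slow = timeit.timeit(lambda: grow_in_loop(n), number=1)
fast = timeit.timeit(lambda: build_once(n), number=1)
print(f"grow in loop: {slow:.4f} s, build once: {fast:.4f} s")
```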
How to append rows in a pandas dataframe in a for loop?
Suppose your data looks like this:
import pandas as pd
import numpy as np

np.random.seed(2015)
df = pd.DataFrame([])
for i in range(5):
    data = dict(zip(np.random.choice(10, replace=False, size=5),
                    np.random.randint(10, size=5)))
    data = pd.DataFrame(list(data.items()))
    data = data.transpose()
    data.columns = data.iloc[0]
    data = data.drop(data.index[[0]])
    df = df.append(data)  # requires pandas < 2.0
print('{}\n'.format(df))
# 0    0    1    2    3    4    5    6    7    8    9
# 1    6  NaN  NaN    8    5  NaN  NaN    7    0  NaN
# 1  NaN    9    6  NaN    2  NaN    1  NaN  NaN    2
# 1  NaN    2    2    1    2  NaN    1  NaN  NaN  NaN
# 1    6  NaN    6  NaN    4    4    0  NaN  NaN  NaN
# 1  NaN    9  NaN    9  NaN    7    1    9  NaN  NaN
Then it could be replaced with
np.random.seed(2015)
data = []
for i in range(5):
    data.append(dict(zip(np.random.choice(10, replace=False, size=5),
                         np.random.randint(10, size=5))))
df = pd.DataFrame(data)
print(df)
In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data) once at the end, outside the loop.
Each call to df.append requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append in a loop very inefficient: the time cost of copying grows quadratically with the number of rows. Not only is the call-DataFrame-once code easier to write, its performance will be much better, because the time cost of copying grows only linearly with the number of rows.
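The quadratic-versus-linear claim can be checked with a little arithmetic. The sketch below is a simplified cost model, not a measurement: it counts only how many existing rows get copied, ignores allocation and per-call overhead, and uses made-up function names.

```python
def rows_copied_when_growing(n):
    # Appending row k (0-based) first copies the k rows already present,
    # so the total is 0 + 1 + ... + (n - 1) = n * (n - 1) / 2.
    return sum(range(n))


def rows_copied_when_built_once(n):
    # A single constructor call copies each row once.
    return n


for n in (10, 100, 1000, 10000):
    print(n, rows_copied_when_growing(n), rows_copied_when_built_once(n))
```

Under this model, multiplying the row count by 10 multiplies the copying work of the growing approach by roughly 100, while the build-once approach grows only tenfold.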
Python Panda append dataframe in loop
As I mentioned in my comment, appending to pandas DataFrames is not considered a good approach. Instead, I suggest storing the data in something more appropriate, such as a file or a database, if you want scalability.
Then you can use pandas for what it was built for, i.e. data analysis, by simply reading the contents of the database or the file into a DataFrame.
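As one possible illustration of the store-first, analyse-later idea, the sketch below writes rows to a SQLite database as they arrive and loads them into pandas in a single call. The table name, column names, and values are invented for the example, and an in-memory database is used only so the snippet is self-contained.

```python
import sqlite3

import pandas as pd

# In-memory database used only for this demo; a real script would pass a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (run INTEGER, score REAL)")

# Rows arrive one at a time; insert each into the database instead of a DataFrame.
for run in range(5):
    conn.execute("INSERT INTO results (run, score) VALUES (?, ?)", (run, run * 0.5))
conn.commit()

# Load everything into pandas in one call when it is time to analyse.
df = pd.read_sql_query("SELECT run, score FROM results ORDER BY run", conn)
conn.close()
print(df)
```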
Now, if you really want to stick with this approach, I suggest using either join or concat to grow your DataFrame as you get more data.
[EDIT]
Example (from one of my scripts):
results = pd.DataFrame()
for result_file in result_files:
    df = parse_results(result_file)
    results = pd.concat([results, df], axis=0).reset_index(drop=True)
parse_results is a function that takes a filename and returns a DataFrame formatted in the right way; it is up to you to make it fit your needs.
How to concat thousands of pandas dataframes generated by a for loop efficiently?
You can create a list of DataFrames and then call concat only once:
dfs = []
for i in range(1, 1000):  # demo only
    df = generate_df()  # assumes generate_df returns the new DataFrame
    dfs.append(df)
combined = pd.concat(dfs)
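A runnable variant of the same idea, under the assumption that the per-iteration frame comes from a function that returns it. The generate_df below is a hypothetical stand-in with made-up columns. The sketch also passes keys= to pd.concat, which is optional but can help when you later need to know which iteration produced a given row.

```python
import pandas as pd


def generate_df(i):
    # Hypothetical stand-in for the function that builds one DataFrame per iteration.
    return pd.DataFrame({"x": [i, i + 1], "y": [i * 10, i * 10 + 1]})


dfs = []
for i in range(1, 4):
    dfs.append(generate_df(i))

# keys= adds an outer index level recording which iteration produced each row.
combined = pd.concat(dfs, keys=[1, 2, 3], names=["iteration", "row"])
print(combined)
```

With the outer level in place, combined.loc[2] returns just the rows generated when i was 2.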
Concatenate pandas DataFrames generated with a loop
pd.concat takes a list of DataFrames. If your loop can build up a list of DataFrames, you can concatenate the whole list once the loop has finished:
data_day_list = []
for day in list_day:
    data_day = df[df.day == day]
    data_day_list.append(data_day)
final_data_day = pd.concat(data_day_list)
Python Append dataframe generated in nested loops
Try:
Change this:
biglist.append(tem_list)
to this:
biglist.append(pd.concat(tem_list))
Remove this line:
biglist1 = [item for sublist in biglist for item in sublist]
Modify this one:
df = pd.concat(biglist1)
to:
df = pd.concat(biglist)
If you have defined column names, you can also create an empty DataFrame outside your loops and append the data directly to it from your inner loop:
# Before loop
colnames = ['y1', 'y2', 'y3']
df = pd.DataFrame(data=None, columns=colnames)
changing your append lines to a single one inside your inner loop:
df = df.append(tem_df)  # requires pandas < 2.0
This removes the need for biglist, tem_list, or pd.concat.
Edit after user comments:
# pseudocode: the loop bounds are elided, as in the question
biglist = []
for i in range (x1,...,x8):
    for j in range ([y1,y2,y3],[y4,..]...[y22,y23,y24]):
        tem_df = pd.DataFrame({'y1':[value1],'y2':[value2],'y3':[value3]},index=[i])
        biglist.append(tem_df)
df = pd.concat(biglist)
print(df)
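Since the loop bounds above are only sketched, here is a small self-contained version of the nested-loop pattern. The outer labels and inner value triples are made up for illustration and do not come from the question.

```python
import pandas as pd

outer_values = ["x1", "x2", "x3"]      # made-up outer loop values
inner_values = [(1, 2, 3), (4, 5, 6)]  # made-up (y1, y2, y3) triples

frames = []
for x in outer_values:
    for y1, y2, y3 in inner_values:
        # One single-row frame per inner iteration, indexed by the outer value.
        frames.append(pd.DataFrame({"y1": [y1], "y2": [y2], "y3": [y3]}, index=[x]))

# A single concat after both loops have finished.
df = pd.concat(frames)
print(df)
```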
Is pd.append() the quickest way to join two dataframes?
When you have multiple appends in series, it is often more efficient to build a list of DataFrames and concatenate it once at the end than to call the DataFrame.append method at each iteration, since every pandas call carries some overhead.
For example,
%%timeit
dfs = []
for i in range(10000):
    tmp1 = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3]])
    dfs.append(tmp1)
pd.concat(dfs)
gives 1.44 s ± 88.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each),
whereas the same implementation using append at each iteration gives
2.81 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
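DataFrame.append, which the slower variant above relies on, was deprecated in pandas 1.4 and removed in pandas 2.0. For code that still has to add a frame or a single row outside a loop, the following sketch shows how such calls can usually be rewritten with pd.concat. The column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
other = pd.DataFrame({"A": [5], "B": [6]})

# Was: df = df.append(other, ignore_index=True)
df = pd.concat([df, other], ignore_index=True)

# Was: df = df.append({"A": 7, "B": 8}, ignore_index=True)
new_row = {"A": 7, "B": 8}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df)
```

Inside a loop the earlier advice still applies: collect the pieces in a list and call pd.concat once.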