Appending pandas dataframes generated in a for loop
Use pd.concat to merge a list of DataFrames into a single big DataFrame.
import glob
import pandas as pd

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    # store DataFrame in list
    appended_data.append(data)
# see pd.concat documentation for more info
appended_data = pd.concat(appended_data)
# write DataFrame to an excel sheet
appended_data.to_excel('appended.xlsx')
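The same pattern can be tried without any Excel files on disk. Here is a minimal sketch, assuming two small in-memory frames stand in for what pd.read_excel would return (the names jan.xlsx and feb.xlsx are made up for illustration). It also shows two optional pd.concat features: ignore_index=True to renumber the rows, and passing a dict so each row stays labeled with its source.

```python
import pandas as pd

# Hypothetical stand-ins for the frames that pd.read_excel would return.
frames = {
    "jan.xlsx": pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]}),
    "feb.xlsx": pd.DataFrame({"id": [3], "value": [30.0]}),
}

# Concatenate once; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat(list(frames.values()), ignore_index=True)

# Passing the dict itself adds an outer index level taken from the keys,
# which keeps track of the file every row came from.
labeled = pd.concat(frames)

print(combined)
print(labeled)
```

Whether you want the renumbered index or the labeled one depends on whether the source file matters later on.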
Using pandas .append within for loop
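Every time you call append, pandas returns a copy of the original DataFrame plus your new row. This is called a quadratic copy: it is an O(N^2) operation that quickly becomes very slow (especially since you have lots of data). Note also that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append-based snippets here only run on older versions.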
In your case, I would recommend using lists, appending to them, and then calling the dataframe constructor.
a_list = []
b_list = []
for data in my_data:
    a, b = process_data(data)
    a_list.append(a)
    b_list.append(b)
df = pd.DataFrame({'A': a_list, 'B': b_list})
del a_list, b_list
Timings
%%timeit
data = pd.DataFrame([])
for i in np.arange(0, 10000):
    if i % 2 == 0:
        data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)
1 loops, best of 3: 6.8 s per loop
%%timeit
a_list = []
b_list = []
for i in np.arange(0, 10000):
    if i % 2 == 0:
        a_list.append(i)
        b_list.append(i + 1)
    else:
        a_list.append(i)
        b_list.append(None)
data = pd.DataFrame({'A': a_list, 'B': b_list})
100 loops, best of 3: 8.54 ms per loop
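The first timing above depends on DataFrame.append, which no longer exists in pandas 2.0 and later. Here is a rough sketch for repeating the comparison on a current version, assuming pd.concat inside the loop as a stand-in for append (it has the same copy-everything behaviour). The function names are made up, n is kept small, and no particular timings are claimed.

```python
import timeit

import pandas as pd


def grow_in_loop(n):
    """Grow a DataFrame one row at a time (re-copies all earlier rows each step)."""
    data = pd.DataFrame({"A": [0], "B": [1]})
    for i in range(1, n):
        row = pd.DataFrame({"A": [i], "B": [i + 1]})
        data = pd.concat([data, row], ignore_index=True)
    return data


def build_once(n):
    """Collect values in plain lists and call the constructor a single time."""
    a_list = []
    b_list = []
    for i in range(n):
        a_list.append(i)
        b_list.append(i + 1)
    return pd.DataFrame({"A": a_list, "B": b_list})


n = 500  # kept small so the slow variant finishes quickly
slow = timeit.timeit(lambda: grow_in_loop(n), number=1)
fast = timeit.timeit(lambda: build_once(n), number=1)
print(f"grow in loop: {slow:.4f} s, build once: {fast:.4f} s")
```

Both functions produce the same frame, so only the way it is built differs.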
How to append rows in a pandas dataframe in a for loop?
Suppose your data looks like this:
import pandas as pd
import numpy as np
np.random.seed(2015)
df = pd.DataFrame([])
for i in range(5):
    data = dict(zip(np.random.choice(10, replace=False, size=5),
                    np.random.randint(10, size=5)))
    data = pd.DataFrame(data.items())
    data = data.transpose()
    data.columns = data.iloc[0]
    data = data.drop(data.index[[0]])
    df = df.append(data)
print('{}\n'.format(df))
# 0   0   1   2   3   4   5   6   7   8   9
# 1   6 NaN NaN   8   5 NaN NaN   7   0 NaN
# 1 NaN   9   6 NaN   2 NaN   1 NaN NaN   2
# 1 NaN   2   2   1   2 NaN   1 NaN NaN NaN
# 1   6 NaN   6 NaN   4   4   0 NaN NaN NaN
# 1 NaN   9 NaN   9 NaN   7   1   9 NaN NaN
Then it could be replaced with
np.random.seed(2015)
data = []
for i in range(5):
    data.append(dict(zip(np.random.choice(10, replace=False, size=5),
                         np.random.randint(10, size=5))))
df = pd.DataFrame(data)
print(df)
In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data) once at the end, outside the loop.
Each call to df.append requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append in a loop very inefficient. The time cost of copying grows quadratically with the number of rows. Not only is the call-DataFrame-once code easier to write, its performance will be much better: the time cost of copying grows linearly with the number of rows.
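To put numbers on quadratic versus linear, here is a small counting sketch. It assumes a simplified cost model in which the only cost is the number of existing rows that get copied; the function names are made up for illustration.

```python
def rows_copied_by_repeated_append(n_rows):
    """Existing rows copied when a frame is grown one row at a time.

    Before the k-th append the frame already holds k - 1 rows, and all of
    them are copied into the newly allocated frame.
    """
    return sum(k - 1 for k in range(1, n_rows + 1))


def rows_copied_by_single_construction(n_rows):
    """Each row is copied once when the frame is built from a list."""
    return n_rows


for n in (10, 100, 10_000):
    print(n, rows_copied_by_repeated_append(n), rows_copied_by_single_construction(n))
```

Under this model, 10,000 rows cost 49,995,000 row copies with repeated appends, against 10,000 when the frame is constructed once.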
Python Panda append dataframe in loop
As I mentioned in my comment, appending to pandas dataframes is not considered a very good approach. Instead, I suggest that you use something more appropriate to store the data, such as a file or a database if you want scalability.
Then you can use pandas for what it is built for, i.e. data analysis, by just reading the contents of the database or the file into a dataframe.
Now, if you really want to stick with this approach, I suggest either join or concat to grow your dataframe as you get more data.
[EDIT]
Example (from one of my scripts):
results = pd.DataFrame()
for result_file in result_files:
    df = parse_results(result_file)
    results = pd.concat([results, df], axis=0).reset_index(drop=True)
parse_results is a function that takes a filename and returns a dataframe formatted in the right way, up to you to make it fit your needs.
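Calling pd.concat inside the loop still re-copies results on every iteration. A variant that concatenates only once might look like the sketch below. Everything concrete in it is an assumption: this parse_results is a hypothetical stand-in that reads CSV text, and the in-memory buffers replace real result files.

```python
import io

import pandas as pd


def parse_results(result_file):
    """Hypothetical parser: reads CSV text from a file-like object."""
    return pd.read_csv(result_file)


# Hypothetical inputs; in-memory buffers stand in for real result files.
result_files = [
    io.StringIO("run,score\n1,0.5\n2,0.7\n"),
    io.StringIO("run,score\n3,0.9\n"),
]

# Collect the parsed frames, then concatenate once after the loop.
frames = [parse_results(result_file) for result_file in result_files]
results = pd.concat(frames, axis=0, ignore_index=True)
print(results)
```

Here ignore_index=True does the job of the reset_index(drop=True) call in the original loop.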
Append strings via dataframes in for loop
The undesired output comes from the fact that you're attempting to append a full pandas.DataFrame to a pandas.Series with the statement
df=Top['A'].append(d['-OE1x-'])
If you change this line to:
df = Top.append(d['-OE1x-'])
df will look like this:
                  A
0             Hello
1             World
0  Appended Item-1x
You may want to pass ignore_index=True as an argument to your call to pandas.DataFrame.append() so that the row containing Appended Item-1x is given a sequential index, i.e. the original index of 0 is not kept, as this would result in two rows with index 0 in df.
e.g.
df = Top.append(d['-OE1x-'], ignore_index=True)
This will give you the following df:
                  A
0             Hello
1             World
2  Appended Item-1x
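On pandas 2.0 and later, DataFrame.append is gone, so the line above raises an AttributeError. A minimal sketch of the same step with pd.concat follows, assuming each value in d is a one-row DataFrame with a column A (the original construction of d is not shown here, so that shape is a guess).

```python
import pandas as pd

Top = pd.DataFrame({'A': ['Hello', 'World']})
# Assumed shape of the original dictionary: each value is a one-row DataFrame.
d = {'-OE1x-': pd.DataFrame({'A': ['Appended Item-1x']})}

# pd.concat with ignore_index=True plays the role of
# Top.append(d['-OE1x-'], ignore_index=True).
df = pd.concat([Top, d['-OE1x-']], ignore_index=True)
print(df)
```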
Alternative Solution
Since it seems that you don't actually make use of each pandas.DataFrame in d outside of appending them as new rows to existing dataframes, it may be a good idea to refactor your code so that each entry in d looks like str: str instead of str: pandas.DataFrame. Using your original code you could achieve this as follows:
import pandas as pd
Top = pd.DataFrame({'A':['Hello', 'World']})
Frst = ['1','2']
Scnd = ['x','y']
d = { f'-OE{num1}{num2}-': f'Appended Item-{num1}{num2}' for num1 in Frst for num2 in Scnd }
df = Top.append({'A': d['-OE1x-']}, ignore_index=True)
This will also provide the desired df:
                  A
0             Hello
1             World
2  Appended Item-1x
However, unlike the previous approach (and your original code), it will be much less memory intensive, as d is not being filled unnecessarily with instances of pandas.DataFrame.
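With d holding plain strings, a single row can also be added by label assignment rather than append, which matters on pandas 2.0 and later where DataFrame.append has been removed. This is only a sketch of one option; it assumes Top has the default 0..n-1 index, so that len(df) is the next unused label.

```python
import pandas as pd

Top = pd.DataFrame({'A': ['Hello', 'World']})
Frst = ['1', '2']
Scnd = ['x', 'y']
d = {f'-OE{num1}{num2}-': f'Appended Item-{num1}{num2}'
     for num1 in Frst for num2 in Scnd}

# Work on a copy so Top itself stays unchanged.
df = Top.copy()
# Assumes the default 0..n-1 index, so len(df) is the next free label.
df.loc[len(df)] = [d['-OE1x-']]
print(df)
```

This is fine for one or two rows; for many rows, collecting them in a list and building the frame once remains the cheaper route.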
Using Pandas Dataframe Append in For Loop
You are missing one line I think:
assetlist = list(df['Asset'].unique())
newdf = pd.DataFrame() # <-- define it as a data frame
for asset in assetlist:
    df_subset = df[df['Asset'] == asset]
    dfcopy = df_subset.copy()
    newdf = newdf.append(dfcopy)
print(newdf)
         date    Asset  Monthly Value
0  2019-01-01  Asset A           2100
1  2019-01-01  Asset A           8100
2  2019-01-01  Asset A           1400
3  2019-02-01  Asset B           1400
4  2019-02-01  Asset B           3100
5  2019-02-01  Asset B           1600
6  2019-03-01  Asset C           2400
7  2019-03-01  Asset C           2100
8  2019-03-01  Asset C           2100
However, an easier way to do this is:
newdf = pd.concat([df.query("Asset == @asset") for asset in assetlist])
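Another way to get the per-asset pieces is to let groupby do the splitting. The sketch below uses a made-up input frame with the assets interleaved, since the original df is not shown.

```python
import pandas as pd

# Hypothetical input with the assets interleaved.
df = pd.DataFrame({
    'date': ['2019-01-01', '2019-02-01', '2019-03-01',
             '2019-01-01', '2019-02-01', '2019-03-01'],
    'Asset': ['Asset A', 'Asset B', 'Asset C',
              'Asset A', 'Asset B', 'Asset C'],
    'Monthly Value': [2100, 1400, 2400, 8100, 3100, 2100],
})

# sort=False keeps the groups in order of first appearance,
# matching the order of df['Asset'].unique().
newdf = pd.concat([group for _, group in df.groupby('Asset', sort=False)])
print(newdf)
```

The rows keep their original index labels; pass ignore_index=True to pd.concat if you would rather have them renumbered.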
Appending Dataframe Inside Loop
You need to create your DataFrame outside the loop. Then each time you create a new DataFrame in the loop you append it to the main one:
df = pd.DataFrame()
while True:
    links = [link.get_attribute('href') for link in driver.find_elements_by_class_name('view-detail-link')]
    for link in links:
        df_links = pd.DataFrame([[link]], columns=['link'])
        df = df.append(df_links)
    try:
        NextPage = driver.find_element_by_xpath('//a[@class="ui-pagination-next ui-goto-page"]')
        driver.execute_script("arguments[0].click();", NextPage)
        time.sleep(3)
    except NoSuchElementException:
        break
print(df.link[0])
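Since each row here is just one string, the links can also be gathered in a plain list and turned into a DataFrame once the pagination loop is over. Here is a sketch of that idea without a browser: the pages list is a hypothetical stand-in for what the Selenium calls would return on each page, and the URLs are invented.

```python
import pandas as pd

# Hypothetical stand-in for the hrefs scraped from each results page.
pages = [
    ['https://example.com/item/1', 'https://example.com/item/2'],
    ['https://example.com/item/3'],
]

all_links = []
for links in pages:
    # Extend a plain list instead of appending one-row DataFrames.
    all_links.extend(links)

# Build the DataFrame once, after the pagination loop has finished.
df = pd.DataFrame({'link': all_links})
print(df.link[0])
```

With a fresh 0..n-1 index, df.link[0] is the first link only. In the append-based loop every appended row keeps index 0, so df.link[0] selects all of them.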
Appending pandas Dataframe to each other inside loop
A general approach is to create a list of dataframes, then pd.concat them once at the end. Something like this:
dfs = []
tables = driver.find_elements_by_tag_name('table')
table = tables[1].get_attribute('outerHTML')
df_i = pd.read_html(table)
df = pd.concat(df_i)
dfs.append(df)
while True:
    try:
        driver.find_element_by_xpath('//a[@title="Next Page"]').click()
        time.sleep(3)
        tables = driver.find_elements_by_tag_name('table')
        table = tables[1].get_attribute('outerHTML')
        df_x = pd.read_html(table)
        df = pd.concat(df_x)
        dfs.append(df)
    except Exception:
        # no "Next Page" link (or any other failure): stop paging
        break
df = pd.concat(dfs)
df.to_excel(f'Handloom/Handloom_{str(lofi)}.xlsx')
Append dataframe columns in a loop to yield a single dataframe
You can use pandas.concat() to concatenate a list of dataframes along the columns by setting axis to 1.
dfs = []
for value in values:
    sorted_tmp_df = highest_value_sorter(value)
    sorted_tmp_df = sorted_tmp_df.drop(columns=['index'])
    dfs.append(sorted_tmp_df)
df_ = pd.concat(dfs, axis=1)
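For a runnable version of the loop above, here is a sketch with a hypothetical highest_value_sorter. The real function is not shown, so this one just sorts a small made-up column and calls reset_index(), which is one way the 'index' column being dropped could arise.

```python
import pandas as pd


def highest_value_sorter(value):
    """Hypothetical stand-in: one sorted column per input value."""
    scores = pd.DataFrame({value: [3, 1, 2]})
    ordered = scores.sort_values(value, ascending=False)
    # reset_index() turns the old index into a column named 'index'.
    return ordered.reset_index()


values = ['x', 'y']
dfs = []
for value in values:
    sorted_tmp_df = highest_value_sorter(value)
    sorted_tmp_df = sorted_tmp_df.drop(columns=['index'])
    dfs.append(sorted_tmp_df)

# axis=1 places the frames side by side, aligned on their row index.
df_ = pd.concat(dfs, axis=1)
print(df_)
```

Because axis=1 aligns on the row index, frames whose indexes differ end up with NaN in the non-matching rows, so resetting the index first matters.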