Appending pandas dataframes generated in a for loop
Use pd.concat to combine a list of DataFrames into a single big DataFrame.
import glob
import pandas as pd

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    # store each DataFrame in the list
    appended_data.append(data)
# see pd.concat documentation for more info
appended_data = pd.concat(appended_data)
# write the combined DataFrame to an Excel sheet
appended_data.to_excel('appended.xlsx')
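If you also need to know which file each row came from, pd.concat can take the frames as a mapping and record the keys in an extra index level. A small sketch, using in-memory frames as stand-ins for the Excel files (the filenames and columns here are made up for illustration):

```python
import pandas as pd

# Two small frames standing in for data read from separate .xlsx files
frames = {
    'jan.xlsx': pd.DataFrame({'sales': [10, 20]}),
    'feb.xlsx': pd.DataFrame({'sales': [30, 40]}),
}

# Passing a dict makes pd.concat use its keys as the outer index level,
# so every row stays tagged with its source file
combined = pd.concat(frames, names=['source', 'row'])
print(combined)
```

The same effect is available with the keys= parameter when you pass a plain list of frames.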
Append data frames together in a for loop
Don't do it inside the loop. Make a list, then combine them outside the loop.
datalist <- list()
for (i in 1:5) {
  # ... make some data
  dat <- data.frame(x = rnorm(10), y = runif(10))
  dat$i <- i  # maybe you want to keep track of which iteration produced it?
  datalist[[i]] <- dat  # add it to your list
}
big_data = do.call(rbind, datalist)
# or big_data <- dplyr::bind_rows(datalist)
# or big_data <- data.table::rbindlist(datalist)
This is a much more R-like way to do things. It can also be substantially faster, especially if you use dplyr::bind_rows or data.table::rbindlist for the final combining of data frames.
Python pandas append dataframe in loop
As I mentioned in my comment, appending to pandas dataframes is not considered a very good approach. Instead, I suggest that you use something more appropriate to store the data, such as a file or a database if you want scalability.
Then you can use pandas for what it's built for, i.e. data analysis, by just reading the contents of the database or the file into a dataframe.
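A minimal sketch of that store-then-read approach with sqlite3 from the standard library; the table name, columns, and sample rows are invented for illustration, and a real script would use a file-backed database instead of ':memory:':

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # stand-in for a file-backed database
conn.execute('CREATE TABLE results (run INTEGER, score REAL)')

# Inside the loop: append rows to the database, not to a DataFrame
for run, score in [(1, 0.5), (2, 0.75), (3, 0.9)]:
    conn.execute('INSERT INTO results VALUES (?, ?)', (run, score))
conn.commit()

# After the loop: read everything back into pandas in one step
df = pd.read_sql('SELECT * FROM results', conn)
print(df)
```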
Now, if you really want to stick with this approach, I suggest either join or concat to grow your dataframe as you get more data.
[EDIT]
Example (from one of my scripts):
results = pd.DataFrame()
for result_file in result_files:
    df = parse_results(result_file)
    results = pd.concat([results, df], axis=0).reset_index(drop=True)
parse_results is a function that takes a filename and returns a dataframe formatted in the right way; it is up to you to make it fit your needs.
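Even with this file-parsing pattern, collecting the frames in a list and concatenating once avoids re-copying the accumulated data on every iteration. A sketch with a stand-in parse_results (the real one is whatever parses your result files; the filenames and columns here are hypothetical):

```python
import pandas as pd

def parse_results(result_file):
    # Stand-in for the real parser: one row per "file" for illustration
    return pd.DataFrame({'file': [result_file], 'value': [len(result_file)]})

result_files = ['a.txt', 'bb.txt', 'ccc.txt']

frames = [parse_results(f) for f in result_files]  # one frame per file
results = pd.concat(frames, ignore_index=True)     # single concat at the end
print(results)
```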
How to append rows in a pandas dataframe in a for loop?
Suppose your data looks like this:
import pandas as pd
import numpy as np
np.random.seed(2015)

df = pd.DataFrame([])
for i in range(5):
    data = dict(zip(np.random.choice(10, replace=False, size=5),
                    np.random.randint(10, size=5)))
    data = pd.DataFrame(list(data.items()))
    data = data.transpose()
    data.columns = data.iloc[0]
    data = data.drop(data.index[[0]])
    df = df.append(data)  # note: DataFrame.append was removed in pandas 2.0

print('{}\n'.format(df))
# 0 0 1 2 3 4 5 6 7 8 9
# 1 6 NaN NaN 8 5 NaN NaN 7 0 NaN
# 1 NaN 9 6 NaN 2 NaN 1 NaN NaN 2
# 1 NaN 2 2 1 2 NaN 1 NaN NaN NaN
# 1 6 NaN 6 NaN 4 4 0 NaN NaN NaN
# 1 NaN 9 NaN 9 NaN 7 1 9 NaN NaN
Then it could be replaced with
np.random.seed(2015)
data = []
for i in range(5):
    data.append(dict(zip(np.random.choice(10, replace=False, size=5),
                         np.random.randint(10, size=5))))

df = pd.DataFrame(data)
print(df)
In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data)
once at the end, outside the loop.
Each call to df.append requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append in a loop very inefficient: the time cost of copying grows quadratically with the number of rows. Not only is the call-DataFrame-once code easier to write, its performance will also be much better, since the time cost of copying grows linearly with the number of rows.
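The quadratic-versus-linear difference is easy to see with a quick timing sketch. Exact numbers depend on your machine and pandas version (the grow-in-loop variant uses pd.concat, since DataFrame.append is gone from pandas 2.0), but the gap is large even at a few thousand rows:

```python
import time
import pandas as pd

n = 2000
rows = [{'a': i, 'b': i * 2} for i in range(n)]

# Quadratic: grow the frame one row at a time, copying everything each pass
t0 = time.perf_counter()
df_slow = pd.DataFrame()
for row in rows:
    df_slow = pd.concat([df_slow, pd.DataFrame([row])], ignore_index=True)
slow = time.perf_counter() - t0

# Linear: collect rows in a list, build the frame once
t0 = time.perf_counter()
df_fast = pd.DataFrame(rows)
fast = time.perf_counter() - t0

print(f'grow-in-loop: {slow:.3f}s, build-once: {fast:.3f}s')
```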
Using pandas .append within for loop
You need to set the variable data equal to the appended data frame. Unlike the append method on a Python list, the pandas append does not happen in place.
import pandas as pd
import numpy as np

data = pd.DataFrame([])
for i in np.arange(0, 4):
    if i % 2 == 0:
        data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

print(data.head())
   A    B
0  0  1.0
1  1  NaN
2  2  3.0
3  3  NaN
NOTE: This answer aims to answer the question as it was posed. It is not, however, the optimal strategy for combining large numbers of dataframes. For a more optimal solution, see the earlier answer that collects all the rows in a list and builds the DataFrame once at the end.
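Also note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same pattern needs pd.concat. A sketch of the example above rewritten for modern pandas:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([])
for i in np.arange(0, 4):
    if i % 2 == 0:
        row = pd.DataFrame({'A': i, 'B': i + 1}, index=[0])
    else:
        row = pd.DataFrame({'A': i}, index=[0])
    # pd.concat returns a new frame; assign it back, just as with .append
    data = pd.concat([data, row], ignore_index=True)

print(data)
```

The same caveat applies: building the frame once outside the loop is still the faster approach.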
Append Dataframes together in for loop
Do not use pd.DataFrame.append in a loop. This is inefficient, as it involves copying data repeatedly. A much better idea is to create a list of dataframes and then concatenate them in a final step outside your loop. Here's some pseudo-code:
symbols = ['WYNN', 'FL', 'TTWO']
cols = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']

dfs = []  # empty list which will hold your dataframes

for symbol in symbols:
    # ... some code that builds stock_data for this symbol
    df = pd.DataFrame(stock_data, columns=cols)
    df['Volume'] = df['Volume'].str.replace(',', '').astype(int)
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df[cols[1:5]] = df[cols[1:5]].apply(pd.to_numeric, errors='coerce')
    df = df.set_index('Date')  # convert 'Date' first, then set it as the index
    dfs.append(df)  # append dataframe to list

res = pd.concat(dfs)  # concatenate list of dataframes
res.to_excel('stock data.xlsx')
Note you are performing many operations, e.g. set_index, as if they happened in place by default. That's not the case: you should assign the result back to a variable, e.g. df = df.set_index('Date').
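A tiny demonstration of that pitfall; the column names and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2021-01-04', '2021-01-05'],
                   'Close': [100.5, 101.2]})

df.set_index('Date')         # returns a NEW frame; df itself is unchanged
print('Date' in df.columns)  # True -- the call above had no lasting effect

df = df.set_index('Date')    # assign back to keep the result
print(df.index.name)         # Date
```

Most pandas methods behave this way; the inplace=True flags that some of them offer are discouraged in modern pandas.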