Import multiple excel files into python pandas and concatenate them into one dataframe
As mentioned in the comments, one error you are making is that you are looping over an empty list.
Here is how I would do it, using an example of having 5 identical Excel files that are appended one after another.
(1) Imports:
import os
import pandas as pd
(2) List files:
path = os.getcwd()
files = os.listdir(path)
files
Output:
['.DS_Store',
'.ipynb_checkpoints',
'.localized',
'Screen Shot 2013-12-28 at 7.15.45 PM.png',
'test1 2.xls',
'test1 3.xls',
'test1 4.xls',
'test1 5.xls',
'test1.xls',
'Untitled0.ipynb',
'Werewolf Modelling',
'~$Random Numbers.xlsx']
(3) Pick out 'xls' files:
files_xls = [f for f in files if f[-3:] == 'xls']
files_xls
Output:
['test1 2.xls', 'test1 3.xls', 'test1 4.xls', 'test1 5.xls', 'test1.xls']
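The listing above also contains '~$Random Numbers.xlsx', a temporary lock file that Excel creates while a workbook is open. If you widen the filter to '.xlsx', you will probably want to skip those. A minimal sketch, where pick_excel_files is a made-up helper name and not part of the original answer:

```python
def pick_excel_files(names, extensions=(".xls", ".xlsx")):
    """Return the names that look like Excel workbooks, sorted by name."""
    picked = []
    for name in names:
        # '~$...' files are lock files Excel creates while a workbook is open
        if name.startswith("~$"):
            continue
        # compare case-insensitively so 'REPORT.XLS' is picked up too
        if name.lower().endswith(tuple(extensions)):
            picked.append(name)
    return sorted(picked)


files = ['.DS_Store', 'test1 2.xls', 'test1 3.xls', 'test1 4.xls',
         'test1 5.xls', 'test1.xls', 'Untitled0.ipynb', '~$Random Numbers.xlsx']
files_xls = pick_excel_files(files)
```

Using endswith instead of slicing the last three characters also avoids matching a name that merely ends in the letters 'xls' without a dot.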
(4) Initialize empty dataframe:
df = pd.DataFrame()
(5) Loop over list of files to append to empty dataframe:
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, data])
(6) Enjoy your new dataframe. :-)
df
Output:
Result Sample
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
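Note that the index in the output repeats 0 to 9 for every file. If you would rather have one running index, a common variation is to collect the frames in a list and concatenate once with ignore_index=True. A rough sketch of that idea; combine_excel_files and its reader argument are hypothetical names, and reader exists only so the function can be tried without real Excel files:

```python
import pandas as pd


def combine_excel_files(paths, sheet_name=0, reader=pd.read_excel):
    """Read every path with `reader` and stack the results into one DataFrame.

    `reader` is a hypothetical hook for testing; by default it is pandas.read_excel.
    """
    frames = [reader(path, sheet_name=sheet_name) for path in paths]
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)
```

With the files from step (3) you would call it as combine_excel_files(files_xls, sheet_name='Sheet1').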
Import multiple excel files that start with the same name into pandas and concatenate them into one dataframe
Use the glob method with pathlib and then concat using pandas and a list comprehension.
from pathlib import Path
import pandas as pd
src_files = Path('C:\\').glob('*Answer*.xlsx')
df = pd.concat([pd.read_excel(f, index_col=None, header=0) for f in src_files])
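One thing to be aware of: Path.glob does not promise any particular order, so the rows can end up in a different order from run to run. A small sketch that sorts the matches first; matching_workbooks is a made-up helper name, and the default pattern is just the one from the snippet above:

```python
from pathlib import Path


def matching_workbooks(folder, pattern="*Answer*.xlsx"):
    """Return the files in `folder` that match `pattern`, in sorted order."""
    # Path.glob yields matches in arbitrary order, so sort for a repeatable result
    return sorted(p for p in Path(folder).glob(pattern) if p.is_file())
```

The list comprehension then becomes pd.concat([pd.read_excel(f) for f in matching_workbooks('C:\\')]).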
Concatenate files into one Dataframe while adding identifier for each file
If I understand you correctly, it's simple:
import glob
import os
import re # <-------------- Add this line
import pandas as pd
path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")
li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    # Search only the file name: the folder path itself contains a digit ("Watch_data_1")
    participant_number = int(re.search(r'(\d+)', os.path.basename(filename)).group(1)) # <-------------- Add this line
    df['participant_number'] = participant_number # <-------------- Add this line
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
That way, each dataframe loaded from an Excel file will have a column called participant_number, and the value of that column in each row of each dataframe will be the number found in the filename that the dataframe was loaded from.
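If you want to check the number extraction on its own before reading any files, it can be pulled out into a small function. This is only a sketch: participant_number as a function and the example file name are assumptions, since the real file names are not shown in the question.

```python
import os
import re


def participant_number(path):
    """Return the first run of digits in the file name part of `path` as an int."""
    # Drop the folders first, because they may contain digits of their own
    name = os.path.basename(path)
    match = re.search(r"(\d+)", name)
    if match is None:
        raise ValueError(f"no number found in file name: {name!r}")
    return int(match.group(1))


# Hypothetical file name, for illustration only
number = participant_number("/data/Watch_data_1/participant_12.xlsx")
```

Raising an error for a file without a number is a design choice here; you could just as well return None and skip such files.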
In python, how to concatenate corresponding sheets in multiple excel files
I would iterate over each file, and then over each worksheet, adding each sheet to a different list based on the sheet name.
Then you'll have a structure like...
{
    'sheet1': [df_file1_sheet1, df_file2_sheet1, df_file3_sheet1],
    'sheet2': [df_file1_sheet2, df_file2_sheet2, df_file3_sheet2],
    'sheet3': [df_file1_sheet3, df_file2_sheet3, df_file3_sheet3],
}
Then concatenate each list into a single dataframe, then write the three dataframes to an Excel file.
# This part is just your own code, I've added it here because you
# couldn't figure out where `excel_files` came from
#################################################################
import os
import pandas as pd
path = r'mypath'
os.chdir(path)  # os.chdir() returns None, so don't assign its result
files = os.listdir()  # lists the new current directory
files
# pull files with `.xlsx` extension
excel_files = [file for file in files if file.endswith('.xlsx')]
excel_files
# This part is my actual answer
###############################
from collections import defaultdict
worksheet_lists = defaultdict(list)
for file_name in excel_files:
    workbook = pd.ExcelFile(file_name)
    for sheet_name in workbook.sheet_names:
        worksheet = workbook.parse(sheet_name)
        worksheet['source'] = file_name
        worksheet_lists[sheet_name].append(worksheet)
worksheets = {
    sheet_name: pd.concat(sheet_list)
    for (sheet_name, sheet_list)
    in worksheet_lists.items()
}
# The context manager saves the file on exit; ExcelWriter.save() was removed in pandas 2.0
with pd.ExcelWriter('family_reschedule.xlsx') as writer:
    for sheet_name, df in worksheets.items():
        df.to_excel(writer, sheet_name=sheet_name, index=False)
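To see the grouping step without any Excel files at hand, here is a small in-memory sketch of the same idea. The function name concat_by_sheet and the sample data are made up; the inner dicts have the shape that pd.read_excel(path, sheet_name=None) returns, which is one way you could build the input.

```python
from collections import defaultdict

import pandas as pd


def concat_by_sheet(workbooks):
    """Turn {file_name: {sheet_name: DataFrame}} into {sheet_name: DataFrame}."""
    sheet_lists = defaultdict(list)
    for file_name, sheets in workbooks.items():
        for sheet_name, frame in sheets.items():
            # assign() returns a copy, so the caller's frames are left untouched
            sheet_lists[sheet_name].append(frame.assign(source=file_name))
    return {name: pd.concat(frames, ignore_index=True)
            for name, frames in sheet_lists.items()}


# Tiny stand-ins for two workbooks (made-up data)
workbooks = {
    "a.xlsx": {"sheet1": pd.DataFrame({"x": [1, 2]}), "sheet2": pd.DataFrame({"y": [5]})},
    "b.xlsx": {"sheet1": pd.DataFrame({"x": [3]})},
}
combined = concat_by_sheet(workbooks)
```

A sheet that exists in only some of the files simply ends up with fewer frames in its list, so nothing special is needed for that case.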
Import multiple excel sheets from different files into python and concatenate them into one dataframe
IIUC, you can just do a list comp over your directory.
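If using Python 3.4+:
from pathlib import Path
import pandas as pd
# raw string: in a normal string '\U' starts a Unicode escape and this path is a SyntaxError
path_ = r'c:\Users\Documents\Files'
dfs = [pd.read_excel(f, sheet_name='metrics') for f in Path(path_).glob('RMP*WE*')]
df = pd.concat(dfs)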
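or if you can only use the os and glob modules:
import glob
import os
import pandas as pd
os.chdir(r'c:\Users\Documents\Files')  # raw string, so the backslashes are kept as-is
files = glob.glob('RMP*WE*')
dfs = [pd.read_excel(f, sheet_name='metrics') for f in files]
df = pd.concat(dfs)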
Update.
If you need to handle missing sheets, this would be a nice way to do so.
def exlude_sheet(excel_list, sheet):
    """
    Takes two arguments:
    1. A list of excel documents
    2. The name of your sheet.
    Returns a single data frame after
    working through your list of excel objects,
    or None if no document contains the sheet.
    """
    from xlrd import XLRDError
    df_lists = []
    for file in excel_list:
        try:
            file_df = pd.read_excel(file, sheet_name=sheet)
            df_lists.append(file_df)
        # xlrd raises XLRDError for a missing sheet; newer pandas versions may raise ValueError instead
        except (XLRDError, ValueError) as e:
            print(f"{e} skipping")
            continue
    try:
        return pd.concat(df_lists)
    except ValueError as err:
        # pd.concat raises ValueError when the list is empty
        print("No Objects Matched")
Test.
xlsx = [f for f in Path(path_).glob('RMP*WE*')]
df = exlude_sheet(xlsx,sheet='Metrics')
out:
No sheet named <'Metrics'> for doc_1 skipping
No sheet named <'Metrics'> for doc_final skipping
print(df)
Column_A data
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Test 2
Testing when no matching sheets are found at all :
exlude_sheet(xlsx,'foobar')
No sheet named <'foobar'> skipping
No sheet named <'foobar'> skipping
No sheet named <'foobar'> skipping
No Objects Matched
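An alternative that does not depend on which exception the Excel engine raises is to look at the workbook's sheet names before reading. A hedged sketch: read_sheet_if_present and collect_sheet are hypothetical names, and the workbook objects are assumed to behave like pd.ExcelFile (a sheet_names attribute and a parse method).

```python
import pandas as pd


def read_sheet_if_present(workbook, sheet):
    """Return the sheet as a DataFrame, or None when the workbook lacks it.

    `workbook` is assumed to behave like pandas.ExcelFile, i.e. to offer
    `sheet_names` and `parse(sheet_name)`.
    """
    if sheet not in workbook.sheet_names:
        return None
    return workbook.parse(sheet)


def collect_sheet(workbooks, sheet):
    """Concatenate `sheet` from every workbook that has it (None if none do)."""
    frames = [frame
              for frame in (read_sheet_if_present(wb, sheet) for wb in workbooks)
              if frame is not None]
    return pd.concat(frames, ignore_index=True) if frames else None
```

With the list from the test above, that would be called as collect_sheet([pd.ExcelFile(f) for f in xlsx], 'Metrics').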
How do I merge multiple xls files into one dataframe in python?
Here's how to do it. I used two functions. The first function reads all sheets within a single Excel file, and adds the sheet name. The second function takes all of the excel files, and uses the first function to read all the sheets in all the files.
import pandas as pd

def read_sheets(filename):
    result = []
    # sheet_name=None returns a dict of {sheet name: DataFrame}
    sheets = pd.read_excel(filename, sheet_name=None)
    for name, sheet in sheets.items():
        sheet['Sheetname'] = name
        sheet['Row'] = sheet.index
        result.append(sheet)
    return pd.concat(result, ignore_index=True)

def read_files(filenames):
    result = []
    for filename in filenames:
        file = read_sheets(filename)
        file['Filename'] = filename
        result.append(file)
    return pd.concat(result, ignore_index=True)
You can call this by providing a list of files to read:
files = ['multisheet.xls', 'multisheet2.xls']
read_files(files)
For the example I tried it on, it produces a dataframe like this:
A B A+B Sheetname Row Filename
0 1 10 11 Sheet1 0 multisheet.xls
1 2 11 13 Sheet1 1 multisheet.xls
2 3 12 15 Sheet1 2 multisheet.xls
3 4 13 17 Sheet1 3 multisheet.xls
4 3 10 13 Sheet2 0 multisheet.xls
5 3 11 14 Sheet2 1 multisheet.xls
6 3 12 15 Sheet2 2 multisheet.xls
7 3 13 16 Sheet2 3 multisheet.xls
8 1 10 11 Sheet1 0 multisheet2.xls
9 2 11 13 Sheet1 1 multisheet2.xls
10 3 12 15 Sheet1 2 multisheet2.xls
11 4 13 17 Sheet1 3 multisheet2.xls
12 4 10 13 Sheet2 0 multisheet2.xls
13 3 11 14 Sheet2 1 multisheet2.xls
14 3 12 15 Sheet2 2 multisheet2.xls
15 3 13 16 Sheet2 3 multisheet2.xls
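As a side note, pd.concat can add the sheet name and row number by itself when it is given the dict of sheets directly. This is a sketch of that variation, not the answer's method; stack_sheets is a made-up name and the sample frames are invented stand-ins for what read_excel(..., sheet_name=None) returns.

```python
import pandas as pd


def stack_sheets(sheets):
    """Stack {sheet_name: DataFrame} into one frame with Sheetname and Row columns."""
    # The dict keys become the outer index level; names= labels both levels,
    # and reset_index() turns them into ordinary columns.
    return pd.concat(sheets, names=["Sheetname", "Row"]).reset_index()


# Made-up stand-ins for two sheets of one workbook
sheets = {
    "Sheet1": pd.DataFrame({"A": [1, 2], "B": [10, 11]}),
    "Sheet2": pd.DataFrame({"A": [3], "B": [10]}),
}
stacked = stack_sheets(sheets)
```

The only visible difference from the two-function version is that Sheetname and Row come first in the column order instead of last.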
How to run a loop to concatenate columns of multiple excel files (as separate dataframes) in a folder and merge and export into final dataframe
Create an empty list before the loop, then use append to build a list of DataFrames:
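import glob
import pandas as pd
# df is the DataFrame to merge against; it is assumed to be defined earlier (not shown here)
filenames = glob.glob(r'C:\Desktop\*.xlsx')
final = []
for fname in filenames:
    df2 = pd.read_excel(fname, sheet_name="PI", skiprows=4)
    df2["Combin"] = df2.Pcode.str.cat(df2.Icode)
    merged = df.merge(df2, left_on='Combin', right_on='Combin', how='inner')
    # read the sheet again without skiprows to pick up the exchange cell
    df3 = pd.read_excel(fname, sheet_name='PI')
    exc = df3.iat[0, 19]
    merged['Exchange'] = exc
    final.append(merged)
# concatenate once, after the loop has finished
excel_merged = pd.concat(final, ignore_index=True)
excel_merged.to_excel('output.xlsx')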
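The per-file step can be tried on small in-memory frames before pointing it at real workbooks. A rough sketch, assuming Pcode and Icode hold strings; merge_on_combined_key, the Value column, the sample rows and the 'XNYS' label are all invented for illustration.

```python
import pandas as pd


def merge_on_combined_key(reference, sheet, exchange):
    """Inner-join `sheet` to `reference` on Pcode+Icode and tag the rows."""
    sheet = sheet.copy()  # leave the caller's frame untouched
    # str.cat joins the two string columns element by element
    sheet["Combin"] = sheet["Pcode"].str.cat(sheet["Icode"])
    merged = reference.merge(sheet, on="Combin", how="inner")
    merged["Exchange"] = exchange
    return merged


# Made-up data: only the 'A1' key exists on both sides
reference = pd.DataFrame({"Combin": ["A1", "B2"], "Value": [10, 20]})
sheet = pd.DataFrame({"Pcode": ["A", "C"], "Icode": ["1", "3"]})
merged = merge_on_combined_key(reference, sheet, "XNYS")
```

Because the join is an inner join, rows whose combined key is missing on either side are dropped silently, which is worth checking if the final output looks shorter than expected.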
Merge excel files with multiple sheets into one dataframe
pd.concat() has an ignore_index parameter, which you will need if your rows have differing indices across the individual frames. If they have a common index (like in my example), you do not need ignore_index and can keep the column names. With axis=1, ignore_index=True replaces the column names with 0, 1, 2, ..., as the output below shows.
Try:
pd.concat(frames, axis=1, ignore_index=True)
In [5]: df1 = pd.DataFrame({"A":2, "B":3}, index=[0, 1])
In [6]: df1
Out[6]:
A B
0 2 3
1 2 3
In [7]: df2 = pd.DataFrame({"AAA":22, "BBB":33}, index=[0, 1])
In [10]: df = pd.concat([df1, df2], axis=1, ignore_index=True)
In [11]: df
Out[11]:
0 1 2 3
0 2 3 22 33
1 2 3 22 33
In [12]: df = pd.concat([df1, df2], axis=1, ignore_index=False)
In [13]: df
Out[13]:
A B AAA BBB
0 2 3 22 33
1 2 3 22 33
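For comparison, here is a short sketch of what ignore_index does along each axis. The frames are made up and share the same column names so that the row-wise case is easy to read.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [2, 2], "B": [3, 3]})
df2 = pd.DataFrame({"A": [22, 22], "B": [33, 33]})

# axis=0 stacks rows; ignore_index=True renumbers the row index
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 puts the frames side by side; ignore_index=True renumbers the columns
side_by_side = pd.concat([df1, df2], axis=1, ignore_index=True)
```

So when merging sheets that have the same columns, axis=0 (the default) with ignore_index=True is usually the combination you want, while axis=1 is for sheets that describe the same rows with different columns.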