Import Multiple Excel Files into Python Pandas and Concatenate Them into One Dataframe

Import multiple Excel files into Python pandas and concatenate them into one dataframe

As mentioned in the comments, one error you are making is that you are looping over an empty list.
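
To see why that matters, here is a minimal sketch of the pitfall (the names are illustrative, not taken from the question): if the list of filenames is never populated, the loop body simply never runs and nothing gets read.

import pandas as pd

files_xls = []                       # the list was never filled...
frames = []
for f in files_xls:                  # ...so this loop runs zero times
    frames.append(pd.read_excel(f))
print(len(frames))                   # 0 -- pd.concat(frames) would raise "No objects to concatenate"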

Here is how I would do it, using an example of having 5 identical Excel files that are appended one after another.

(1) Imports:

import os
import pandas as pd

(2) List files:

path = os.getcwd()
files = os.listdir(path)
files

Output:

['.DS_Store',
'.ipynb_checkpoints',
'.localized',
'Screen Shot 2013-12-28 at 7.15.45 PM.png',
'test1 2.xls',
'test1 3.xls',
'test1 4.xls',
'test1 5.xls',
'test1.xls',
'Untitled0.ipynb',
'Werewolf Modelling',
'~$Random Numbers.xlsx']

(3) Pick out 'xls' files:

files_xls = [f for f in files if f.endswith('.xls')]
files_xls

Output:

['test1 2.xls', 'test1 3.xls', 'test1 4.xls', 'test1 5.xls', 'test1.xls']

(4) Initialize an empty list to collect the dataframes (DataFrame.append was removed in pandas 2.0, so collecting the frames in a list and calling pd.concat once is the idiomatic route):

frames = []

(5) Loop over the list of files, read each one, then concatenate:

for f in files_xls:
    data = pd.read_excel(f, sheet_name='Sheet1')
    frames.append(data)

df = pd.concat(frames)

(6) Enjoy your new dataframe. :-)

df

Output:

  Result  Sample
0      a       1
1      b       2
2      c       3
3      d       4
4      e       5
5      f       6
6      g       7
7      h       8
8      i       9
9      j      10
0      a       1
1      b       2
2      c       3
3      d       4
4      e       5
5      f       6
6      g       7
7      h       8
8      i       9
9      j      10
0      a       1
1      b       2
2      c       3
3      d       4
4      e       5
5      f       6
6      g       7
7      h       8
8      i       9
9      j      10
0      a       1
1      b       2
2      c       3
3      d       4
4      e       5
5      f       6
6      g       7
7      h       8
8      i       9
9      j      10
0      a       1
1      b       2
2      c       3
3      d       4
4      e       5
5      f       6
6      g       7
7      h       8
8      i       9
9      j      10
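
If you would prefer one continuous index instead of the repeated 0-9 blocks shown above, pd.concat accepts an ignore_index flag, e.g.:

df = pd.concat(frames, ignore_index=True)   # index runs 0..49 instead of repeating per file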

Import multiple Excel files that start with the same name into pandas and concatenate them into one dataframe

Use pathlib's glob method and then concatenate with pandas, using a list comprehension.

from pathlib import Path
import pandas as pd

src_files = Path('C:\\').glob('*Answer*.xlsx')

df = pd.concat([pd.read_excel(f, index_col=None, header=0) for f in src_files])
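
One small caveat (my addition, not part of the original answer): glob does not guarantee any particular file order, so if the row order of the combined dataframe matters, sort the matches first:

src_files = sorted(Path('C:\\').glob('*Answer*.xlsx'))
df = pd.concat([pd.read_excel(f, index_col=None, header=0) for f in src_files])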

Concatenate files into one Dataframe while adding identifier for each file

If I understand you correctly, it's simple:

import glob
import os
import pandas as pd
import re  # <-------------- Add this line

path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")

li = []

for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    # Search only the file's name, not the full path (the directories above also contain digits).
    participant_number = int(re.search(r'(\d+)', os.path.basename(filename)).group(1))  # <-------------- Add this line
    df['participant_number'] = participant_number  # <-------------- Add this line
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

That way, each dataframe loaded from an Excel file will have a column called participant_number, and the value of that column in each row will be the number found in the name of the file that the dataframe was loaded from.
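
To make the regex step concrete, here is what it pulls out of a hypothetical file name (participant_12.xlsx is made up for illustration; only the directory path comes from the answer above):

import os
import re

filename = "/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call/participant_12.xlsx"  # hypothetical file
print(re.search(r'(\d+)', os.path.basename(filename)).group(1))   # prints: 12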

In Python, how to concatenate corresponding sheets in multiple Excel files

I would iterate over each file, and then over each worksheet, adding each sheet to a different list based on the sheet name.

Then you'll have a structure like...

{
    'sheet1': [df_file1_sheet1, df_file2_sheet1, df_file3_sheet1],
    'sheet2': [df_file1_sheet2, df_file2_sheet2, df_file3_sheet2],
    'sheet3': [df_file1_sheet3, df_file2_sheet3, df_file3_sheet3],
}

Then concatenate each list into a single dataframe, then write the three dataframes to an Excel file.

# This part is just your own code, I've added it here because you
# couldn't figure out where `excel_files` came from
#################################################################

import os
import pandas as pd

os.chdir(r'mypath\\')   # os.chdir returns None, so don't assign its result
files = os.listdir()
files
# pull files with `.xlsx` extension
excel_files = [file for file in files if '.xlsx' in file]
excel_files

# This part is my actual answer
###############################

from collections import defaultdict

worksheet_lists = defaultdict(list)
for file_name in excel_files:
    workbook = pd.ExcelFile(file_name)
    for sheet_name in workbook.sheet_names:
        worksheet = workbook.parse(sheet_name)
        worksheet['source'] = file_name
        worksheet_lists[sheet_name].append(worksheet)

worksheets = {
    sheet_name: pd.concat(sheet_list)
    for (sheet_name, sheet_list)
    in worksheet_lists.items()
}

# ExcelWriter.save() was removed in recent pandas; a context manager handles saving for you.
with pd.ExcelWriter('family_reschedule.xlsx') as writer:
    for sheet_name, df in worksheets.items():
        df.to_excel(writer, sheet_name=sheet_name, index=False)

Import multiple Excel sheets from different files into Python and concatenate them into one dataframe

IIUC, you can just do a list comp over your directory.

If using Python 3.4+:

from pathlib import Path
import pandas as pd

path_ = r'C:\Users\Documents\Files'

dfs = [pd.read_excel(f, sheet_name='metrics') for f in Path(path_).glob('RMP*WE*')]

df = pd.concat(dfs)

Or, if you prefer the os and glob modules:

import glob
import os

os.chdir(r'C:\Users\Documents\Files')
files = glob.glob('RMP*WE*')
dfs = [pd.read_excel(f, sheet_name='metrics') for f in files]
df = pd.concat(dfs)

Update.

If you need to handle missing sheets, this would be a nice way to do so.

def exlude_sheet(excel_list, sheet):
    """
    Takes two arguments:
    1. A list of Excel documents.
    2. The name of your sheet.
    Returns a single dataframe after
    working through the list of Excel files.
    """
    from xlrd import XLRDError
    df_lists = []
    for file in excel_list:
        try:
            file_df = pd.read_excel(file, sheet_name=sheet)
            df_lists.append(file_df)
        except XLRDError as e:
            print(f"{e} skipping")
            continue
    try:
        return pd.concat(df_lists)
    except ValueError:
        print("No Objects Matched")
Test.

xlsx = [f for f in Path(path_).glob('RMP*WE*')]
df = exlude_sheet(xlsx, sheet='Metrics')

Output:

No sheet named <'Metrics'> for doc_1 skipping
No sheet named <'Metrics'> for doc_final skipping

print(df)

   Column_A  data
0         0     0
1         1     1
2         2     2
3         3     3
4         4     4

Test 2

Testing when no matching sheets are found at all:

exlude_sheet(xlsx, 'foobar')

Output:

No sheet named <'foobar'> skipping
No sheet named <'foobar'> skipping
No sheet named <'foobar'> skipping
No Objects Matched

How do I merge multiple xls files into one dataframe in Python?

Here's how to do it. I used two functions. The first function reads all sheets within a single Excel file and adds the sheet name. The second function takes all of the Excel files and uses the first function to read all the sheets in all the files.

import pandas as pd

def read_sheets(filename):
    result = []
    # sheet_name=None reads every sheet into a dict of {sheet name: dataframe}
    sheets = pd.read_excel(filename, sheet_name=None)
    for name, sheet in sheets.items():
        sheet['Sheetname'] = name
        sheet['Row'] = sheet.index
        result.append(sheet)
    return pd.concat(result, ignore_index=True)

def read_files(filenames):
    result = []
    for filename in filenames:
        file = read_sheets(filename)
        file['Filename'] = filename
        result.append(file)
    return pd.concat(result, ignore_index=True)

You can call this by providing a list of files to read:

files = ['multisheet.xls', 'multisheet2.xls']
read_files(files)

For the example I tried it on, it produces a dataframe like this:

    A   B  A+B Sheetname  Row         Filename
0   1  10   11    Sheet1    0   multisheet.xls
1   2  11   13    Sheet1    1   multisheet.xls
2   3  12   15    Sheet1    2   multisheet.xls
3   4  13   17    Sheet1    3   multisheet.xls
4   3  10   13    Sheet2    0   multisheet.xls
5   3  11   14    Sheet2    1   multisheet.xls
6   3  12   15    Sheet2    2   multisheet.xls
7   3  13   16    Sheet2    3   multisheet.xls
8   1  10   11    Sheet1    0  multisheet2.xls
9   2  11   13    Sheet1    1  multisheet2.xls
10  3  12   15    Sheet1    2  multisheet2.xls
11  4  13   17    Sheet1    3  multisheet2.xls
12  4  10   13    Sheet2    0  multisheet2.xls
13  3  11   14    Sheet2    1  multisheet2.xls
14  3  12   15    Sheet2    2  multisheet2.xls
15  3  13   16    Sheet2    3  multisheet2.xls

How to run a loop to concatenate columns of multiple Excel files (as separate dataframes) in a folder, then merge and export them into a final dataframe

Create an empty list before the loop, then append to it to build a list of DataFrames:

import glob
import pandas as pd

filenames = glob.glob(r'C:\Desktop\*.xlsx')

final = []
for fname in filenames:
    df2 = pd.read_excel(fname, sheet_name="PI", skiprows=4)
    df2["Combin"] = df2.Pcode.str.cat(df2.Icode)
    # `df` is the existing dataframe (from earlier in your code) that each file is merged against
    merged = df.merge(df2, on='Combin', how='inner')
    df3 = pd.read_excel(fname, sheet_name='PI')
    exc = df3.iat[0, 19]
    merged['Exchange'] = exc
    final.append(merged)

excel_merged = pd.concat(final, ignore_index=True)
excel_merged.to_excel('output.xlsx')

Merge Excel files with multiple sheets into one dataframe

pd.concat() has an ignore_index parameter, which you will need if your rows have differing indices across the individual frames. If they share a common index (as in my example), you do not need ignore_index and can keep the column names.

Try:

pd.concat(frames, axis=1, ignore_index=True)

In [5]: df1 = pd.DataFrame({"A": 2, "B": 3}, index=[0, 1])

In [6]: df1
Out[6]:
   A  B
0  2  3
1  2  3

In [7]: df2 = pd.DataFrame({"AAA": 22, "BBB": 33}, index=[0, 1])

In [10]: df = pd.concat([df1, df2], axis=1, ignore_index=True)

In [11]: df
Out[11]:
   0  1   2   3
0  2  3  22  33
1  2  3  22  33

In [12]: df = pd.concat([df1, df2], axis=1, ignore_index=False)

In [13]: df
Out[13]:
   A  B  AAA  BBB
0  2  3   22   33
1  2  3   22   33

