Possible to loop through excel files with differently named sheets, and import into a list?
Another (fast) R solution, using the readxl package:
l <- lapply( file.list, readxl::read_excel, sheet = 1 )
xlsx R: looping through list of files to check all sheet names; create a blank sheet if it does not exist
I'm old skool. I like a for loop. That's what happens when you come from other languages!
So I'd skip all that lapply, which doesn't explain what is going on, and do this:
new_data = "" # a placeholder for the data you will insert in a new empty sheet
for (file in all_files) {
  # open the file and get its sheets
  sheets = readxl::excel_sheets(file)
  # check if the file has the sheet you want
  if ("sheet1" %in% sheets) {
    # do nothing
  } else {
    # if not - create a sheet
    xlsx::write.xlsx(new_data,
                     file, # You may need to add the path?
                     sheetName = "sheet1",
                     append = TRUE)
  }
} # if you are checking for all 3 sheets and adding any that are missing, add another for loop?
The R purists will say this is slow and inefficient. My question would be how fast does it need to be? How readable does it need to be?
Python Pandas - loop through folder of .xlsx files, only add data from Excel tabs with xx.xx in the name using regex
You're really close indeed; you just have to filter the sheet names with re.match. Loop through each Excel file, and for each file, open it and get the list of tab names (excel_file.sheet_names). Use re.match with the expression you already defined to keep only the tabs that match the desired pattern. Note that re.match only matches at the start of the tab name; use re.search if the pattern can appear anywhere in it. Read the content of these sheets (sheet_name=valid_sheets), adjusting headers and index as needed for your particular case, then add the extracted content of each Excel file to a list. Concatenate the list with pd.concat and generate the new Excel file.
import pandas as pd
import os
import re

# filenames
files = os.listdir()
excel_names = list(filter(lambda f: f.endswith('.xlsx'), files))
regex = r'[0-9][0-9]+\.[0-9][0-9]'
frame_list = []
# loop through each Excel file
for name in excel_names:
    # open one excel file
    excel_file = pd.ExcelFile(name, engine='openpyxl')
    # get the list of tabs that have xx.xx in the string
    valid_sheets = [tab for tab in excel_file.sheet_names if re.match(regex, tab)]
    # read the content from that tab list
    d = excel_file.parse(sheet_name=valid_sheets, header=0)
    # add the content to the frame list
    frame_list += list(d.values())
combined = pd.concat(frame_list)
combined.to_excel("combinedfiles.xlsx", header=False, index=False)
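The tab filter can be checked on its own, without opening any workbook. The sketch below reuses the regex from the answer on a list of made-up tab names (the names are assumptions for illustration only) and compares re.match, which anchors at the start of the name, with re.search, which looks anywhere in it.

```python
import re

regex = r'[0-9][0-9]+\.[0-9][0-9]'

# Hypothetical tab names, for illustration only.
tabs = ["12.34", "2021 12.34", "Summary", "123.45 final", "1.23"]

# re.match only succeeds if the pattern starts at the first character.
at_start = [t for t in tabs if re.match(regex, t)]

# re.search succeeds if the pattern occurs anywhere in the name.
anywhere = [t for t in tabs if re.search(regex, t)]

print(at_start)
print(anywhere)
```

If the xx.xx part can appear in the middle of a tab name, re.search is probably the filter you want; otherwise re.match as used above is fine.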
Iterate through excel files' sheets and append if sheet names share common part in Python
Try:
import pandas as pd
dfs = pd.read_excel('Downloads/WS_1.xlsx', sheet_name=None, index_col=[0])
df_out = pd.concat(dfs.values(), keys=dfs.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
    g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx')
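The groupby key above is the part of each sheet name after its last underscore. A minimal plain-Python sketch of that grouping rule, using hypothetical sheet names (not taken from the question), may make it easier to see which sheets end up together:

```python
from collections import defaultdict

# Hypothetical sheet names; the real ones come from the workbook.
sheet_names = ["north_2019", "south_2019", "north_2020", "summary"]

groups = defaultdict(list)
for name in sheet_names:
    # rsplit('_', 1) splits on the last underscore only; [-1] is the suffix.
    # A name without an underscore is its own suffix.
    suffix = name.rsplit("_", 1)[-1]
    groups[suffix].append(name)

print(dict(groups))
```

Each key here corresponds to one value of n in the loop above, and so to one output file.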
Update
import glob
import pandas as pd

files = glob.glob("Downloads/test_data/*.xlsx")
writer = pd.ExcelWriter('Downloads/test_data/Output_file.xlsx', engine='xlsxwriter')
excel_dict = {}
for each in files:
    dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
    # collect the sheets of every file (same-named sheets overwrite earlier ones)
    excel_dict.update(dfs)
# concatenate the sheets from all files, not only the last file read
df_out = pd.concat(excel_dict.values(), keys=excel_dict.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
    g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(writer, index=False, sheet_name=f'{n}')
# close() also writes the file; writer.save() was removed in pandas 2.0
writer.close()
Loop in order to create several DataFrames for each sheet in an Excel file
You can make use of exec() for this. The exec() function dynamically executes Python code, which can be either a string or a code object.
You can use the xlrd library to get the sheet names. pandas can give you the sheet names too, via pd.ExcelFile(filename).sheet_names. (Note that xlrd 2.0 and later reads only .xls files, so on a current install the pandas route is the one that works for .xlsx.)
import xlrd
filename='try.xlsx'
xls = xlrd.open_workbook(filename, on_demand=True)
sheet_names=xls.sheet_names()
print(sheet_names)
Output:
['see1', 'see2', 'Sheet3']
Now that you've got the sheet names, you can loop over them and use exec to create dataframes with the same names:
import pandas as pd
for name in sheet_names:
    exec(f"{name}=pd.read_excel('{filename}', sheet_name='{name}')")
This creates dataframes named see1, see2 and Sheet3.
print(see1)
Output:
Col1 COl2
0 1 2
1 2 3
2 3 4
3 4 4
Hope this is what you need.
NOTE: In case your sheet name is just numbers, then it won't be possible to name a variable as just a number, so you might have to assign it a new name.
So just for the OP's case, here's a solution:
for name in sheet_names:
    if name.isdigit():
        exec(f"Sheet_name{name}=pd.read_excel('{filename}', sheet_name='{name}')")
    else:
        exec(f"{name}=pd.read_excel('{filename}', sheet_name='{name}')")
So what this code will do is, if you have any sheet name which is just numeric, it will create the variable name as, Sheet_name{the numeric}.
So in my case, I had sheet names as: ['Sheet1', '245', 'Sheet3']
and I finally get the second variable as a dataframe as below:
print(Sheet_name245)
Output:
Col1 Col2
0 1 4
1 2 5
2 3 6
Hope this helps with your case.
NOTE2: If the sheet name contains a decimal point rather than being a plain integer, the above code will fail, since a decimal point can't be used in a variable name either. So here's a workaround:
for name in sheet_names:
    if name.isdigit():
        exec(f"Sheet_name{name}=pd.read_excel('{filename}', sheet_name='{name}')")
    elif '.' in name:
        temp_name=name.replace('.', '_')
        exec(f"Sheet_name{temp_name}=pd.read_excel('{filename}', sheet_name='{name}')")
    else:
        exec(f"{name}=pd.read_excel('{filename}', sheet_name='{name}')")
So now the sheet named 245.63 is loaded into the variable Sheet_name245_63. I hope your issue is now resolved.
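The renaming rules above can also be folded into one small function, which covers other awkward sheet names (spaces, dashes) as well. This is only a sketch: to_identifier is a hypothetical helper, not part of pandas or xlrd, and the Sheet_name prefix is simply carried over from the workaround above.

```python
import keyword
import re

def to_identifier(sheet_name, prefix="Sheet_name"):
    """Hypothetical helper: map a sheet name to a valid Python identifier."""
    # Replace every character that is not a letter, digit or underscore.
    cleaned = re.sub(r"\W", "_", sheet_name)
    # Identifiers cannot be empty, start with a digit or be a reserved word.
    if not cleaned or cleaned[0].isdigit() or keyword.iskeyword(cleaned):
        cleaned = prefix + cleaned
    return cleaned

sheet_names = ["Sheet1", "245", "245.63", "my sheet"]
names = {name: to_identifier(name) for name in sheet_names}
print(names)
```

If you don't strictly need separate variables, a plain dict keyed by the original sheet names avoids exec() and the renaming altogether.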
Loop through Excel sheets in Python
You can read all sheets by providing sheet_name=None:
dict_of_frames = pd.read_excel(f, sheet_name=None)
Full example:
import glob
import pandas as pd

all_sheets = []
for f in glob.glob(r'C:\Users\Sarah\Desktop\test\*.xlsx'):
    all_sheets.extend(pd.read_excel(f, sheet_name=None).values())
data = pd.concat(all_sheets)
data.to_excel(r'C:\Users\Sarah\Desktop\test\appended.xlsx')
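If you also want to know which sheet each row came from, one option is to pass the dict itself to pd.concat so that the sheet names become an index level. A minimal sketch, with small in-memory frames standing in for the result of pd.read_excel(f, sheet_name=None); the sheet names, column names and the "sheet"/"row" level labels are assumptions for illustration:

```python
import pandas as pd

# Stand-in for pd.read_excel(f, sheet_name=None): sheet name -> DataFrame.
frames = {
    "Sheet1": pd.DataFrame({"Col1": [1, 2], "Col2": [3, 4]}),
    "Sheet2": pd.DataFrame({"Col1": [5], "Col2": [6]}),
}

# Passing the dict to pd.concat uses its keys as an outer index level;
# names= labels the levels so the sheet name can be turned into a column.
combined = (
    pd.concat(frames, names=["sheet", "row"])
    .reset_index(level="sheet")
    .reset_index(drop=True)
)
print(combined)
```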