Best/Fastest way to read 3k of sheets from an Excel and Upload them in a Pandas Dataframe
You can try passing nr_pages_workbook directly to the sheet_name param in read_excel. According to the docs it can be a list, and the return value will be a dict of dataframes. This way you can avoid the overhead of opening and reading the file in every cycle.
Or pass sheet_name=None (omitting the parameter reads only the first sheet) to read all sheets into a dict, and then concatenate from the dict:
data = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx', sheet_name=None)
df = pd.concat(list(data.values()))
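When concatenating, it can help to keep track of which sheet each row came from. Below is a minimal sketch, assuming the dict returned by read_excel with sheet_name=None. The helper name combine_sheets is hypothetical, and the small frames and sheet names are made up to stand in for real worksheets:

```python
import pandas as pd

def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} dict into one frame with a 'sheet' column."""
    # pd.concat accepts a mapping; its keys become the outer index level
    combined = pd.concat(sheets, names=['sheet', 'row'])
    # Move the sheet name out of the index into an ordinary column
    return combined.reset_index(level='sheet').reset_index(drop=True)

# Stand-in for: sheets = pd.read_excel(path, sheet_name=None)
sheets = {
    'Page1': pd.DataFrame({'value': [1, 2]}),
    'Page2': pd.DataFrame({'value': [3]}),
}
df = combine_sheets(sheets)
```

The resulting frame has a plain integer index and a 'sheet' column, so rows can still be filtered by worksheet after the merge.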
How to increase process speed using read_excel in pandas?
Read all worksheets without guessing
Use the sheet_name=None argument to pd.read_excel (the parameter was spelled sheetname in older pandas releases). This will read all worksheets into a dictionary of dataframes. For example:
dfs = pd.read_excel('file.xlsx', sheet_name=None)
# access 'Sheet1' worksheet
res = dfs['Sheet1']
Limit number of rows or columns
You can use the usecols and skipfooter arguments (parse_cols and skip_footer in older pandas releases) to limit the number of columns and/or rows. This will reduce read time, and also works with sheet_name=None.
For example, the following will read the first 3 columns and, if your worksheet has 100 rows, it will read only the first 20.
dfs = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', skipfooter=80)
If you wish to apply worksheet-specific logic, you can do so by extracting the sheet names. Opening the workbook once with pd.ExcelFile avoids re-opening the file for every sheet:
xls = pd.ExcelFile('file.xlsx')
dfs = {}
for sheet in xls.sheet_names:
    dfs[sheet] = xls.parse(sheet)
Improving performance
Reading Excel files into Pandas is naturally slower than other options (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.
One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv.
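The convert-once idea can also be applied from Python rather than VBA. Below is a rough sketch, assuming the slow Excel parse is wrapped in a function you supply. The names load_with_cache and slow_loader are hypothetical, and slow_loader just builds a small frame in place of a real pd.read_excel call:

```python
import os
import tempfile

import pandas as pd

def load_with_cache(cache_path, loader):
    """Return the cached CSV if it exists, else call loader() and cache the result."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = loader()
    df.to_csv(cache_path, index=False)
    return df

calls = []  # records how often the expensive loader actually runs

def slow_loader():
    # Stand-in for an expensive pd.read_excel(...) call
    calls.append(1)
    return pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

cache_file = os.path.join(tempfile.mkdtemp(), 'cache.csv')
first = load_with_cache(cache_file, slow_loader)   # parses and writes the cache
second = load_with_cache(cache_file, slow_loader)  # served from the CSV
```

Note that CSV does not store column types, so dates and similar columns may need to be re-parsed on load; Pickle or HDF5 avoid that at the cost of a less portable file.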
How to improve my append and read excel For loop in python
I've found a solution with xlsx2csv
import glob
import re
import subprocess

import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')
for xlsx in list_of_xlsx:
    # Extract the file name (without extension) from group 2 "(.+)"
    filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
    # Set up the argument list for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # On Windows use shell=True
    except Exception:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # path and file name of the combined output CSV
listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # keep only files with the expected 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))
bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)
I tried to make this work for me but had no success. Maybe you might be able to have it working for you? Or does anyone have any ideas?
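One thing that may be worth checking in the conversion loop is the file-name handling, which relies on a regular expression and a bare "python" on the PATH. Below is a sketch of an alternative using pathlib and sys.executable. The helper build_call and the file name report.xlsx are hypothetical, and the script location ./xlsx2csv.py is taken from the code above:

```python
import sys
from pathlib import Path

def build_call(xlsx, csv_dir, script='./xlsx2csv.py'):
    """Return the argument list that converts one workbook to a CSV in csv_dir."""
    # Path.stem is the file name without its directory or extension
    target = Path(csv_dir) / (Path(xlsx).stem + '.csv')
    # sys.executable is the interpreter running the current script
    return [sys.executable, script, str(xlsx), str(target)]

call = build_call('./data/Extract/report.xlsx', './data/csv/')
```

The returned list can be passed to subprocess.call() in place of the hand-built one; whether this resolves the failure described above is not something the sketch can confirm.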