Faster Way to Read Excel Files into a Pandas DataFrame

Best/fastest way to read ~3,000 sheets from an Excel file and load them into a Pandas DataFrame

You can try passing nr_pages_workbook directly to the sheet_name parameter of read_excel; according to the docs it can be a list, and the return value will be a dict of DataFrames. This way you avoid the overhead of opening and reading the file on every iteration.
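
A minimal sketch of that approach, assuming nr_pages_workbook holds a list of sheet names (the names and path here are placeholders):

import pandas as pd

# hypothetical list of sheets to load in a single pass
nr_pages_workbook = ['Sheet1', 'Sheet2', 'Sheet3']

# sheet_name accepts a list; the result is a dict of {sheet name: DataFrame}
data = pd.read_excel('Extras.xlsx', sheet_name=nr_pages_workbook)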

Or simply pass sheet_name=None to read all sheets into a dict, and then concatenate from the dict:

data = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx', sheet_name=None)
df = pd.concat(data.values())

How to increase processing speed when using read_excel in pandas?

Read all worksheets without guessing

Pass the sheet_name=None argument to pd.read_excel (the parameter was called sheetname in older pandas versions). This will read all worksheets into a dictionary of DataFrames. For example:

dfs = pd.read_excel('file.xlsx', sheet_name=None)

# access 'Sheet1' worksheet
res = dfs['Sheet1']

Limit number of rows or columns

You can use the usecols and skipfooter arguments (parse_cols and skip_footer in older pandas versions) to limit the number of columns and/or rows. This reduces read time, and it also works with sheet_name=None.

For example, the following reads the first 3 columns and, if your worksheet has 100 rows, only the first 20 of them:

df = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', skipfooter=80)
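
In recent pandas versions, nrows caps the row count directly, without needing to know the sheet length in advance (a sketch, not part of the original answer):

# read at most 20 rows from each sheet, regardless of total length
dfs = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', nrows=20)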

If you wish to apply worksheet-specific logic, you can do so by extracting the sheet names. Reusing a single pd.ExcelFile object also avoids reopening the file for every sheet:

xl = pd.ExcelFile('file.xlsx')

dfs = {}
for sheet in xl.sheet_names:
    dfs[sheet] = xl.parse(sheet)
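
For instance, per-sheet logic might look like this (a sketch; the sheet-name convention is hypothetical):

xl = pd.ExcelFile('file.xlsx')

dfs = {}
for sheet in xl.sheet_names:
    # hypothetical rule: skip summary sheets, read only columns A:C elsewhere
    if sheet.startswith('Summary'):
        continue
    dfs[sheet] = xl.parse(sheet, usecols='A:C')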

Improving performance

Reading Excel files into Pandas is inherently slower than other formats (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.

One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv.
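
Another option, staying in Python, is to parse the workbook once and cache the result to Pickle, so subsequent loads skip the Excel parser entirely (a sketch under that assumption; file names are placeholders):

import os

import pandas as pd

cache = 'file.pkl'
if os.path.exists(cache):
    # fast path: load the cached DataFrame
    df = pd.read_pickle(cache)
else:
    # slow path: parse the Excel file once, then cache for next time
    df = pd.concat(pd.read_excel('file.xlsx', sheet_name=None).values())
    df.to_pickle(cache)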

How to improve my append-and-read-Excel for loop in Python

I've found a solution with xlsx2csv:

import glob
import re
import subprocess

import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')

for xlsx in list_of_xlsx:
    # extract the file name (group 2 of the pattern)
    filename = re.search(r'(.+[\\/])(.+)(\.xlsx)', xlsx).group(2)
    # set up the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # on Windows you may need shell=True
    except OSError:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # filepath + filename of the output csv

listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # make sure there are 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)

I tried to make this work for me but had no success. Maybe you can get it working for you? Or does anyone have other ideas?
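
If shelling out to the script proves brittle, the xlsx2csv package also exposes a Python API, which avoids the subprocess call entirely (a sketch assuming the package is installed; paths reuse the placeholders above):

from xlsx2csv import Xlsx2csv

# convert one workbook directly from Python, no subprocess needed
Xlsx2csv(xlsx, outputencoding='utf-8').convert(csv_path + filename + '.csv')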


