Faster Way to Read Excel Files into a Pandas DataFrame

Best/fastest way to read ~3,000 sheets from an Excel file and load them into a Pandas DataFrame

You can try passing nr_pages_workbook directly to the sheet_name parameter of read_excel; according to the docs it can be a list, and the return value will be a dict of DataFrames. This way you avoid the overhead of opening and reading the file on every iteration.
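
A minimal sketch of that approach, assuming nr_pages_workbook holds a list of sheet names (the names and path here are placeholders):

import pandas as pd

# hypothetical list of sheets to load in a single pass
nr_pages_workbook = ['Sheet1', 'Sheet2', 'Sheet3']

# sheet_name accepts a list; the result is a dict of {sheet name: DataFrame}
data = pd.read_excel('Extras.xlsx', sheet_name=nr_pages_workbook)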

Or simply pass sheet_name=None to read all sheets into a dict, and then concatenate from the dict:

data = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx', sheet_name=None)
df = pd.concat(data.values())

How to increase processing speed when using read_excel in pandas?

Read all worksheets without guessing

Pass the sheet_name=None argument to pd.read_excel (the parameter was called sheetname in older pandas versions). This will read all worksheets into a dictionary of DataFrames. For example:

dfs = pd.read_excel('file.xlsx', sheet_name=None)

# access 'Sheet1' worksheet
res = dfs['Sheet1']

Limit number of rows or columns

You can use the usecols and skipfooter arguments (parse_cols and skip_footer in older pandas versions) to limit the number of columns and/or rows. This reduces read time, and it also works with sheet_name=None.

For example, the following reads the first 3 columns and, if your worksheet has 100 rows, only the first 20 of them:

df = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', skipfooter=80)
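
In recent pandas versions, nrows caps the row count directly, without needing to know the sheet length in advance (a sketch, not part of the original answer):

# read at most 20 rows from each sheet, regardless of total length
dfs = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', nrows=20)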

If you wish to apply worksheet-specific logic, you can do so by extracting the sheet names. Reusing a single pd.ExcelFile object also avoids reopening the file for every sheet:

xl = pd.ExcelFile('file.xlsx')

dfs = {}
for sheet in xl.sheet_names:
    dfs[sheet] = xl.parse(sheet)
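
For instance, per-sheet logic might look like this (a sketch; the sheet-name convention is hypothetical):

xl = pd.ExcelFile('file.xlsx')

dfs = {}
for sheet in xl.sheet_names:
    # hypothetical rule: skip summary sheets, read only columns A:C elsewhere
    if sheet.startswith('Summary'):
        continue
    dfs[sheet] = xl.parse(sheet, usecols='A:C')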

Improving performance

Reading Excel files into Pandas is inherently slower than other formats (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.

One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv.
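
Another option, staying in Python, is to parse the workbook once and cache the result to Pickle, so subsequent loads skip the Excel parser entirely (a sketch under that assumption; file names are placeholders):

import os

import pandas as pd

cache = 'file.pkl'
if os.path.exists(cache):
    # fast path: load the cached DataFrame
    df = pd.read_pickle(cache)
else:
    # slow path: parse the Excel file once, then cache for next time
    df = pd.concat(pd.read_excel('file.xlsx', sheet_name=None).values())
    df.to_pickle(cache)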

How to improve my append-and-read-Excel for loop in Python

I've found a solution with xlsx2csv:

import glob
import re
import subprocess

import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')

for xlsx in list_of_xlsx:
    # extract the file name (group 2 of the pattern)
    filename = re.search(r'(.+[\\/])(.+)(\.xlsx)', xlsx).group(2)
    # set up the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # on Windows you may need shell=True
    except OSError:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # filepath + filename of the output csv

listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # make sure there are 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)

I tried to make this work for me but had no success. Maybe you can get it working for you? Or does anyone have other ideas?
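
If shelling out to the script proves brittle, the xlsx2csv package also exposes a Python API, which avoids the subprocess call entirely (a sketch assuming the package is installed; paths reuse the placeholders above):

from xlsx2csv import Xlsx2csv

# convert one workbook directly from Python, no subprocess needed
Xlsx2csv(xlsx, outputencoding='utf-8').convert(csv_path + filename + '.csv')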


