How to Open Excel File Fast in Python

How to open an Excel file fast in Python?

pyExcelerator appears not to be maintained. To write xls files, use xlwt, which is a fork of pyExcelerator with bug fixes and many enhancements. The (very basic) xls reading capability of pyExcelerator was eradicated from xlwt. To read xls files, use xlrd.
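A minimal sketch of that read/write split follows; the filenames and cell values are placeholders, not part of the original answer.

import xlrd
import xlwt

# Reading: xlrd parses the whole .xls workbook up front
book = xlrd.open_workbook('input.xls')
sheet = book.sheet_by_index(0)
for row_idx in range(sheet.nrows):
    print(sheet.row_values(row_idx))

# Writing: xlwt builds a new .xls workbook in memory and saves it
out_book = xlwt.Workbook()
out_sheet = out_book.add_sheet('Sheet1')
out_sheet.write(0, 0, 'hello')  # write(row, column, value)
out_book.save('output.xls')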

If it's taking 20 minutes to load a 100MB xls file, you must be using one or more of: a slow computer, a computer with very little available memory, or an older version of Python.

Neither pyExcelerator nor xlrd read password-protected files.

Here's a link that covers xlrd and xlwt.

Disclaimer: I'm the author of xlrd and maintainer of xlwt.

Python: make reading an Excel file faster

For such small workbooks there is no need to use read-only mode, and by using it injudiciously you are causing the problem yourself: in read-only mode, every call to ws.cell() forces openpyxl to parse the worksheet again.

So either stop using read-only mode, or use ws.iter_rows(), as I advised on your previous question.
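A minimal sketch of the read-only plus iter_rows() pattern; the filename and row limit below are placeholders.

from openpyxl import load_workbook

wb = load_workbook('workbook.xlsx', read_only=True)
ws = wb.active

# iter_rows() streams rows in order instead of re-parsing the sheet
# the way repeated ws.cell() calls do in read-only mode
for row in ws.iter_rows(min_row=1, max_row=100, values_only=True):
    print(row)  # each row is a tuple of cell values

wb.close()  # read-only workbooks hold the file open until closed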

In general, if you think something is running slowly you should profile it rather than just trying something out and hoping for the best.
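For example, the standard library's cProfile makes a quick profiling pass easy; process_workbook below is just a hypothetical stand-in for whatever function you suspect is slow.

import cProfile
import pstats

# Run the suspect function under the profiler and dump stats to a file
cProfile.run('process_workbook()', 'profile.out')

# Show the ten calls with the highest cumulative time
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)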

I want to make my Excel file read and write fast with openpyxl

I think the biggest issue making it so slow is that you are loading and saving the file for every row. If you move the load and save outside of the loop, the code should be much faster.

import openpyxl

book = openpyxl.load_workbook('semsar_full.xlsx')
sheet = book.active
row = 1
counter = 0
while row <= 20980:
    a3 = sheet.cell(row=row, column=1)
    a4 = a3.value + ':::'
    sheet.cell(row=row, column=1, value=a4)
    counter += 1
    row += 1
    print(counter)

book.save('semsar_full.xlsx')
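An equivalent sketch of the same fix using sheet.iter_rows() instead of manual row counting; like the original, it assumes column A holds string values.

import openpyxl

book = openpyxl.load_workbook('semsar_full.xlsx')
sheet = book.active

# One load, one save; cells yielded by iter_rows() on a normal
# (non read-only) workbook can be modified in place
for (cell,) in sheet.iter_rows(min_row=1, max_row=20980, max_col=1):
    cell.value = cell.value + ':::'

book.save('semsar_full.xlsx')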

Loading the first 100 rows of an Excel file

Cause

pandas uses the xlrd package under the hood for reading Excel files. The default behaviour of xlrd seems to be to load the entire Excel workbook into memory, regardless of how much data is actually read out in the end. This would explain why you're seeing no reduction in loading time when you use the nrows parameter of pd.read_excel().
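In other words, with an xlrd-backed read the call below still parses the whole workbook even though only 100 rows are requested (the filename here just mirrors the usage example further down).

import pandas as pd

# nrows limits the rows returned, not the rows parsed by xlrd
df = pd.read_excel('very_large_workbook.xlsx', nrows=100)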

xlrd does offer the possibility to load worksheets on demand instead, but unfortunately that won't help much if all your data is in one single very large Excel worksheet (plus this option doesn't seem to support .xlsx files).
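For completeness, a rough sketch of that on-demand option; the filename is a placeholder and the file must be .xls.

import xlrd

# on_demand=True defers parsing of each worksheet until first access,
# which only helps when the data is spread across many sheets
book = xlrd.open_workbook('big_workbook.xls', on_demand=True)
sheet = book.sheet_by_index(0)  # this sheet is parsed now
print(sheet.nrows)
book.unload_sheet(0)            # release that sheet's memory again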

Solution

The excel parsing package openpyxl does offer the possibility to load individual excel rows on demand (i.e. only the needed excel rows are loaded into memory). With a little bit of custom code, openpyxl can be harnessed to retrieve your excel data as a pandas dataframe:

import openpyxl
import pandas as pd


def read_excel(filename, nrows):
    """Read out a subset of rows from the first worksheet of an excel workbook.

    This function will not load more excel rows than necessary into memory, and is
    therefore well suited for very large excel files.

    Parameters
    ----------
    filename : str or file-like object
        Path to excel file.
    nrows : int
        Number of rows to parse (starting at the top).

    Returns
    -------
    pd.DataFrame
        Column labels are constructed from the first row of the excel worksheet.

    """
    # Parameter `read_only=True` leads to excel rows only being loaded as-needed
    book = openpyxl.load_workbook(filename=filename, read_only=True, data_only=True)
    first_sheet = book.worksheets[0]
    rows_generator = first_sheet.values

    header_row = next(rows_generator)
    data_rows = [row for (_, row) in zip(range(nrows - 1), rows_generator)]
    return pd.DataFrame(data_rows, columns=header_row)


# USAGE EXAMPLE
dframe = read_excel('very_large_workbook.xlsx', nrows=100)

Using this code to load the first 100 rows of a >100MB single-sheet excel workbook takes just <1sec on my machine, whereas doing the same with pd.read_excel(nrows=100) takes >2min.
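If you want to reproduce that comparison yourself, a rough timing sketch (reusing the read_excel function defined above) could look like this; the printed numbers will of course depend on your file and machine.

import time
import pandas as pd

start = time.perf_counter()
dframe = read_excel('very_large_workbook.xlsx', nrows=100)
print(f'openpyxl streaming: {time.perf_counter() - start:.1f}s')

start = time.perf_counter()
dframe = pd.read_excel('very_large_workbook.xlsx', nrows=100)
print(f'pd.read_excel:      {time.perf_counter() - start:.1f}s')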


