How to open excel file fast in Python?
pyExcelerator appears not to be maintained. To write xls files, use xlwt, which is a fork of pyExcelerator with bug fixes and many enhancements. The (very basic) xls reading capability of pyExcelerator was eradicated from xlwt. To read xls files, use xlrd.
If it's taking 20 minutes to load a 100MB xls file, you must be using one or more of: a slow computer, a computer with very little available memory, or an older version of Python.
Neither pyExcelerator nor xlrd read password-protected files.
Here's a link that covers xlrd and xlwt.
Disclaimer: I'm the author of xlrd and maintainer of xlwt.
Python make reading a Excel file faster
For such small workbooks there is no need to use read-only mode, and by using it injudiciously you are causing the problem yourself. Every call to ws.cell() will force openpyxl to parse the worksheet again.
So, either you stop using read-only mode, or use ws.iter_rows() as I advised on your previous question.
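As a rough illustration of the ws.iter_rows() approach, here is a minimal sketch that reads every value of a read-only worksheet in a single pass. The helper names (build_sample_workbook, read_all_values) and the sample data are hypothetical and only there to make the example self-contained; it also assumes a reasonably recent openpyxl, since the values_only argument is not available in very old releases.

```python
from io import BytesIO

import openpyxl


def build_sample_workbook(n_rows=5, n_cols=3):
    """Create a small in-memory .xlsx file (hypothetical sample data)."""
    book = openpyxl.Workbook()
    sheet = book.active
    for r in range(1, n_rows + 1):
        sheet.append([r * 10 + c for c in range(1, n_cols + 1)])
    buffer = BytesIO()
    book.save(buffer)
    buffer.seek(0)
    return buffer


def read_all_values(source):
    """Read every cell value in one pass instead of calling ws.cell() per cell."""
    book = openpyxl.load_workbook(source, read_only=True)
    sheet = book.active
    # values_only=True yields plain tuples of values rather than cell objects
    rows = [list(row) for row in sheet.iter_rows(values_only=True)]
    book.close()  # read-only workbooks keep the underlying file open until closed
    return rows


rows = read_all_values(build_sample_workbook())
print(rows[0])
```

The point of the design is that the worksheet is streamed once from top to bottom, so the cost grows with the number of rows rather than with the number of individual cell lookups.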
In general, if you think something is running slowly, you should always profile it rather than just trying something out and hoping for the best.
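One way to follow that advice is the standard library's cProfile module. The sketch below is only an illustration: slow_lookup is a hypothetical stand-in for whatever routine you suspect, and profile_call is a small helper introduced here, not part of any library.

```python
import cProfile
import io
import pstats


def slow_lookup(n):
    """Hypothetical stand-in for a slow routine: repeated linear scans of a list."""
    data = list(range(n))
    return sum(1 for i in range(n) if i in data)


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and print only the first few entries
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(slow_lookup, 2000)
print(report)
```

Replacing slow_lookup with the function that reads your workbook should show whether the time goes into parsing, into your own loop, or somewhere else.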
i want to make my excel file read and write fast with openpyxl
I think the main reason it is so slow is that you are loading and saving the file once for every row. If you move the load and the save outside of the loop, the code should be much faster.
import openpyxl

# Load the workbook once, before the loop
book = openpyxl.load_workbook('semsar_full.xlsx')
sheet = book.active

row = 1
counter = 0
while row <= 20980:
    a3 = sheet.cell(row=row, column=1)
    a4 = a3.value + ':::'
    sheet.cell(row=row, column=1, value=a4)
    counter += 1
    row += 1
    print(counter)

# Save the workbook once, after the loop
book.save('semsar_full.xlsx')
Loading first 100 rows of excel
Cause
pandas uses the xlrd package under the hood for reading out Excel files. The default behaviour of xlrd seems to be to load the entire Excel workbook into memory, regardless of what data is read out in the end. This would explain why you're noticing no reduction in loading time when you're using the nrows parameter of pd.read_excel().
xlrd does offer the possibility to load worksheets on demand instead, but that won't be of much help unfortunately if all your data is in one single very large Excel worksheet (plus it seems that this option doesn't support .xlsx files).
Solution
The Excel parsing package openpyxl does offer the possibility to load individual Excel rows on demand (i.e. only the needed Excel rows are loaded into memory). With a little bit of custom code, openpyxl can be harnessed to retrieve your Excel data as a pandas dataframe:
import openpyxl
import pandas as pd

def read_excel(filename, nrows):
    """Read out a subset of rows from the first worksheet of an excel workbook.

    This function will not load more excel rows than necessary into memory, and is
    therefore well suited for very large excel files.

    Parameters
    ----------
    filename : str or file-like object
        Path to excel file.
    nrows : int
        Number of rows to parse (starting at the top).

    Returns
    -------
    pd.DataFrame
        Column labels are constructed from the first row of the excel worksheet.
    """
    # Parameter `read_only=True` leads to excel rows only being loaded as-needed
    book = openpyxl.load_workbook(filename=filename, read_only=True, data_only=True)
    first_sheet = book.worksheets[0]
    rows_generator = first_sheet.values

    header_row = next(rows_generator)
    data_rows = [row for (_, row) in zip(range(nrows - 1), rows_generator)]
    return pd.DataFrame(data_rows, columns=header_row)

# USAGE EXAMPLE
dframe = read_excel('very_large_workbook.xlsx', nrows=100)
Using this code to load the first 100 rows of a >100MB single-sheet Excel workbook takes less than 1 second on my machine, whereas doing the same with pd.read_excel(nrows=100) takes more than 2 minutes.
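The speed-up relies on the row generator being consumed lazily: only as many rows are produced as are actually requested. The following sketch demonstrates that idea in isolation, without any Excel file. make_rows is a hypothetical row source that records how many rows it has produced, and take_rows uses itertools.islice as an alternative to the zip(range(...), ...) idiom in read_excel above.

```python
from itertools import islice


def make_rows(total, log):
    """Hypothetical row source: yields rows lazily and records each row produced."""
    for i in range(total):
        log.append(i)
        yield (i, f"value-{i}")


def take_rows(rows, nrows):
    """Return the header row plus up to nrows - 1 data rows, as read_excel does."""
    header = next(rows)
    # islice stops pulling from the generator once nrows - 1 rows have been taken
    data = list(islice(rows, nrows - 1))
    return header, data


produced = []
header, data = take_rows(make_rows(1_000_000, produced), nrows=100)
print(len(data), len(produced))
```

Although the source could yield a million rows, only the requested ones are ever generated, which is why reading the top of a very large worksheet can stay fast in read-only mode.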