Pandas read_excel() fails with xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document

After a lot of searching, the only fix I've found is to open and re-save each Excel document, which seems to 'strip' the OLE2 format from it. I automated the process with the following VBScript:

' Open and re-save every matching workbook in a folder.
Dim objFSO, objFolder, objFile
Dim objExcel, objWB
Dim MyFolder
Set objExcel = CreateObject("Excel.Application")
Set objFSO = CreateObject("Scripting.FileSystemObject")
MyFolder = "<PATH/TO/FILES>"
Set objFolder = objFSO.GetFolder(MyFolder)
For Each objFile In objFolder.Files
    ' Right(..., 4) compares the last four characters of the name, e.g. ".xls"
    If Right(objFile.Name, 4) = "<EXTENSION>" Then
        Set objWB = objExcel.Workbooks.Open(objFile.Path)
        objWB.Save
        objWB.Close
    End If
Next
objExcel.Quit
Set objExcel = Nothing
Set objFSO = Nothing
WScript.Echo "Done"

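Make sure to change the folder path and the extension. Because the script compares the last four characters of the file name, the extension must be four characters long including the dot (for example .xls).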

Pandas suddenly cannot open Excel file (can't find workbook in OLE2 compound document)

After months of struggling with this error, I learned that the affected files were being edited with an older version of Microsoft Office (Office 2007 in this case). I then settled on a clumsy workaround:
Open each file in a compatible Excel version and save a copy in a different folder; then open the copy with pandas' read_excel function, and it should load normally.
To automate this, I wrote a PowerShell script that opens the original file and saves the copy. Run it as often as the data is updated:

$FileName = "\\path\to\the\source\file.xlsx"
$FileNameCopy = "\\path\to\the\copy\file.xlsx"

$xl = New-Object -comobject Excel.Application
# repeat this for every file concerned
$wb = $xl.Workbooks.open("$FileName",3)
$wb.SaveAs($FileNameCopy)
$wb.Close($False)

$xl.Quit()

Now I can have my data loaded normally again.

Pandas unable to open this Excel file

I will answer my own question. As ayhan pointed out in a comment, protected Excel files cannot be read by xlrd. One solution is to remove the protection.

I need the command to unprotect an Excel file from python

Another solution for reading a protected Excel file is to use xlwings. I have verified that xlwings can read a protected Excel file while the file is open in Excel.
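Before choosing between these options, it can help to confirm what kind of container the file actually is. An ordinary .xlsx or .xlsm file is a ZIP archive, while legacy .xls files and, typically, password-encrypted workbooks are stored as OLE2 compound documents. The following is a minimal sketch that checks the file's first bytes; the function name `sniff_container` is made up for illustration and is not part of pandas or xlrd.

```python
# Well-known file signatures ("magic bytes").
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 / Compound File Binary
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header


def sniff_container(path):
    """Return 'ole2', 'zip', or 'unknown' based on the file's first bytes.

    Hypothetical helper for diagnosis only; it does not parse the workbook.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

If a file with an .xlsx extension comes back as 'ole2', that suggests it is not a plain ZIP-based workbook, which would be consistent with the protection problem described above.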

pd.read_excel can't read xlsm file

Pandas does support xlsm files.

That error often occurs when you try to read password-protected Excel files. If that is your case, here is a workaround:

https://davidhamann.de/2018/02/21/read-password-protected-excel-files-into-pandas-dataframe/
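Independently of the approach in the linked article, one option that does not need Excel installed is to decrypt the workbook in memory first. The sketch below assumes the third-party msoffcrypto-tool package (not mentioned in the original answers) and that you know the password; the function name `read_encrypted_excel` is hypothetical.

```python
import io


def read_encrypted_excel(path, password, **read_excel_kwargs):
    """Decrypt a password-protected workbook in memory and load it with pandas.

    Sketch only: assumes msoffcrypto-tool and pandas are installed.
    """
    # Imported lazily so this module can be imported without the packages.
    import msoffcrypto  # third-party: pip install msoffcrypto-tool
    import pandas as pd

    decrypted = io.BytesIO()
    with open(path, "rb") as fh:
        office_file = msoffcrypto.OfficeFile(fh)
        office_file.load_key(password=password)  # supply the known password
        office_file.decrypt(decrypted)           # write decrypted bytes to the buffer
    decrypted.seek(0)
    # Extra keyword arguments (sheet_name, engine, ...) go straight to pandas.
    return pd.read_excel(decrypted, **read_excel_kwargs)
```

A call might look like `df = read_encrypted_excel('file.xlsm', 'secret')`, where both the file name and the password are placeholders.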

Unable to read xlsb file using pandas

After looking into the problem a bit more and following @Datanovice's comment, it works for me once I update to pandas v1.0.
I am using Ubuntu 16.04, which only updates Python as far as 3.5 automatically, while pandas v1.0 requires Python 3.6 or later. So even after updating everything to the latest available versions, I could not run the code.
The fix is to install Python 3.6 and then install pandas v1.0 on it:

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.6

With pandas v1.0 running on Python 3.6, we can simply pass engine='pyxlsb' to read_excel to read the file (the pyxlsb package must be installed).

import pandas as pd
df3 = pd.read_excel('a.xlsb', engine = 'pyxlsb')

Reference to install python3.6 on Ubuntu 16.04: https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get
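When a script has to handle several Excel formats, the engine can be chosen from the file extension instead of being hard-coded. This is a rough sketch: the mapping and the helper name `pick_engine` are assumptions for illustration, each engine needs its own package installed, and which engines are available depends on the pandas version.

```python
from pathlib import Path

# Engine names accepted by pandas.read_excel (hypothetical default mapping).
ENGINE_BY_SUFFIX = {
    ".xlsb": "pyxlsb",    # binary workbooks, needs pandas >= 1.0
    ".xls": "xlrd",       # legacy BIFF workbooks
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
}


def pick_engine(path):
    """Return a read_excel engine name for the given file path."""
    suffix = Path(path).suffix.lower()
    try:
        return ENGINE_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no engine known for {suffix!r}") from None
```

It could then be used as `pd.read_excel(path, engine=pick_engine(path))`.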


