Pandas read_excel() fails with xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document
After a lot of searching, the only fix I have found is to open and re-save each Excel document, which seems to 'strip' it of its OLE2 format. I automated the process with the following VBScript:
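' Open and re-save every matching workbook in a folder.
Dim objFSO, objFolder, objFile
Dim objExcel, objWB
Dim MyFolder, MyExtension
Set objExcel = CreateObject("Excel.Application")
Set objFSO = CreateObject("Scripting.FileSystemObject")
MyFolder = "<PATH/TO/FILES>"
MyExtension = "<EXTENSION>"   ' include the dot, e.g. ".xls"
Set objFolder = objFSO.GetFolder(MyFolder)
For Each objFile In objFolder.Files
    ' Compare the tail of the file name with the extension, whatever its length
    If LCase(Right(objFile.Name, Len(MyExtension))) = LCase(MyExtension) Then
        Set objWB = objExcel.Workbooks.Open(objFile.Path)
        objWB.Save
        objWB.Close
    End If
Next
objExcel.Quit
Set objExcel = Nothing
Set objFSO = Nothing
WScript.Echo "Done"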
Make sure to change the path to the folder and extension.
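Before re-saving everything, it may help to check which files are actually stored in an OLE2 container. The following is a minimal Python sketch, assuming the usual file signatures: OLE2 compound documents begin with the bytes D0 CF 11 E0 A1 B1 1A E1, and ZIP-based files such as normal .xlsx/.xlsm workbooks begin with PK. The function name sniff_container is hypothetical and is not part of pandas or xlrd.

```python
# Signatures found at the very start of the file
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound document
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP archive (normal .xlsx/.xlsm)


def sniff_container(path):
    """Return 'ole2', 'zip' or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A file with an .xlsx extension that reports "ole2" is a likely candidate for the re-save treatment. One common cause is password encryption, which wraps the workbook in an OLE2 container.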
Pandas suddenly cannot open Excel file (can't find workbook in OLE2 compound document)
After months of struggling with this error, I learned that the affected files were being edited with an older version of Microsoft Office (Office 2007, in this case). I then settled on a clumsy workaround:
Open the files in a compatible Excel version and save a copy in a different folder; then open the copy with the pandas read_excel function. It should open normally.
To automate this task, I wrote a PowerShell script that opens the original file and saves the copy. Run it as often as the data is updated:
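$FileName = "\\path\to\the\source\file.xlsx"
$FileNameCopy = "\\path\to\the\copy\file.xlsx"
# Start Excel through COM
$xl = New-Object -ComObject Excel.Application
# repeat this for every file concerned
# Second argument (UpdateLinks = 3) updates external links when the workbook opens
$wb = $xl.Workbooks.Open($FileName, 3)
$wb.SaveAs($FileNameCopy)
# Close the original without saving changes to it
$wb.Close($False)
$xl.Quit()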
Now I can have my data loaded normally again.
Pandas unable to open this Excel file
I will answer my own question. As ayhan pointed out in a comment, protected Excel files cannot be read by xlrd. One solution is to remove the protection.
I need the command to unprotect an Excel file from python
Another way to read a protected Excel file is to use xlwings. I have verified that xlwings can read a protected Excel file while the file is open in Excel.
pd.read_excel can't read xlsm file
Pandas does support xlsm files.
That error often occurs when you try to read a password-protected Excel file. If that is your case, here is a workaround:
https://davidhamann.de/2018/02/21/read-password-protected-excel-files-into-pandas-dataframe/
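For reference, the general idea of that kind of workaround can be sketched as below. This is only a sketch under assumptions: it relies on the third-party msoffcrypto-tool package (imported as msoffcrypto) being installed next to pandas, and the function name read_protected_excel and its parameters are hypothetical.

```python
import io


def read_protected_excel(path, password, **read_excel_kwargs):
    """Decrypt a password-protected workbook in memory and load it with pandas."""
    # Imported lazily so this module still loads where the packages are absent
    import msoffcrypto
    import pandas as pd

    decrypted = io.BytesIO()
    with open(path, "rb") as encrypted:
        office_file = msoffcrypto.OfficeFile(encrypted)
        office_file.load_key(password=password)  # supply the workbook password
        office_file.decrypt(decrypted)           # write the plain workbook to memory
    decrypted.seek(0)
    return pd.read_excel(decrypted, **read_excel_kwargs)
```

The decrypted copy only ever lives in memory, so no unprotected file is written to disk.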
Unable to read xlsb file using pandas
After looking into the problem a bit more and following @Datanovice's comment, I found that it works for me once I update to pandas v1.0.
I am using Ubuntu 16.04, which can only update Python to 3.5 automatically, while pandas v1.0 requires Python 3.6 or later. Hence, even after updating to the latest available versions, I was not able to run the code.
We can install Python 3.6 and then install pandas v1.0 for it:
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.6
Using Python 3.6 with pandas v1.0, we can simply pass pyxlsb as the engine to read_excel to read the file (the pyxlsb package must be installed as well).
import pandas as pd
df3 = pd.read_excel('a.xlsb', engine = 'pyxlsb')
Reference to install python3.6 on Ubuntu 16.04: https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get
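Since the pyxlsb engine only exists in pandas 1.0 and later, a script that has to run on mixed environments could check the version first. A minimal sketch; the helper names pandas_supports_pyxlsb and read_xlsb are hypothetical and not part of pandas:

```python
def pandas_supports_pyxlsb(version):
    """Return True if this pandas version string is 1.0 or newer."""
    major = int(version.split(".")[0])
    return major >= 1


def read_xlsb(path, **kwargs):
    """Read an .xlsb file, failing early with a clear message on old pandas."""
    import pandas as pd  # imported here so the version helper works without pandas

    if not pandas_supports_pyxlsb(pd.__version__):
        raise RuntimeError(
            "pandas %s is too old for engine='pyxlsb'; upgrade to 1.0 or later"
            % pd.__version__
        )
    # The pyxlsb package must be installed for this engine to work
    return pd.read_excel(path, engine="pyxlsb", **kwargs)
```

Calling read_xlsb('a.xlsb') then behaves like the read_excel call above, but reports an outdated pandas explicitly instead of failing on the unknown engine.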