Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own
Or can someone help me take the sheet from xlrd and convert it into a
Pandas dataframe?
pd.read_excel
can take a book...
import xlrd
book = xlrd.open_workbook(filename='./file_check/file.xls')
df = pd.read_excel(book, skiprows=5)
print(df)
some column headers
0 1 some foo
1 2 strings bar
2 3 here yes
3 4 too no
I'll include the code below that may help if you want to check/handle Excel file types. Maybe you can adapt it for your needs.
The code loops through a local folder and shows the file and extension but then uses python-magic
to drill into it. It also has a column showing guessing from mimetypes
but that isn't as good. Do zoom into the image of the frame and see that some .xls
are not what the extension says. Also a .txt
is actually an Excel file.
import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic
path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")
data = []
for file in all_files:
name, extension = os.path.splitext(file)
data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])
df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])
# del df['magic.from_file(file, mime=True)']
df
From there you could filter files based on their type:
xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
xls_file_format = 'application/vnd.ms-excel'
for file in all_files:
if magic.from_file(file, mime=True) == xlsx_file_format:
print('xlsx')
# DO SOMETHING SPECIAL WITH XLSX FILES
elif magic.from_file(file, mime=True) == xls_file_format:
print('xls')
# DO SOMETHING SPECIAL WITH XLS FILES
else:
continue
dfs = []
for file in all_files:
if (magic.from_file(file, mime=True) == xlsx_file_format) or \
(magic.from_file(file, mime=True) == xls_file_format):
# who cares, it all works with this for the demo...
df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
dfs.append(df)
print('\nHow many frames did we get from seven files? ', len(dfs))
Output:
xlsx
xls
xls
xlsx
How many frames did we get from seven files? 4
How to read/parse an .xls file in Python (XML schema)
Try to use "Index" attribute instead of cell element index:
# add "ss" namespace declaration to the namespaces map
ns = {"doc": "urn:schemas-microsoft-com:office:spreadsheet", "ss": "urn:schemas-microsoft-com:office:spreadsheet"}
# in function call reference element "Cell" having an attribute "Index" with value "7"
getvalueofnode(node.find('doc:Cell[@ss:Index="7"]/doc:Data', ns))
Same approach could be used for other cells as well.
This code will try to find cell element with the given index attribute. If not found function getvalueofnode() will return None.
To get parentColDimension
following code could be used:
for parentColDimension in root.findall('.//doc:header/doc:parentColDimension', ns):
print(parentColDimension.get('label'))
parsing excel documents with python
You're best bet for parsing Excel files would be the xlrd library. The python-excel.org site has links and examples for xlrd and related python excel libraries, including a pdf document that has some good examples of using xlrd. Of course, there are also lots of related xlrd questions on StackOverflow that might be of use.
One caveat with the xlrd library is that it will only work with xls
(Excel 2003 and earlier versions of excel) file formats and not the more recent xlsx
file format. There is a newer library openpyxl for dealing with the xlsx
, but I have never used it.
UPDATE:
As per John's comment, the xlrd library now supports both xls
and xlsx
file formats.
Hope that helps.
Related Topics
How to Find the Last Occurrence of an Item in a Python List
Break // in X Axis of Matplotlib
Generating a List of Random Numbers, Summing to 1
Finding Median of List in Python
How to Pass a Default Argument Value of an Instance Member to a Method
Why am I Getting Attributeerror: Object Has No Attribute
Pyaudio Working, But Spits Out Error Messages Each Time
Converting String with Utc Offset to a Datetime Object
Differencebetween Slice Assignment That Slices the Whole List and Direct Assignment
What Does It Mean to "Call" a Function in Python
Zip Variable Empty After First Use
Including Non-Python Files with Setup.Py
Insert a Row to Pandas Dataframe
Imread Returns None, Violating Assertion !_Src.Empty() in Function 'Cvtcolor' Error
How to Build Multiple Submit Buttons Django Form