Read merged cells in Excel with Python
I just tried this and it seems to work for your sample data:
import xlrd

all_data = []
excel = xlrd.open_workbook(excel_dir + excel_file)  # excel_dir and excel_file come from your own code
sheet_0 = excel.sheet_by_index(0)  # Open the first tab

prev_row = [None] * sheet_0.ncols
for row_index in range(sheet_0.nrows):
    row = []
    for col_index in range(sheet_0.ncols):
        value = sheet_0.cell(rowx=row_index, colx=col_index).value
        # xlrd reports an empty cell as ''; comparing avoids len() failing on numeric cells
        if value == '':
            value = prev_row[col_index]
        row.append(value)
    prev_row = row
    all_data.append(row)
returning
[['2', '0', '30'], ['2', '1', '20'], ['2', '5', '52']]
It keeps track of the values from the previous row and uses them if the corresponding value from the current row is empty.
Note that the above code does not check if a given cell is actually part of a merged set of cells, so it could possibly duplicate previous values in cases where the cell should really be empty. Still, it might be of some help.
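To make that caveat concrete, here is a minimal sketch of the same carry-down logic applied to plain lists instead of an xlrd sheet. The function name carry_down and the sample rows are made up for illustration; empty cells are written as '' to mirror what xlrd returns.

```python
def carry_down(rows):
    """Replace every empty string with the value directly above it."""
    filled = []
    prev_row = [None] * (len(rows[0]) if rows else 0)
    for row in rows:
        new_row = [prev if value == '' else value
                   for value, prev in zip(row, prev_row)]
        filled.append(new_row)
        prev_row = new_row
    return filled


# Hypothetical sheet: column 0 is one merged cell spanning all three rows,
# while the last cell of the third row was deliberately left blank.
rows = [
    ['2', '0', '30'],
    ['',  '1', '20'],
    ['',  '5', ''],
]
result = carry_down(rows)
print(result)
```

The merged column is filled as intended, but the deliberately blank cell in the last row also picks up '20' from the row above, which is exactly the duplication described in the note.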
Additional information:
I subsequently found a documentation page that talks about a merged_cells attribute that one can use to determine the cells that are included in various ranges of merged cells. The documentation says that it is "New in version 0.6.1", but when I tried to use it with xlrd-0.9.3 as installed by pip I got the error
NotImplementedError: formatting_info=True not yet implemented
I'm not particularly inclined to start chasing down different versions of xlrd to test the merged_cells feature, but perhaps you might be interested in doing so if the above code is insufficient for your needs and you encounter the same error that I did with formatting_info=True.
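If merged_cells does work in your setup, it can remove the guesswork. The sketch below assumes each range is a tuple (rlo, rhi, clo, chi) with 0-based indices and exclusive upper bounds, which is how xlrd documents the attribute. The name fill_merged and the sample data are hypothetical, and the grid is a plain list of lists rather than a real sheet.

```python
def fill_merged(rows, merged_ranges):
    """Copy each merged range's top-left value into the rest of that range."""
    filled = [list(row) for row in rows]  # leave the input untouched
    for rlo, rhi, clo, chi in merged_ranges:
        anchor = rows[rlo][clo]
        # rhi and chi are exclusive, so range() covers the block exactly
        for r in range(rlo, rhi):
            for c in range(clo, chi):
                filled[r][c] = anchor
    return filled


rows = [
    ['2', '0', '30'],
    ['',  '1', '20'],
    ['',  '5', ''],
]
merged_ranges = [(0, 3, 0, 1)]  # column 0, rows 0 to 2, merged into one cell
result = fill_merged(rows, merged_ranges)
print(result)
```

Only cells inside a declared merged range are touched, so a cell that is empty on purpose stays empty.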
How to read merged cells in python using openpyxl?
I amended my code and it works now.
import openpyxl

wb = openpyxl.load_workbook('book1.xlsx')
sheet = wb['info']  # get_sheet_by_name() is deprecated
all_data = []
for row_index in range(1, sheet.max_row + 1):
    row = []
    for col_index in range(1, sheet.max_column + 1):
        vals = sheet.cell(row_index, col_index).value
        if vals is None:
            # Empty cell: check whether it lies inside a merged range
            for crange in sheet.merged_cells.ranges:
                clo, rlo, chi, rhi = crange.bounds  # (min_col, min_row, max_col, max_row)
                if rlo <= row_index <= rhi and clo <= col_index <= chi:
                    # A merged range keeps its value in the top-left cell
                    vals = sheet.cell(rlo, clo).value
                    break
        row.append(vals)
    all_data.append(row)
print(all_data)

# Append the filled rows below the existing data and save under a new name
for row in all_data:
    sheet.append(row)
wb.save('bbbb.xlsx')
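The loop above rescans every merged range for each empty cell. On sheets with many merged ranges, one alternative is to build a lookup from each covered cell to its top-left cell once. This sketch works on plain (min_col, min_row, max_col, max_row) tuples, the 1-based inclusive order that crange.bounds yields in the code above. build_anchor_map is a hypothetical helper, not part of openpyxl.

```python
def build_anchor_map(bounds_list):
    """Map each (row, col) inside a merged range to that range's top-left (row, col)."""
    anchors = {}
    for min_col, min_row, max_col, max_row in bounds_list:
        for r in range(min_row, max_row + 1):
            for c in range(min_col, max_col + 1):
                anchors[(r, c)] = (min_row, min_col)
    return anchors


# A1:A3 and B5:C5 written as bounds tuples
anchors = build_anchor_map([(1, 1, 1, 3), (2, 5, 3, 5)])
print(anchors[(3, 1)])      # (1, 1)
print(anchors.get((4, 1)))  # None, because A4 is not merged
```

With a real worksheet, one would presumably pass [r.bounds for r in sheet.merged_cells.ranges] and, for an empty cell found in the map, read sheet.cell(*anchors[(row, col)]).value.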
Pandas: Parse Excel spreadsheet with merged cells and blank values
Something like this should work, assuming you know the starting row of your Excel file (or come up with a better way to check that):
import pandas as pd
import numpy as np
import openpyxl

def checkMerged(x, sheet):
    # Column 1 (A) is assumed to hold the "Day" values
    cell = sheet.cell(int(x["Row"]), 1)
    for mergedcell in sheet.merged_cells.ranges:
        if cell.coordinate in mergedcell:
            return True
    return False

def test():
    filepath = "C:\\Users\\me\\Desktop\\SO nonsense\\PandasMergeCellTest.xlsx"
    df = pd.read_excel(filepath)
    wb = openpyxl.load_workbook(filepath)
    sheet = wb["Sheet1"]
    df["Row"] = np.arange(len(df)) + 2  # My headers were row 1 so adding 2 to get the row numbers
    df["Merged"] = df.apply(lambda x: checkMerged(x, sheet), axis=1)
    # Forward-fill only rows inside a merged range; other rows keep their original value
    df["Day"] = np.where(df["Merged"], df["Day"].ffill(), df["Day"])
    df = df.drop(columns=["Row", "Merged"])
    print(df)

test()
Pandas merged cell issue when reading from excel
I managed to find a fix:
import pandas as pd
import xlrd

def read_excel(path):
    excel = None
    if path.endswith('xlsx'):
        # Note: xlrd 2.0 and later can only open .xls files
        excel = pd.ExcelFile(xlrd.open_workbook(path), engine='xlrd')
    elif path.endswith('xls'):
        # formatting_info=True makes xlrd record the merged-cell ranges
        excel = pd.ExcelFile(xlrd.open_workbook(path, formatting_info=True), engine='xlrd')
    else:
        raise ValueError("Could not read this type of data")
    return excel

def parse_excel(excel_file):
    sheet_0 = excel_file.book.sheet_by_index(0)
    df = excel_file.parse(0, header=None)
    return sheet_0, df

def fill_merged_na(sheet, dataframe):
    for e in sheet.merged_cells:
        rl, rh, cl, ch = e  # (row_lo, row_hi, col_lo, col_hi), upper bounds exclusive
        base_value = sheet.cell_value(rl, cl)
        dataframe.iloc[rl:rh, cl:ch] = base_value
    return dataframe
The important bits are opening the Excel file with formatting_info set to True, so that xlrd also reads formatting such as merged cells, and the fill_merged_na function, which fills only the NaN values inside merged ranges and leaves genuinely empty cells as they were.
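Because the upper bounds in xlrd's merged_cells tuples are exclusive, it is easy to be off by one when checking the ranges by eye. As a debugging aid, the hypothetical helpers below convert an (rlo, rhi, clo, chi) tuple into the A1-style range that Excel displays. This is only a sketch and not part of xlrd or pandas.

```python
def col_letter(index):
    """Convert a 0-based column index to Excel letters (0 -> 'A', 26 -> 'AA')."""
    letters = ''
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    return letters


def to_a1(merged_range):
    """Render a 0-based, half-open (rlo, rhi, clo, chi) tuple as 'A1:B2' text."""
    rlo, rhi, clo, chi = merged_range
    # The last covered row index is rhi - 1, which is row number rhi in Excel;
    # the last covered column index is chi - 1.
    return f"{col_letter(clo)}{rlo + 1}:{col_letter(chi - 1)}{rhi}"


print(to_a1((1, 3, 5, 7)))  # F2:G3
```

Printing to_a1(e) for each e in sheet.merged_cells before calling fill_merged_na is one way to confirm that the ranges match what the workbook shows.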
How to read merged excel column in python?
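import pandas as pd

df = pd.read_excel(r'C:/Users/USER1/Desktop/report.xlsx')
df = df.reset_index()
# Drop the columns pandas auto-named 'Unnamed: N' because their header cell was empty
df = df.drop(labels=df.filter(regex='Unnamed').columns, axis=1)
df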