Reading Excel File: How to Find the Start Cell in Messy Spreadsheets

Reading Excel file: How to find the start cell in messy spreadsheets?

I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain.

For a slightly different flavor of annoyingly formatted spreadsheet, I do the following. It's not the most elegant approach (it requires two reads of the file), but it works. I like reading the file twice: the second read makes sure the columns come in with the correct types and good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up the columns myself, and the read_excel defaults are pretty good if you start at the right row.

Also, it's worth noting that readxl had an update as of today (2017-04-20). I installed the new version to see whether it would make this very easy, but I don't believe that's the case, although I could be mistaken.

library(readxl)
library(stringr)
library(dplyr)

f_path <- file.path("whatever.xlsx")

if (!file.exists(f_path)) {
  f_path <- file.choose()
}

# I read the file twice: temp_read is just to figure out where the data actually starts...

# Maybe you need something like this -
# excel_sheets <- readxl::excel_sheets(f_path)
# desired_sheet <- which(stringr::str_detect(excel_sheets, "2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path, sheet = desired_sheet)

skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"
max_cols_to_search <- 10
max_rows_to_search <- 10

# Note: the "- 0" at the end of the search is just an offset you can tweak
# (add or subtract a row) if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip > max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(
    temp_read[1:max_rows_to_search, col_skip][[1]],
    search_string
  )) - 0
}

# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
  f_path,
  sheet = desired_sheet,
  skip = skip_rows
)

# You likely don't need this if you start at the right row,
# but given that all weird spreadsheets are weird in their own way,
# you may want to operate on col_skip too, maybe like so:
# real_data <- real_data %>%
#   select(-(1:col_skip))
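
If you find yourself doing this for a lot of reports, the same peek-then-read idea can be wrapped in a small helper. The sketch below is my own (find_data_start is not a readxl function); reading the peek with col_types = "text" and n_max keeps the first read cheap and makes the string search work no matter what types readxl would otherwise guess:

find_data_start <- function(path, sheet = 1, search_string,
                            max_rows = 10, max_cols = 10) {
  # Cheap first read: top rows only, everything forced to text
  peek <- readxl::read_excel(path, sheet = sheet, col_names = FALSE,
                             col_types = "text", n_max = max_rows)
  for (col in seq_len(min(max_cols, ncol(peek)))) {
    hits <- which(stringr::str_detect(peek[[col]], search_string))
    if (length(hits) > 0) {
      return(list(skip = hits[1], col = col))  # row to skip to, column it was found in
    }
  }
  NULL  # nothing found in the searched block
}

# start <- find_data_start(f_path, desired_sheet, "Monthly Returns")
# real_data <- readxl::read_excel(f_path, sheet = desired_sheet, skip = start$skip)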

How to read merged Excel cells with R

If a VBA/R hybrid suits your purposes, here is a VBA macro that will unmerge all merged cells in a worksheet while filling every cell of each unmerged region with the corresponding value:

Sub UnMerge(ws As Worksheet)
    Dim R As Range, c As Range
    Dim v As Variant
    For Each c In ws.UsedRange
        If c.MergeCells Then
            v = c.Value
            Set R = c.MergeArea
            R.UnMerge
            R.Value = v
        End If
    Next c
End Sub

A simple test to show how it is called:

Sub test()
    UnMerge Sheets(1)
End Sub

The sub UnMerge can be used as part of a larger program that, for example, iterates over all .xlsx files in a folder and all data-containing sheets in those files, unmerging them all and saving them as .csv files.

On Edit: Native VBA file handling is somewhat annoying, so I tend to use the related scripting language VBScript when I need to iterate over multiple files. I'm not sure whether your virtual Windows can handle VBScript, but I would assume so, since VBScript is a standard part of the Windows OS. If so, see if the following works (after backing up the files, just to be safe). Save the code as a plain text file with a .vbs extension in the folder that contains the Excel files you want to modify, then simply click its icon. It will iterate over all .xls and .xlsx files in the directory that contains the script and unmerge sheet 1 in each such file. I didn't test it extensively and it contains no error handling, but I did run it on a folder with three Excel files, each containing multiple merged regions, and it behaved as expected on my Windows machine. I don't know whether it will work on your Mac:

Option Explicit

Dim fso, fol, f, xl, wb, ws, ext, v, R, c

Set fso = WScript.CreateObject("Scripting.FileSystemObject")
Set xl = CreateObject("Excel.Application")
xl.DisplayAlerts = False
xl.ScreenUpdating = False
Set fol = fso.GetFolder(fso.GetParentFolderName(WScript.ScriptFullName))

For Each f In fol.Files
    ext = LCase(fso.GetExtensionName(f.Name))
    If ext = "xls" Or ext = "xlsx" Then
        Set wb = xl.Workbooks.Open(f.Path)
        Set ws = wb.Sheets(1)
        For Each c In ws.UsedRange
            If c.MergeCells Then
                v = c.Value
                Set R = c.MergeArea
                R.UnMerge
                R.Value = v
            End If
        Next
        wb.Save
        wb.Close
    End If
Next

xl.Quit 'close the hidden Excel instance when done
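
If you would rather stay in R end-to-end (and the workbooks are .xlsx rather than legacy .xls), openxlsx can do the fill-on-read step for you via read.xlsx(..., fillMergedCells = TRUE), and writing the .csv files afterwards is trivial. A minimal sketch, with a folder name I made up for illustration:

library(openxlsx)

in_dir <- "merged_workbooks"  # hypothetical folder of .xlsx files

for (f in list.files(in_dir, pattern = "\\.xlsx$", full.names = TRUE)) {
  # fillMergedCells = TRUE repeats each merged value into every cell of its
  # merged region, which mirrors what the VBA macro above does
  dat <- openxlsx::read.xlsx(f, sheet = 1, fillMergedCells = TRUE)
  write.csv(dat, sub("\\.xlsx$", ".csv", f), row.names = FALSE)
}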

Any way to have the Find method start at the very first cell, i.e. A1?

Set the After parameter to the last cell of your range; Find starts searching after that cell and wraps around, so the search effectively begins at A1. See my answer here.

Read an online Excel file with a specific sheet and only selected columns

Use:

import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls'

# Read the 'Data' sheet, dropping the three footer rows
df = pd.read_excel(url, sheet_name='Data', skipfooter=3)
# Drop the blank 'Unnamed: n' columns, then promote the first data row to headers
df = df.drop(columns=df.filter(like='Unnamed').columns)
df.columns = df.iloc[0].to_list()
df = df[1:].reset_index(drop=True)

Output

       LB        AC        FM        UC        DL DS DP ASTV MSTV ALTV  MLTV Width  Min  Max Nmax Nzeros Mode Mean Median Variance Tendency CLASS NSP
0 120 0 0 0 0 0 0 73 0.5 43 2.4 64 62 126 2 0 120 137 121 73 1 9 2
1 132 0.00638 0 0.00638 0.00319 0 0 17 2.1 0 10.4 130 68 198 6 1 141 136 140 12 0 6 1
2 133 0.003322 0 0.008306 0.003322 0 0 16 2.1 0 13.4 130 68 198 5 1 141 135 138 13 0 6 1
3 134 0.002561 0 0.007682 0.002561 0 0 16 2.4 0 23 117 53 170 11 0 137 134 137 13 1 6 1
4 132 0.006515 0 0.008143 0 0 0 16 2.4 0 19.9 117 53 170 9 0 137 136 138 11 1 2 1
... ... ... ... ... ... .. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ..
2121 140 0 0 0.007426 0 0 0 79 0.2 25 7.2 40 137 177 4 0 153 150 152 2 0 5 2
2122 140 0.000775 0 0.006971 0 0 0 78 0.4 22 7.1 66 103 169 6 0 152 148 151 3 1 5 2
2123 140 0.00098 0 0.006863 0 0 0 79 0.4 20 6.1 67 103 170 5 0 153 148 152 4 1 5 2
2124 140 0.000679 0 0.00611 0 0 0 78 0.4 27 7 66 103 169 6 0 152 147 151 4 1 5 2
2125 142 0.001616 0.001616 0.008078 0 0 0 74 0.4 36 5 42 117 159 2 1 145 143 145 1 0 1 1

[2126 rows x 23 columns]
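
For anyone who wants the same thing from R: readxl only reads local files, so download to a temporary file first and then do the equivalent cleanup. A rough sketch, assuming (as the pandas code implies) that the real header names sit in the second row of the 'Data' sheet:

library(readxl)

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls"
tf <- tempfile(fileext = ".xls")
download.file(url, tf, mode = "wb")              # binary mode matters on Windows

ctg <- read_excel(tf, sheet = "Data", skip = 1)  # second spreadsheet row becomes the headers
ctg <- ctg[, colSums(!is.na(ctg)) > 0]           # drop the all-empty spacer columns
ctg <- head(ctg, -3)                             # drop the 3 footer rows (like skipfooter = 3)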

