Html Parsing of Cricinfo Scorecards

html parsing of cricinfo scorecards

There are 2 techniques that I use for "VBA". I will describe them 1 by one.

1) Using FireFox / Firebug Addon / Fiddler

2) Using Excel's inbuilt facility to get data from the web

Since this post will be read by many so I will even cover the obvious. Please feel free to skip whatever part you know

1) Using FireFox / Firebug Addon / Fiddler

FireFox : http://en.wikipedia.org/wiki/Firefox
Free download (http://www.mozilla.org/en-US/firefox/new/)

Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29
Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)

Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29
Free download (http://www.fiddler2.com/fiddler2/)

Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.

Sample Image

Another example would be finding the name of a table on a website which has the data that you need scrapped.

I use Fiddler only when I am using XMLHTTP. It helps me to see the exact info being passed when you click on a button. Because of the increase in the number of BOTS which scrape the sites, most sites now, to prevent automatic scrapping, capture your mouse coordinates and pass that information and fiddler actually helps you in debugging that info that is being passed. I will not get into much details here about it as this info can be used maliciously.

Now let's take a simple example on how to scrape the URL posted in your question

http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html

First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.

Sample Image

So now we know that our data is stored in a table called "inningsBat1" If we can extract the contents of that table to an Excel file then we can definitely work with the data to do our analysis. Here is sample code which will dump that table in Sheet1

Before we proceed, I would recommend, closing all Excel and starting a fresh instance.

Launch VBA and insert a Userform. Place a command button and a webcrowser control. Your Userform might look like this

Sample Image

Paste this code in the Userform code area

Option Explicit

'~~> Set Reference to Microsoft HTML Object Library

Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Private Sub CommandButton1_Click()
    Dim URL As String
    Dim oSheet As Worksheet

    Set oSheet = Sheets("Sheet1")

    URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"

    PopulateDataSheets oSheet, URL

    MsgBox "Data Scrapped. Please check " & oSheet.Name
End Sub

Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
    Dim tbl As HTMLTable
    Dim tr As HTMLTableRow
    Dim insertRow As Long, Row As Long, col As Long

    On Error GoTo whoa

    WebBrowser1.navigate URL

    WaitForWBReady

    Set tbl = WebBrowser1.Document.getElementById("inningsBat1")

    With wsk
        .Cells.Clear

        insertRow = 0
        For Row = 0 To tbl.Rows.Length - 1
            Set tr = tbl.Rows(Row)
            If Trim(tr.innerText) <> "" Then
                If tr.Cells.Length > 2 Then
                    If tr.Cells(1).innerText <> "Total" Then
                        insertRow = insertRow + 1
                        For col = 0 To tr.Cells.Length - 1
                            .Cells(insertRow, col + 1) = tr.Cells(col).innerText
                        Next
                    End If
                End If
            End If
        Next
    End With
whoa:
    Unload Me
End Sub

Private Sub Wait(ByVal nSec As Long)
    nSec = nSec + Timer
    While Timer < nSec
       DoEvents
        Sleep 100
    Wend
End Sub

Private Sub WaitForWBReady()
    Wait 1
    While WebBrowser1.ReadyState <> 4
        Wait 3
    Wend
End Sub

Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot

Sample Image

Similarly you can scrape other info as well.

2) Using Excel's inbuilt facility to get data from the web

I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.

Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.

Sample Image

Enter the url in the "New Web Query Window" and click on "Go"

Once the page is uploaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import"

Sample Image

Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.

If you wish you can record a macro and automate this as well :)

Here is the macro that I recorded.

Sub Macro1()
    With ActiveSheet.QueryTables.Add(Connection:= _
    "URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
    , Destination:=Range("$A$1"))
        .Name = "524915"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlSpecifiedTables
        .WebFormatting = xlWebFormattingNone
        .WebTables = """inningsBat1"""
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub

Hope this helps. Let me know if you still have some queries.

Sid

Android : How to Parse HTML tags and retrieve information from an HTML page

You can open an stream to read through the contents of the URL, and then extract words between <title> and </title> tags with the help of String.substring() and String.indeOf() methods (The dirty way). or follow what this link says (regex)

Cricket Live Scores : RSS Feeds or Parsing entire HTML?

I would recommend you to parse RSS feeds of

www.espncricinfo.com

No one will give you for free the complete detailed scorecard, venues, ball by ball comparison, graphs , run rate and other things.

If you can afford you can request the concerned authorities of www.espncricinfo.com to give you the RSS feeds for money. Else for free/educational purposes, you'll have to parse their entire html pages. They are a very authentic genuine website, and have the same layout for many years now. So your HTML parsing would work for a long time for educational purposes ...

How to extract player names using Python with BeautifulSoup from cricinfo

While this is possible with BeautifulSoup, it's not the best tool for the job. All that data (and much more) is available through the API. Simply pull that and then you can parse the json to get what you want (and more). Here's a quick script though to get the 11 players for each team:

You can get the api url by using dev tools (Ctrl-Shft-I) and seeing what requests the browser makes (look at Network -> XHR in the side panel. you may need to click around to view it make the request/call)

import requests

url = 'https://site.web.api.espn.com/apis/site/v2/sports/cricket/13266/summary'

payload = {
'contentorigin': 'espn',
'event': '439146',
'lang': 'en',
'region': 'gb',
'section': 'cricinfo'}

jsonData = requests.get(url, params=payload).json()

roster = jsonData['rosters']

players = {}
for team in roster:
    players[team['team']['displayName']] = []
    for player in team['roster']:
        playerName = player['athlete']['displayName']
        players[team['team']['displayName']].append(playerName)

Output:

print (players)
{'West Indies': ['Chris Gayle', 'Andre Fletcher', 'Dwayne Bravo', 'Ramnaresh Sarwan', 'Narsingh Deonarine', 'Kieron Pollard', 'Darren Sammy', 'Nikita Miller', 'Jerome Taylor', 'Sulieman Benn', 'Kemar Roach'], 'South Africa': ['Graeme Smith', 'Loots Bosman', 'Jacques Kallis', 'AB de Villiers', 'Jean-Paul Duminy', 'Johan Botha', 'Alviro Petersen', 'Ryan McLaren', 'Roelof van der Merwe', 'Dale Steyn', 'Charl Langeveldt']}

See below:

Sample Image

How can I parse querystring from webpage?

First, your code is very hard to read. You need to let your code breathe and make it appealing for others to read it.

Second, what is causing issue is probably this line:

match_no = [x['href'].split('/',4)[4].split('.')[0] for x in soup.findAll('a', href=True, text='Scorecard')]

It is hard to read too. There are far more better and readable ways of parsing match id from URL.

Here is example of what should be working. I did take provisional date for matches:

import re

import pytz
import requests
import datetime
from bs4 import BeautifulSoup
from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError
from espncricinfo.match import Match

"""python espncricinfo library module https://github.com/dwillis/python-espncricinfo """
# from espncricinfo.match import Match


def get_match_id(link):
    match_id = re.search(r'([0-9]{7})', link)
    if match_id is None:
        return None
    return match_id.group()

# ----time-zone-calculation----
time_zone = pytz.timezone("Asia/Kolkata")
datetime_today = datetime.datetime.now(time_zone)
datestring_today = datetime_today.strftime("%Y-%m-%d")

# ------URL of page to parse-------with a date of today-----
url = "http://www.espncricinfo.com/ci/engine/match/index.html?date=datestring_today"

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

spans = soup.findAll('span', {"class": "match-no"})

matches_ids = []

for s in spans:
    for a in s.findAll('a', href=lambda href: 'scorecard' in href):
        match_id = get_match_id(a['href'])
        if match_id is None:
            continue
        matches_ids.append(match_id)


# ------parsing for matchno------
for p in matches_ids:
    # where p is a match no, e.g p = '1122282'
    m = Match(p)
    m.latest_batting
    print(m.latest_batting)

Now, I didn't have every lib that you are using here, but this should give you an idea of how to do it.

Once again, my advice is that empty lines are your friends. They are reader's friends for sure. Make your code 'breathe'.

Extract data from a web page that may not be formatted as a table

So if written a small Sub which i think should solve your Problem if i understood you correctly. Of course you will invest some work, since it only reads one stage right now. But it reads the data from every Group:

Option Explicit

Private Sub CommandButton1_Click()

'make sure you add references to Microsoft Internet Controls (shdocvw.dll) and
 'Microsoft HTML object Library.
 'Code will NOT run otherwise.

Dim objIE As SHDocVw.InternetExplorer 'microsoft internet controls (shdocvw.dll)
Dim htmlDoc As MSHTML.HTMLDocument 'Microsoft HTML Object Library
Dim htmlInput As MSHTML.HTMLInputElement
Dim htmlColl As MSHTML.IHTMLElementCollection

Set objIE = New SHDocVw.InternetExplorer

Dim htmlCurrentDoc As MSHTML.HTMLDocument 'Microsoft HTML Object Library

Dim RowNumber As Integer
            RowNumber = 1

With objIE
    .Navigate "http://worldoftanks.com/en/tournaments/1000000017/" ' Main page
    .Visible = 0
    Do While .READYSTATE <> 4: DoEvents: Loop
        Application.Wait (Now + TimeValue("0:00:01"))


        Set htmlDoc = .document

        Dim ButtonRoundData As Variant
        Set ButtonRoundData = htmlDoc.getElementsByClassName("group-stage_link")

        Dim ButtonData As Variant
        Set ButtonData = htmlDoc.getElementsByClassName("groups_link")



        Dim button As HTMLLinkElement
        For Each button In ButtonData

           Debug.Print button.nodeName

            button.Click

               Application.Wait (Now + TimeValue("0:00:02")) ' This is to prevent double entryies but it is not clean. you should definitly check if the table is still the same and wait then

            Set htmlCurrentDoc = .document
            Dim RawData As HTMLTable
            Set RawData = htmlCurrentDoc.getElementsByClassName("tournament-table tournament-table__indent")(0)



            Dim ColumnNumber As Integer
            ColumnNumber = 1

            Dim hRow As HTMLTableRow
            Dim hCell As HTMLTableCell
            For Each hRow In RawData.Rows

                For Each hCell In hRow.Cells
                    Cells(RowNumber, ColumnNumber).Value = hCell.innerText
                    ColumnNumber = ColumnNumber + 1
                Next hCell
                ColumnNumber = 1
                RowNumber = RowNumber + 1
            Next hRow

            RowNumber = RowNumber + 3
        Next button
    End With

End Sub

What it does is starting an invisible IE, reads the data, clicks the button, reads the next and so on ...

for Debugging i suggest to set .Visible to 1, so you will se what happens.

EDIT 1: if you get a debbuging error, try to Abort and run it again, it definitly Needs some error handling, if the Website isn't loaded right.

EDIT 2: Made it a bit stabler, you should really pay Attention, since the Webpage takes some time to load, you MUST check if the data has changed before writting it. if it hasn't changed wait a second or so and then try again.

Here some sample data i got in Excel:

Sample Data

Extract values from HTML TD and Tr

After some fiddling I have derived a regex/VBA solution using

XMLHTTP to access the site (change strSite to suit)
a Regexp to get the required numbers
a variant array with 20 records to hold, then dump the numbers to the active sheet

output
Looking at the source HTML to find Regex patterns

The Call options have a common starting and finishing string that delimit the 10 values, but there are three different strings

Strings 1-4,7-10 for each record match <td class="ylwbg">X</td>
String 6 has a Style (and other text) preceding the > before the X
String 5 contains a much longer <a href textX</a>

A regex of
.Pattern = "(<tdclass=""ylwbg"")(Style.+?){0,1}>(.+?)(<\/td>)"
extracts all the needed strings, but further work is needed later on string 5

The Put options start with <td class="nobg" so these are happily not extracted by a regex that gets points 1-3

Sample Image
Actual Code

    Sub GetTxt()
    Dim objXmlHTTP As Object
    Dim objRegex As Object
    Dim objRegMC As Object
    Dim objRegM As Object
    Dim strResponse As String
    Dim strSite As String
    Dim lngCnt As Long
    Dim strTemp As String
    Dim X(1 To 20, 1 To 10)
    X(1, 1) = "OI"
    X(1, 2) = "Chng in vol"
    X(1, 3) = "Volume"
    X(1, 4) = "IV"
    X(1, 5) = "LTP"
    X(1, 6) = "Net Chg"
    X(1, 7) = "Bid Qty"
    X(1, 8) = "Bid Price"
    X(1, 9) = "Ask Price"
    X(1, 10) = "Ask Qnty"

    Set objXmlHTTP = CreateObject("MSXML2.XMLHTTP")
    strSite = "http://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionDates.jsp?symbol=NIFTY&instrument=OPTIDX&strike=4700.00"

    On Error GoTo ErrHandler
    With objXmlHTTP
        .Open "GET", strSite, False
        .Send
        If .Status = 200 Then strResponse = .ResponseText
    End With
    On Error GoTo 0

    Set objRegex = CreateObject("vbscript.regexp")
    With objRegex
        '*cleaning regex* to remove all spaces
        .Pattern = "[\xA0\s]+"
        .Global = True
        strResponse = .Replace(strResponse, vbNullString)
        .Pattern = "(<tdclass=""ylwbg"")(Style.+?){0,1}>(.+?)(<\/td>)"
        If .Test(strResponse) Then
            lngCnt = 20
            Set objRegMC = .Execute(strResponse)
            For Each objRegM In objRegMC
                lngCnt = lngCnt + 1
                If Right$(objRegM.submatches(2), 2) <> "a>" Then
                    X(Int((lngCnt - 1) / 10), IIf(lngCnt Mod 10 > 0, lngCnt Mod 10, 10)) = objRegM.submatches(2)
                Else
                'Get submatches of the form <a href="/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=4700.00&type=CE&expiry=23FEB2012" target="_blank"> 206.40</a>
                    strTemp = Val(Right(objRegM.submatches(2), Len(objRegM.submatches(2)) - InStrRev(objRegM.submatches(2), """") - 1))
                    X(Int((lngCnt - 1) / 10), IIf(lngCnt Mod 10 > 0, lngCnt Mod 10, 10)) = strTemp
                End If
            Next
        Else
            MsgBox "Parsing unsuccessful", vbCritical
        End If
    End With
    Set objRegex = Nothing
    Set objXmlHTTP = Nothing
    [a1].Resize(UBound(X, 1), UBound(X, 2)) = X
    Exit Sub
ErrHandler:
    MsgBox "Site not accessible"
    If Not objXmlHTTP Is Nothing Then Set objXmlHTTP = Nothing
End Sub

Html Parsing of Cricinfo Scorecards