Scraping Data from Website Using Vba

Scraping data from website using vba

There are several ways of doing this. This is an answer that I write hoping that all the basics of Internet Explorer automation will be found when browsing for the keywords "scraping data from website", but remember that nothing's worth as your own research (if you don't want to stick to pre-written codes that you're not able to customize).

Please note that this is one way, that I don't prefer in terms of performance (since it depends on the browser speed) but that is good to understand the rationale behind Internet automation.

1) If I need to browse the web, I need a browser! So I create an Internet Explorer browser:

Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")

2) I ask the browser to browse the target webpage. Through the use of the property ".Visible", I decide if I want to see the browser doing its job or not. When building the code is nice to have Visible = True, but when the code is working for scraping data is nice not to see it everytime so Visible = False.

With appIE
.Navigate "http://uk.investing.com/rates-bonds/financial-futures"
.Visible = True
End With

3) The webpage will need some time to load. So, I will wait meanwhile it's busy...

Do While appIE.Busy
DoEvents
Loop

4) Well, now the page is loaded. Let's say that I want to scrape the change of the US30Y T-Bond:
What I will do is just clicking F12 on Internet Explorer to see the webpage's code, and hence using the pointer (in red circle) I will click on the element that I want to scrape to see how can I reach my purpose.

Sample Image

5) What I should do is straight-forward. First of all, I will get by the ID property the tr element which is containing the value:

Set allRowOfData = appIE.document.getElementById("pair_8907")

Here I will get a collection of td elements (specifically, tr is a row of data, and the td are its cells. We are looking for the 8th, so I will write:

Dim myValue As String: myValue = allRowOfData.Cells(7).innerHTML

Why did I write 7 instead of 8? Because the collections of cells starts from 0, so the index of the 8th element is 7 (8-1). Shortly analysing this line of code:

  • .Cells() makes me access the td elements;
  • innerHTML is the property of the cell containing the value we look for.

Once we have our value, which is now stored into the myValue variable, we can just close the IE browser and releasing the memory by setting it to Nothing:

appIE.Quit
Set appIE = Nothing

Well, now you have your value and you can do whatever you want with it: put it into a cell (Range("A1").Value = myValue), or into a label of a form (Me.label1.Text = myValue).

I'd just like to point you out that this is not how StackOverflow works: here you post questions about specific coding problems, but you should make your own search first. The reason why I'm answering a question which is not showing too much research effort is just that I see it asked several times and, back to the time when I learned how to do this, I remember that I would have liked having some better support to get started with. So I hope that this answer, which is just a "study input" and not at all the best/most complete solution, can be a support for next user having your same problem. Because I have learned how to program thanks to this community, and I like to think that you and other beginners might use my input to discover the beautiful world of programming.

Enjoy your practice ;)

Login to web page and scraping data using VBA

The first issue seems to be that you have a mismatched comma on the third line; however, seeing as you're are complaining about an automation error I think that may be just a typo on this site.

I can't see your internal website so I'm just guessing but I suspect the next issue is that there is no element with ID "txtPwd". You can check to confirm by pressing Ctrl-Shift-C and then selecting the username and password entry boxes.

Finally, depending on how the site is set up your .document.forms(0).submit may not work. You may need to find the ID for the submit button class submit. Below is a function I created a while back for such a task:

Function logIn(userName As String, password As String) As Boolean
'This routine logs into the grade book using given credentials

Dim ie As New InternetExplorer
Dim doc As HTMLDocument
On Error GoTo loginFail

ie.Navigate "[website here]"
'ie.Visible = True
Do While ie.ReadyState <> READYSTATE_COMPLETE Or ie.Busy: DoEvents: Loop 'Wait server to respond
Set doc = ie.Document
doc.getElementsByName("u_name").Item(0).Value = userName 'These may be different for you
doc.getElementsByName("u_pass").Item(0).Value = password
doc.getElementsByClassName("btn").Item(0).Click

Do While ie.ReadyState <> READYSTATE_COMPLETE Or ie.Busy: DoEvents: Loop 'Wait server to respond
Set doc = ie.Document
'Add a check to confirm you aren't on the same page
ie.Quit
Set ie = Nothing
LogIn = True
Exit Function

loginFail:
MsgBox "There was an issue logging in. Please try again."
logIntoGradeBook = False
End Function

Note that the site I was dealing with was set up poorly and so I needed to switch to GetElementsByName and GetElementsByClassName to get access to what I needed. You may be fine with IDs.

Scraping Data from Website using vba doesn´t work

Since it's not possible to test the website on my own, the code below might not be the best way to do it but it should work:

Sub Kickbase()

Dim IE As New SHDocVw.InternetExplorer
Dim HTMLdoc As MSHTML.HTMLDocument
Dim HTMLPlayers As Object
Dim i As Integer
Dim firstName As String

IE.Visible = True
IE.navigate "https://play.kickbase.com/transfermarkt/kaufen"

Do While IE.readyState <> READYSTATE_COMPLETE
DoEvents
Loop

Application.Wait (Now + TimeValue("0:00:10"))

Set HTMLdoc = IE.document
Set HTMLPlayers = HTMLdoc.getElementsByClassName("playerName")

For i = 0 To HTMLPlayers(0).getElementsByClassName("firstName").Length - 1

firstName = Trim$(HTMLPlayers(0).getElementsByClassName("firstName")(i).innerText)
If firstName = vbNullString Then firstName = "no_value"

Debug.Print firstName
Next i

'=== Optional depending on your use case, remember to close IE or else it will remain there ===
'IE.Quit
'Set IE = Nothing

End Sub

Extracting Data from URL VBA getting IE not suppoting

Okay, try this to get the title and votes from that site using vba in combination with selenium.

Sub FetchInfo()
Dim driver As Object, oTitle As Object
Dim oVotes As Object

Set driver = CreateObject("Selenium.ChromeDriver")

driver.get "https://www.justdial.com/Agra/Yogi-General-Store-Opp-Eclave-Satiudum-Sadar-Bazaar/0562P5612-5612-120207212812-H5I2_BZDET"
Set oTitle = driver.FindElementByCss("span.item > span", Raise:=False, timeout:=10000)
Set oVotes = driver.FindElementByCss("span.rtngsval > span.votes", Raise:=False, timeout:=10000)
Debug.Print oTitle.Text, oVotes.Text
End Sub

Scraping a table from a website using VBA

You can avoid a browser and use xmlhttp to get the page content, then select the table element by its class (there is no id to use and class is the next fastest selector after id) and then loop the rows and columns writing out to sheet.

Option Explicit
Public Sub GetTable()
Dim html As MSHTML.HTMLDocument, hTable As Object, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New MSHTML.HTMLDocument '< VBE > Tools > References > Microsoft Scripting Runtime
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://www.bkam.ma/Marches/Principaux-indicateurs/Marche-obligataire/Marche-des-bons-de-tresor/Marche-secondaire/Taux-de-reference-des-bons-du-tresor?date=13%2F02%2F2019&block=e1d6b9bbf87f86f8ba53e8518e882982#address-c3367fcefc5f524397748201aee5dab8-e1d6b9bbf87f86f8ba53e8518e882982", False
.send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".dynamic_contents_ref_12")
Dim td As Object, tr As Object, th As Object, r As Long, c As Long
For Each tr In hTable.getElementsByTagName("tr")
r = r + 1: c = 1
For Each th In tr.getElementsByTagName("th")
ws.Cells(r, c) = th.innerText
c = c + 1
Next
For Each td In tr.getElementsByTagName("td")
ws.Cells(r, c) = td.innerText
c = c + 1
Next
Next
End Sub

Scraping data from website with dynamic array function in vba

You need to pick up the right table given they are nested so change the index to 3. Otherwise, you are picking up the shared parent and thus all the listings are in fact within the one child element hence your current output.

Then you need to adjust your code to skip the first row.

N.B. You don't actually need IE for this as the content you want is static. You can use XMLHTTP. And you are writing out data to a different sheet than the one you end format.

Sub CopyFromHKAB()
Dim ie As Object, btnmore As Object, tbl As Object
Dim rr As Object, cc As Object, r As Integer, c As Integer, i As Integer, j As Integer

ThisWorkbook.Sheets("data").UsedRange.Clear

Set ie = CreateObject("internetexplorer.application")
With ie
.Visible = True
.navigate "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0"

Do
DoEvents
Loop While .readyState <> 4 Or .Busy


Set tbl = .document.getElementsByClassName("etxtmed")(3)

End With

'get data from table
r = tbl.Rows.Length - 1
c = tbl.Rows(1).Cells.Length - 1

ReDim arr(0 To r, 0 To c)

Set rr = tbl.Rows

For i = 1 To r

Set cc = rr(i).Cells
For j = 0 To c
arr(i - 1, j) = cc(j).innertext
Next

Next

ie.Quit

'Application.ScreenUpdating = False
ThisWorkbook.Sheets("Sheet1").Cells(1, 1).Resize(r + 1, c + 1) = arr

With ThisWorkbook.Worksheets("data")
.UsedRange.WrapText = False
.Columns.AutoFit
End With

End Sub

I would consider switching to XHR to avoid overhead of browser, and using querySelectorAll to allow for using a css selector list to target only the nodes of interest

Option Explicit

Public Sub GetHKABInfo()
'tools > references > Microsoft HTML Object Library
Dim html As MSHTML.HTMLDocument, xhr As Object

Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument

With xhr
.Open "GET", "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0", False
.setRequestHeader "User-Agent", "Safari/537.36"
.send
html.body.innerHTML = .responseText
End With

Dim arr() As Variant, nodes As MSHTML.IHTMLDOMChildrenCollection, i As Long

Set nodes = html.querySelectorAll(".etxtmed .etxtmed td")

ReDim arr(1 To nodes.Length - 1)

For i = LBound(arr) To UBound(arr)
arr(i) = nodes.Item(i).innertext
Next

ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).Resize(UBound(arr), 1) = Application.Transpose(arr)

End Sub

scraping data from website using vba - problem

This is a single node Set list2 = .Document.querySelectorAll("#main_table"). Instead, assuming same structure for all results use something like:

Dim i As Long, line1 As String, line2 As String, address As String

Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")

For i = 0 To list2.Length - 1
line1 = list2.Item(i).NextSibling.NextSibling.NodeValue
line2 = list2.Item(i).NextSibling.NextSibling.NextSibling.NodeValue
address = line1 & " " & line2 'apply string cleaning here
Next

This targets initially the hyperlinks for each result, then moves across the br elements with nextSibling to get the address line 1 and 2. You will need to write some string cleaning on the address variable.

If you decide to click each hyperlink, then on the detailed info page use .document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText to retrieve the full address.

An example of navigating to each page (check urls retrieved are complete and don't require a prefix)

Dim i As Long, address As String, urls(), numLinks As Long

Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
numLinks = List.Length - 1
ReDim urls(0 To numLinks)

For i = 0 To numLinks
urls(i) = list2.Item(i).href
Next

For i = 0 To numLinks
.navigate2 urls(i)
While .Busy Or .ReadyState <> 4: DoEvents: Wend
'time loop maybe goes here
address = .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
Debug.Print address
Next

Scraping Webpage Tables Data Using VBA

Try to use F12 developer tools to check the Table HTML elements, we can see that there only have one <table> tag and one <tbody> element, in the tbody, the first row is the header row, others are the data row. In the header row, we can see that the <th> elements don't contain the <tr> tag

Sample Image

 Set hTable = doc.getElementsByTagName("table")

y = 2 'Column B in Excel
z = 3 'Row 3 in Excel
For Each tb In hTable
Set hHead = tb.getElementsByTagName("th")
For Each hh In hHead
Set hTR = hh.getElementsByTagName("tr")
For Each Tr In hTR

So, if we using the above code, after finding the <th> elements, it will not go deep to loop through the table.

Try to refer the following code:

Sub Test()
Dim IE As Object

Sheets("Sheet1").Select
Dim i As Long, strText As String

'Dim doc As Object, hTable As Object, hBody As Object, hTR As Object, hTD As Object
'Dim tb As Object, bb As Object, tr As Object, Td As Object

Dim y As Long, z As Long, wb As Excel.Workbook, ws As Excel.Worksheet

'Shell "RunDll32.exe Inetcpl.cpl,ClearMyTracksByProcess 11"

Set wb = Excel.ActiveWorkbook
Set ws = wb.ActiveSheet

Set IE = CreateObject("InternetExplorer.Application")
my_url = "https://www1.nseindia.com/products/content/equities/equities/eq_security.htm"

With IE
.Visible = True
.navigate my_url
.Top = 50
.Left = 530
.Height = 800
.Width = 800

Do Until Not IE.busy And IE.readyState = 4
DoEvents
Loop

End With
' Input the userid and password
'ie.document.getElementById("symbol").Value = Worksheets("Sheet1").Range("B1")
IE.document.getElementById("symbol").Value = "BAJFINANCE"
IE.document.getElementById("dateRange").selectedIndex = "4"
IE.document.getElementById("get").Click


While IE.busy
DoEvents
Wend

Set doc = IE.document

y = 2
z = 3

Dim table As Object, tbody As Object, datarow As Object, thlist As Object, trlist As Object

Application.Wait Now + TimeValue("00:00:02")

'find the tbody. Since it only conatin one table and tbody
Set tbody = IE.document.getElementsByTagName("table")(0).getElementsByTagName("tbody")(0)
'find tha theader
Set thlist = tbody.getElementsByTagName("tr")(0).getElementsByTagName("th")

'Debug.Print thlist.Length

'loop through the header column and capture the value.
Dim ii As Integer
For ii = 0 To thlist.Length - 1
ws.Cells(z, y).Value = thlist(ii).innerText
y = y + 1
Next ii

'get all data row
Set datarow = tbody.getElementsByTagName("tr")

'init the data row index and column index.
y = 2
z = 4

'loop through the data row and get all td. and then capture the value.
Dim jj As Integer
Dim datarowtdlist As Object

For jj = 1 To datarow.Length - 1
Set datarowtdlist = datarow(jj).getElementsByTagName("td")

'the x variable is used to set the column index.
Dim hh As Integer, x As Integer
x = y
For hh = 0 To datarowtdlist.Length - 1
ws.Cells(z, x).Value = datarowtdlist(hh).innerText
x = x + 1
Next hh
z = z + 1
Next jj

Set IE = Nothing

End Sub

The result:

Sample Image



Related Topics



Leave a reply



Submit