Scraping data from website using vba
There are several ways of doing this. This is an answer that I write hoping that all the basics of Internet Explorer automation will be found when browsing for the keywords "scraping data from website", but remember that nothing's worth as your own research (if you don't want to stick to pre-written codes that you're not able to customize).
Please note that this is one way, that I don't prefer in terms of performance (since it depends on the browser speed) but that is good to understand the rationale behind Internet automation.
1) If I need to browse the web, I need a browser! So I create an Internet Explorer browser:
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
2) I ask the browser to browse the target webpage. Through the use of the property ".Visible", I decide if I want to see the browser doing its job or not. When building the code is nice to have Visible = True
, but when the code is working for scraping data is nice not to see it everytime so Visible = False
.
With appIE
.Navigate "http://uk.investing.com/rates-bonds/financial-futures"
.Visible = True
End With
3) The webpage will need some time to load. So, I will wait meanwhile it's busy...
Do While appIE.Busy
DoEvents
Loop
4) Well, now the page is loaded. Let's say that I want to scrape the change of the US30Y T-Bond:
What I will do is just clicking F12 on Internet Explorer to see the webpage's code, and hence using the pointer (in red circle) I will click on the element that I want to scrape to see how can I reach my purpose.
5) What I should do is straight-forward. First of all, I will get by the ID property the tr
element which is containing the value:
Set allRowOfData = appIE.document.getElementById("pair_8907")
Here I will get a collection of td
elements (specifically, tr
is a row of data, and the td
are its cells. We are looking for the 8th, so I will write:
Dim myValue As String: myValue = allRowOfData.Cells(7).innerHTML
Why did I write 7 instead of 8? Because the collections of cells starts from 0, so the index of the 8th element is 7 (8-1). Shortly analysing this line of code:
.Cells()
makes me access thetd
elements;innerHTML
is the property of the cell containing the value we look for.
Once we have our value, which is now stored into the myValue
variable, we can just close the IE browser and releasing the memory by setting it to Nothing:
appIE.Quit
Set appIE = Nothing
Well, now you have your value and you can do whatever you want with it: put it into a cell (Range("A1").Value = myValue
), or into a label of a form (Me.label1.Text = myValue
).
I'd just like to point you out that this is not how StackOverflow works: here you post questions about specific coding problems, but you should make your own search first. The reason why I'm answering a question which is not showing too much research effort is just that I see it asked several times and, back to the time when I learned how to do this, I remember that I would have liked having some better support to get started with. So I hope that this answer, which is just a "study input" and not at all the best/most complete solution, can be a support for next user having your same problem. Because I have learned how to program thanks to this community, and I like to think that you and other beginners might use my input to discover the beautiful world of programming.
Enjoy your practice ;)
Login to web page and scraping data using VBA
The first issue seems to be that you have a mismatched comma on the third line; however, seeing as you're are complaining about an automation error I think that may be just a typo on this site.
I can't see your internal website so I'm just guessing but I suspect the next issue is that there is no element with ID "txtPwd". You can check to confirm by pressing Ctrl-Shift-C and then selecting the username and password entry boxes.
Finally, depending on how the site is set up your .document.forms(0).submit may not work. You may need to find the ID for the submit button class submit. Below is a function I created a while back for such a task:
Function logIn(userName As String, password As String) As Boolean
'This routine logs into the grade book using given credentials
Dim ie As New InternetExplorer
Dim doc As HTMLDocument
On Error GoTo loginFail
ie.Navigate "[website here]"
'ie.Visible = True
Do While ie.ReadyState <> READYSTATE_COMPLETE Or ie.Busy: DoEvents: Loop 'Wait server to respond
Set doc = ie.Document
doc.getElementsByName("u_name").Item(0).Value = userName 'These may be different for you
doc.getElementsByName("u_pass").Item(0).Value = password
doc.getElementsByClassName("btn").Item(0).Click
Do While ie.ReadyState <> READYSTATE_COMPLETE Or ie.Busy: DoEvents: Loop 'Wait server to respond
Set doc = ie.Document
'Add a check to confirm you aren't on the same page
ie.Quit
Set ie = Nothing
LogIn = True
Exit Function
loginFail:
MsgBox "There was an issue logging in. Please try again."
logIntoGradeBook = False
End Function
Note that the site I was dealing with was set up poorly and so I needed to switch to GetElementsByName and GetElementsByClassName to get access to what I needed. You may be fine with IDs.
Scraping Data from Website using vba doesn´t work
Since it's not possible to test the website on my own, the code below might not be the best way to do it but it should work:
Sub Kickbase()
Dim IE As New SHDocVw.InternetExplorer
Dim HTMLdoc As MSHTML.HTMLDocument
Dim HTMLPlayers As Object
Dim i As Integer
Dim firstName As String
IE.Visible = True
IE.navigate "https://play.kickbase.com/transfermarkt/kaufen"
Do While IE.readyState <> READYSTATE_COMPLETE
DoEvents
Loop
Application.Wait (Now + TimeValue("0:00:10"))
Set HTMLdoc = IE.document
Set HTMLPlayers = HTMLdoc.getElementsByClassName("playerName")
For i = 0 To HTMLPlayers(0).getElementsByClassName("firstName").Length - 1
firstName = Trim$(HTMLPlayers(0).getElementsByClassName("firstName")(i).innerText)
If firstName = vbNullString Then firstName = "no_value"
Debug.Print firstName
Next i
'=== Optional depending on your use case, remember to close IE or else it will remain there ===
'IE.Quit
'Set IE = Nothing
End Sub
Extracting Data from URL VBA getting IE not suppoting
Okay, try this to get the title and votes from that site using vba in combination with selenium.
Sub FetchInfo()
Dim driver As Object, oTitle As Object
Dim oVotes As Object
Set driver = CreateObject("Selenium.ChromeDriver")
driver.get "https://www.justdial.com/Agra/Yogi-General-Store-Opp-Eclave-Satiudum-Sadar-Bazaar/0562P5612-5612-120207212812-H5I2_BZDET"
Set oTitle = driver.FindElementByCss("span.item > span", Raise:=False, timeout:=10000)
Set oVotes = driver.FindElementByCss("span.rtngsval > span.votes", Raise:=False, timeout:=10000)
Debug.Print oTitle.Text, oVotes.Text
End Sub
Scraping a table from a website using VBA
You can avoid a browser and use xmlhttp to get the page content, then select the table element by its class (there is no id to use and class is the next fastest selector after id) and then loop the rows and columns writing out to sheet.
Option Explicit
Public Sub GetTable()
Dim html As MSHTML.HTMLDocument, hTable As Object, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New MSHTML.HTMLDocument '< VBE > Tools > References > Microsoft Scripting Runtime
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://www.bkam.ma/Marches/Principaux-indicateurs/Marche-obligataire/Marche-des-bons-de-tresor/Marche-secondaire/Taux-de-reference-des-bons-du-tresor?date=13%2F02%2F2019&block=e1d6b9bbf87f86f8ba53e8518e882982#address-c3367fcefc5f524397748201aee5dab8-e1d6b9bbf87f86f8ba53e8518e882982", False
.send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".dynamic_contents_ref_12")
Dim td As Object, tr As Object, th As Object, r As Long, c As Long
For Each tr In hTable.getElementsByTagName("tr")
r = r + 1: c = 1
For Each th In tr.getElementsByTagName("th")
ws.Cells(r, c) = th.innerText
c = c + 1
Next
For Each td In tr.getElementsByTagName("td")
ws.Cells(r, c) = td.innerText
c = c + 1
Next
Next
End Sub
Scraping data from website with dynamic array function in vba
You need to pick up the right table given they are nested so change the index to 3. Otherwise, you are picking up the shared parent and thus all the listings are in fact within the one child element hence your current output.
Then you need to adjust your code to skip the first row.
N.B. You don't actually need IE for this as the content you want is static. You can use XMLHTTP. And you are writing out data to a different sheet than the one you end format.
Sub CopyFromHKAB()
Dim ie As Object, btnmore As Object, tbl As Object
Dim rr As Object, cc As Object, r As Integer, c As Integer, i As Integer, j As Integer
ThisWorkbook.Sheets("data").UsedRange.Clear
Set ie = CreateObject("internetexplorer.application")
With ie
.Visible = True
.navigate "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0"
Do
DoEvents
Loop While .readyState <> 4 Or .Busy
Set tbl = .document.getElementsByClassName("etxtmed")(3)
End With
'get data from table
r = tbl.Rows.Length - 1
c = tbl.Rows(1).Cells.Length - 1
ReDim arr(0 To r, 0 To c)
Set rr = tbl.Rows
For i = 1 To r
Set cc = rr(i).Cells
For j = 0 To c
arr(i - 1, j) = cc(j).innertext
Next
Next
ie.Quit
'Application.ScreenUpdating = False
ThisWorkbook.Sheets("Sheet1").Cells(1, 1).Resize(r + 1, c + 1) = arr
With ThisWorkbook.Worksheets("data")
.UsedRange.WrapText = False
.Columns.AutoFit
End With
End Sub
I would consider switching to XHR to avoid overhead of browser, and using querySelectorAll
to allow for using a css selector list to target only the nodes of interest
Option Explicit
Public Sub GetHKABInfo()
'tools > references > Microsoft HTML Object Library
Dim html As MSHTML.HTMLDocument, xhr As Object
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0", False
.setRequestHeader "User-Agent", "Safari/537.36"
.send
html.body.innerHTML = .responseText
End With
Dim arr() As Variant, nodes As MSHTML.IHTMLDOMChildrenCollection, i As Long
Set nodes = html.querySelectorAll(".etxtmed .etxtmed td")
ReDim arr(1 To nodes.Length - 1)
For i = LBound(arr) To UBound(arr)
arr(i) = nodes.Item(i).innertext
Next
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).Resize(UBound(arr), 1) = Application.Transpose(arr)
End Sub
scraping data from website using vba - problem
This is a single node Set list2 = .Document.querySelectorAll("#main_table")
. Instead, assuming same structure for all results use something like:
Dim i As Long, line1 As String, line2 As String, address As String
Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
For i = 0 To list2.Length - 1
line1 = list2.Item(i).NextSibling.NextSibling.NodeValue
line2 = list2.Item(i).NextSibling.NextSibling.NextSibling.NodeValue
address = line1 & " " & line2 'apply string cleaning here
Next
This targets initially the hyperlinks for each result, then moves across the br elements with nextSibling to get the address line 1 and 2. You will need to write some string cleaning on the address variable.
If you decide to click each hyperlink, then on the detailed info page use .document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
to retrieve the full address.
An example of navigating to each page (check urls retrieved are complete and don't require a prefix)
Dim i As Long, address As String, urls(), numLinks As Long
Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
numLinks = List.Length - 1
ReDim urls(0 To numLinks)
For i = 0 To numLinks
urls(i) = list2.Item(i).href
Next
For i = 0 To numLinks
.navigate2 urls(i)
While .Busy Or .ReadyState <> 4: DoEvents: Wend
'time loop maybe goes here
address = .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
Debug.Print address
Next
Scraping Webpage Tables Data Using VBA
Try to use F12 developer tools to check the Table HTML elements, we can see that there only have one <table>
tag and one <tbody>
element, in the tbody, the first row is the header row, others are the data row. In the header row, we can see that the <th>
elements don't contain the <tr>
tag
Set hTable = doc.getElementsByTagName("table")
y = 2 'Column B in Excel
z = 3 'Row 3 in Excel
For Each tb In hTable
Set hHead = tb.getElementsByTagName("th")
For Each hh In hHead
Set hTR = hh.getElementsByTagName("tr")
For Each Tr In hTR
So, if we using the above code, after finding the <th>
elements, it will not go deep to loop through the table.
Try to refer the following code:
Sub Test()
Dim IE As Object
Sheets("Sheet1").Select
Dim i As Long, strText As String
'Dim doc As Object, hTable As Object, hBody As Object, hTR As Object, hTD As Object
'Dim tb As Object, bb As Object, tr As Object, Td As Object
Dim y As Long, z As Long, wb As Excel.Workbook, ws As Excel.Worksheet
'Shell "RunDll32.exe Inetcpl.cpl,ClearMyTracksByProcess 11"
Set wb = Excel.ActiveWorkbook
Set ws = wb.ActiveSheet
Set IE = CreateObject("InternetExplorer.Application")
my_url = "https://www1.nseindia.com/products/content/equities/equities/eq_security.htm"
With IE
.Visible = True
.navigate my_url
.Top = 50
.Left = 530
.Height = 800
.Width = 800
Do Until Not IE.busy And IE.readyState = 4
DoEvents
Loop
End With
' Input the userid and password
'ie.document.getElementById("symbol").Value = Worksheets("Sheet1").Range("B1")
IE.document.getElementById("symbol").Value = "BAJFINANCE"
IE.document.getElementById("dateRange").selectedIndex = "4"
IE.document.getElementById("get").Click
While IE.busy
DoEvents
Wend
Set doc = IE.document
y = 2
z = 3
Dim table As Object, tbody As Object, datarow As Object, thlist As Object, trlist As Object
Application.Wait Now + TimeValue("00:00:02")
'find the tbody. Since it only conatin one table and tbody
Set tbody = IE.document.getElementsByTagName("table")(0).getElementsByTagName("tbody")(0)
'find tha theader
Set thlist = tbody.getElementsByTagName("tr")(0).getElementsByTagName("th")
'Debug.Print thlist.Length
'loop through the header column and capture the value.
Dim ii As Integer
For ii = 0 To thlist.Length - 1
ws.Cells(z, y).Value = thlist(ii).innerText
y = y + 1
Next ii
'get all data row
Set datarow = tbody.getElementsByTagName("tr")
'init the data row index and column index.
y = 2
z = 4
'loop through the data row and get all td. and then capture the value.
Dim jj As Integer
Dim datarowtdlist As Object
For jj = 1 To datarow.Length - 1
Set datarowtdlist = datarow(jj).getElementsByTagName("td")
'the x variable is used to set the column index.
Dim hh As Integer, x As Integer
x = y
For hh = 0 To datarowtdlist.Length - 1
ws.Cells(z, x).Value = datarowtdlist(hh).innerText
x = x + 1
Next hh
z = z + 1
Next jj
Set IE = Nothing
End Sub
The result:
Related Topics
Transparent Half Circle Cut Out of a Div
Make Grid Container Fill Columns Not Rows
Alternate Table Row Color Using Css
Trouble With Content Security Policy
Grid Properties Not Working on Elements Inside Grid Container
Onclick on Option Tag Not Working on Ie and Chrome
Draw on Html5 Canvas Using a Mouse
Css Content Generation Before or After 'Input' Elements
Css Selector by Inline Style Attribute
Should I Use Px or Rem Value Units in My Css
How to Center Text Vertically With a Large Font-Awesome Icon
Why Is a Div With "Display: Table-Cell;" Not Affected by Margin
How to Convert an HTML Element to a Canvas Element
How to Create a Template in Html