Scraping from Aspx Website Using R

Scraping from aspx website using R

require(httr)
require(XML)

basePage <- "http://capitol.hawaii.gov"

# a handle lets httr re-use the same connection and session cookies across requests
h <- handle(basePage)

# initial request to establish the session
GET(handle = h)

res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")

# parse the content and find rows whose status column contains "Transmitted to Governor"
resXML <- htmlParse(content(res, as = "text"))
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]')
appRows <- sapply(resTable, xmlValue)
include <- grepl("Transmitted to Governor", appRows)
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href')

appUrls <- resUrls[include]

# look at just the first

res <- GET(handle = h, path = appUrls[1])

resXML <- htmlParse(content(res, as = "text"))

xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)

[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,
Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,
Tokioka voting no (4) and none excused (0)."

Let the httr package handle all the background work (connection re-use and session cookies) by setting up a handle.
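If you want to confirm that the handle really is carrying session state between requests, you can inspect the cookies attached to a response; a quick optional check (not part of the original answer):

cookies(res)   # lists the cookies httr is re-using on this handle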

If you want to run over all 92 links:

# get all the links returned as a list (this will take some time)
# print statement included as a sanity check on progress
res <- lapply(appUrls, function(x){
  print(sprintf("Got url no. %d", which(appUrls %in% x)))
  GET(handle = h, path = x)
})
resXML <- lapply(res, function(x){htmlParse(content(x, as = "text"))})
appString <- sapply(resXML, function(x){
  xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
})

head(appString)

> head(appString)
$href
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 1 Excused: Ige."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."

web scraping aspx web page with R

The comments above are correct: the HTML is populated dynamically, so the rvest library will not work here. If you load the web page with the developer tools open and examine the files that are downloaded, there are a couple of files of XHR type. Among them, the one named FlightTracker.ashx is a JSON file containing the information you are requesting.

Once that file and its request URL are determined, it is just a matter of making an httr request and parsing the JSON:

library(httr)
library(jsonlite)

url <- 'http://www.phl.org/_layouts/15/Fuseideas.PHL.SharePoint/FlightTrackerXml.ashx?dir=A'
flightdata <- GET(url)

# content() returns the raw JSON text, which fromJSON() parses into R objects
output <- fromJSON(content(flightdata, as = "text"), flatten = FALSE)

FYI: you may also want to look at this file:

'http://www.phl.org/Style%20Library/PHL/Scripts/Angular/iata-data.jsn', which contains information on airlines' and airports' abbreviations, names and links.
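If you want to pull that lookup file as well, the same GET-then-fromJSON pattern should work. A minimal sketch (the structure of the file is not documented here, so str() is only used to take a first look):

iata_url <- 'http://www.phl.org/Style%20Library/PHL/Scripts/Angular/iata-data.jsn'
iata <- fromJSON(content(GET(iata_url), as = "text"), flatten = FALSE)
str(iata, max.level = 1)   # inspect the top-level structure before joining on codes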

Scraping web page that is only accessible after submitting aspx form

This is tricky, but possible.

The first difficulty you have is that when you send a GET request (via html_session) to the url "https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx", you are sending it without any session cookies. This makes the server redirect you to a different page, "https://nces.ed.gov/ipeds/use-the-data", and it is this page that you are seeing in your variable sesh.

However, since rvest (actually httr underneath rvest) re-uses session handles, all you need to do to overcome this problem is navigate to the login page, which allows httr to pick up the session cookies you need to browse as an anonymous user.

Here, we will also set our user agent to Firefox.

library(httr)
library(rvest)
library(tibble)

url1 <- "https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=8"
url2 <- "https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx"

UA <- "Mozilla/5.0 (Windows NT 6.1; rv:75.0) Gecko/20100101 Firefox/75.0"

html <- GET(url1, user_agent(UA))  # visit the login page first to pick up the session cookies
html <- GET(url2, user_agent(UA))  # this request now reaches the data files page
page <- html %>% read_html()
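If you want to confirm that the anonymous-session cookies were actually picked up before going any further, httr's cookies() helper can list them; a small optional check:

cookies(html)   # the cookies collected on the session so far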

Now page contains the page with the form that you want to submit. And this is where we come to the second difficulty. The easiest way to send a form is with rvest::submit_form(), but that doesn't seem to work because not all the fields are complete. We therefore need to build the form manually using rvest's scraping tools:

form <- list(`__VIEWSTATE` = page %>%
               html_node(xpath = "//input[@name='__VIEWSTATE']") %>%
               html_attr("value"),
             `__VIEWSTATEGENERATOR` = page %>%
               html_node(xpath = "//input[@name='__VIEWSTATEGENERATOR']") %>%
               html_attr("value"),
             `__EVENTVALIDATION` = page %>%
               html_node(xpath = "//input[@name='__EVENTVALIDATION']") %>%
               html_attr("value"),
             `ctl00$contentPlaceHolder$ddlYears` = "-1",
             `ddlSurveys` = "-1",
             `ctl00$contentPlaceHolder$ibtnContinue.x` = sample(50, 1),
             `ctl00$contentPlaceHolder$ibtnContinue.y` = sample(20, 1))
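As an aside, if you would rather not list the hidden fields one by one, you could scrape every hidden input in a single pass and then add the visible fields on top. A sketch, under the assumption that the hidden inputs above are the only ones the form needs:

# collect all hidden inputs as a named list (name -> value), then append the visible fields
hidden <- html_nodes(page, "input[type='hidden']")
form_auto <- setNames(as.list(html_attr(hidden, "value")), html_attr(hidden, "name"))
form_auto[["ctl00$contentPlaceHolder$ddlYears"]] <- "-1"
form_auto[["ddlSurveys"]] <- "-1"
form_auto[["ctl00$contentPlaceHolder$ibtnContinue.x"]] <- sample(50, 1)
form_auto[["ctl00$contentPlaceHolder$ibtnContinue.y"]] <- sample(20, 1)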

We can now submit this form, but before we do so, we need to add some headers, without which the server will throw an HTTP 500:

Headers <- add_headers(`Accept-Encoding` = "gzip, deflate, br",
                       `Accept-Language` = "en-GB,en;q=0.5",
                       `Connection` = "keep-alive",
                       `Host` = "nces.ed.gov",
                       `Origin` = "https://nces.ed.gov",
                       `Referer` = url2,
                       `Upgrade-Insecure-Requests` = "1")

Finally, there is a cookie that is normally added via javascript that we will need to add manually:

Cookies <- set_cookies(setNames(c(cookies(html)$value, "true"),
                                c(cookies(html)$name, "fromIpeds")))

Now we can POST the form with the correct body, headers and cookies to get the page you wanted:

Result <- POST(url2, body = form, user_agent(UA), Headers, Cookies)
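Before parsing, a quick optional check that the POST actually succeeded (the missing headers described above would show up here as an HTTP 500):

status_code(Result)   # expect 200 on success
http_error(Result)    # expect FALSE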

You can now scrape this page however you like. As an example, I will show that the text of the results table can be scraped quite easily:

Result %>%
  read_html() %>%
  html_node("#contentPlaceHolder_tblResult") %>%
  html_table() %>%
  as_tibble()
#> # A tibble: 1,090 x 7
#> Year Survey Title `Data File` `Stata Data Fil~ Programs Dictionary
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2018 Institut~ Directory i~ HD2018 HD2018_STATA SPSS, S~ Dictionary
#> 2 2018 Institut~ Educational~ IC2018 IC2018_STATA SPSS, S~ Dictionary
#> 3 2018 Institut~ Student cha~ IC2018_AY IC2018_AY_STATA SPSS, S~ Dictionary
#> 4 2018 Institut~ Student cha~ IC2018_PY IC2018_PY_STATA SPSS, S~ Dictionary
#> 5 2018 Institut~ Response st~ FLAGS2018 FLAGS2018_STATA SPSS, S~ Dictionary
#> 6 2018 12-Month~ 12-month un~ EFFY2018 EFFY2018_STATA SPSS, S~ Dictionary
#> 7 2018 12-Month~ 12-month in~ EFIA2018 EFIA2018_STATA SPSS, S~ Dictionary
#> 8 2018 12-Month~ Response st~ FLAGS2018 FLAGS2018_STATA SPSS, S~ Dictionary
#> 9 2018 Admissio~ Admission c~ ADM2018 ADM2018_STATA SPSS, S~ Dictionary
#> 10 2018 Admissio~ Response st~ FLAGS2018 FLAGS2018_STATA SPSS, S~ Dictionary
#> # ... with 1,080 more rows

Created on 2020-03-31 by the reprex package (v0.3.0)

Scraping in R, aspx form don't know how to get data

The main issue is that this is an ASP.NET web site. When you select a row, that most likely raises a server-side event. You might be able to write some JavaScript to select a row, but the next step is even more of a challenge: once a row is selected, you have to click a button, and that button runs server-side code which reads the selected row value. This is unlike, say, a simple web site driven by hyperlinks.

ASP.NET sites are driven almost entirely by VB.NET or C# code; they generally do not rely on plain hyperlinks, or even on parameters in the URL.

So, after you select a row (perhaps possible in JavaScript), you would then have to click the details button. This, again, can be done with JavaScript.

Say, in jQuery like this:

$('#NameOfButton').click();

ASP.NET sites, in other words, do not use simple HTML markup and plain hyperlinks to drive the web site. There are no "links" for each row; there is only code on the server side that pulls the data from the database, renders that information, and then sends it down as HTML markup.

The bottom line?
The site is not simply HTML with hyperlinks you can follow. When you click that button, the code-behind (written in C# or VB.NET) runs on the server; no markup or client-side JavaScript is involved in that step, only server-side code.

This means that aspx web sites are code-behind driven and, as a result, rather difficult to scrape in an automated fashion. You can grab the page you are on, but since there are no hyperlinks to the additional data (such as the details), you do not have a simple URL to follow.

Worse yet, the setup code (what runs when you select a single row) usually has to execute first: the "details" button only works if all values have been set up correctly beforehand. Note also that the details page has no parameters in its URL, so the correct code-behind must run before that second page launches. Very likely the second page also checks that the request came from the same site, so you cannot simply type in a URL for it and expect it to work.

In fact, if you look even closer: when you hit the details button, the page reloads and renders what is clearly a whole new page and layout.

But note that the URL does not change, and they are not even using an iframe for this.

This is because they are using what is called a server-side redirect. The tell-tale sign is that the URL stays the same while the page layout is completely different: the code-behind navigated to a whole new page on the server and sent the result down to the client, so the browser never performed the navigation itself.

Because the navigation happened in server-side code, the server can send anything it wants to the client, including an entirely new page, and you never see the URL change. Again, this is typical of ASP.NET systems, where server-side code drives the web site and there is very little client-side code.

You "might" be able to automate scraping. But you would need some custom code to select a given row, and then some code to click the details button. And that's going to be a REAL challenge, since any changes to the web page code (by you) also tend to be check for, and not allow server side.

The only practical scraping approach would be to drive a whole instance of a web browser, navigate to the page that displays the data, and then capture and parse the rendered page the way you are already doing for the main page.
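Staying within R, one way to do that is to drive a real browser with RSelenium. The sketch below is only illustrative: the URL and CSS selectors are hypothetical placeholders and would have to be replaced with the ones from the actual page.

library(RSelenium)

# start a browser session (assumes a working Selenium/driver setup on this machine)
rd    <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rd$client

remDr$navigate("https://example.com/SomePage.aspx")   # hypothetical URL

# click a row so the server-side row-select event fires
row <- remDr$findElement(using = "css selector", "table tr:nth-child(2)")   # placeholder selector
row$clickElement()

# then click the details button (same idea as the jQuery example above)
btn <- remDr$findElement(using = "css selector", "#NameOfButton")           # placeholder selector
btn$clickElement()

# grab the re-rendered details page and parse it the same way as the main page
details_html <- remDr$getPageSource()[[1]]

remDr$close()
rd$server$stop()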


