Web Scraping Data Table with R Rvest

rvest: Scraping table from webpage

That site is blocked for me! I can't see anything there, but I can tell you, basically, it should be done like this.

The html_nodes() function selects every element in the document that matches a CSS selector (or XPath expression) and returns them as an xml_nodeset; from those nodes we can then pull out text, attributes, or whole tables with html_text(), html_attr(), or html_table().

library(rvest)

## Loading required package: xml2

# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"

scistarter_html <- read_html(URL)
scistarter_html

## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n \n \n <svg style="position: absolute; width: 0; he ...

We’re able to retrieve the same HTML code we saw in our browser. That isn’t useful by itself, but it does confirm the request works. Now we will begin filtering through the HTML to find the data we’re after.

The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.

This grabs all the anchor (<a>) nodes, i.e. the links on the page.

    scistarter_html %>%
      html_nodes("a") %>%
      head()

## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] <a href="/dashboard">My Account</a>
## [3] <a href="/finder" class="is-active">Project Finder</a>
## [4] <a href="/events">Event Finder</a>
## [5] <a href="/people-finder">People Finder</a>
## [6] <a href="#dialog-login" rel="modal:open">log in</a>

In a more complex example, we could use this to “crawl” the page, but that’s for another day.
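As a quick, hedged sketch of what that crawling could look like (not part of the original answer; the object names are illustrative):

    # collect the link targets so each one could be visited in turn
    links <- scistarter_html %>%
      html_nodes("a") %>%
      html_attr("href")

    # expand relative paths such as "/events" into full URLs before reading them
    full_links <- xml2::url_absolute(links, "https://scistarter.com")
    # lapply(full_links, read_html) would then fetch each linked page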

Every div on the page:

    scistarter_html %>%
      html_nodes("div") %>%
      head()

## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n < ...
## [3] <div class="nav-tools">\n <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n <div class="field">\n ...
## [5] <div class="field">\n <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n <d ...

… the nav-tools div. This selects by CSS class: div.nav-tools matches <div> elements whose class is nav-tools.

    scistarter_html %>%
      html_nodes("div.nav-tools") %>%
      head()

## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n <div class="nav-tools__search"> ...

We can call the nodes by id as follows.

    scistarter_html %>%
      html_nodes("div#project-listing") %>%
      head()

## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n \n ...

We can grab all the tables as follows:

    scistarter_html %>%
      html_nodes("table") %>%
      head()

## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
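To turn those nodes into data frames, we can pipe them into html_table(). A minimal sketch, assuming the project tables parse cleanly:

    # parse every matched <table> into a data frame
    project_tables <- scistarter_html %>%
      html_nodes("table") %>%
      html_table()

    length(project_tables)  # one data frame per table
    # project_tables[[1]]   # inspect the first one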

See the related link below for more info.

https://rpubs.com/Radcliffe/superbowl

Scraping a table over multiple pages with Rvest

One can use the last parameter in the URL, &start=, to iterate through the results page by page. The search results render 25 items per page, so the start values step by 25: 0, 25, 50, 75, 100, ...

We will obtain the first 5 pages of results, for a total of 125 transactions. Since the first page starts with &start=0, we assign a vector, startRows, to represent the starting row for each page.

We then use the vector to drive lapply() with an anonymous function that reads the data and manipulates it to remove the header row from each page of data read.

library(rvest)
library(dplyr)

start_date <- "1947-01-01"
end_date <- "2020-12-28"
css_selector <- ".datatable"
startRows <- c(0, 25, 50, 75, 100)

pages <- lapply(startRows, function(x) {
  # build the search URL for the page that starts at row x
  url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=",
                start_date, "&EndDate=", end_date,
                "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=", x)
  webpage <- xml2::read_html(url)
  data <- webpage %>%
    rvest::html_node(css = css_selector) %>%
    rvest::html_table() %>%
    as_tibble()
  # the first row of each scraped table holds the column headers;
  # promote it to the column names and drop it from the data
  colnames(data) <- data[1, ]
  data[-1, ]
})

data <- do.call(rbind, pages)
head(data, n = 10)

...and the output:

> head(data,n=10)
# A tibble: 10 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1947-08… Bombers … "" "• Jack Underman" fractured legs (in auto accide…
2 1948-02… Bullets … "• Harry Jeannette… "" broken rib (DTD) (date approxi…
3 1949-03… Capitols "" "• Horace McKinney /… personal reasons (DTD)
4 1949-11… Capitols "" "• Fred Scolari" fractured right cheekbone (out…
5 1949-12… Knicks "" "• Vince Boryla" mumps (out ~2 weeks)
6 1950-01… Knicks "• Vince Boryla" "" returned to lineup (date appro…
7 1950-10… Knicks "" "• Goebel Ritter / T… bruised ligaments in left ankl…
8 1950-11… Warriors "" "• Andy Phillip" lacerated foot (DTD)
9 1950-12… Celtics "" "• Andy Duncan (a)" fractured kneecap (out indefin…
10 1951-12… Bullets "" "• Don Barksdale" placed on IL
>

Verifying the results

We can verify the results by printing the first and last rows from each page, starting with last observation on page 1.

data[c(25,26,50,51,75,76,100,101,125),]

...and the output, which matches the content rendered on pages 1 - 5 of the search results when navigated manually on the website.

> data[c(25,26,50,51,75,76,100,101,125),]
# A tibble: 9 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1960-01-… Celtics "" "• Bill Sharma… sprained Achilles tendon (date approxima…
2 1960-01-… Celtics "" "• Jim Loscuto… sore back and legs (out indefinitely) (d…
3 1964-10-… Knicks "• Art Heyma… "" returned to lineup
4 1964-12-… Hawks "• Bob Petti… "" returned to lineup (date approximate)
5 1968-11-… Nets (ABA) "" "• Levern Tart" fractured right cheekbone (out indefinit…
6 1968-12-… Pipers (AB… "" "• Jim Harding" took leave of absence as head coach for …
7 1970-08-… Lakers "" "• Earnie Kill… dislocated left foot (out indefinitely)
8 1970-10-… Lakers "" "• Elgin Baylo… torn Achilles tendon (out for season) (d…
9 1972-01-… Cavaliers "• Austin Ca… "" returned to lineup

If we look at the last page in the table, we find that the maximum value for the page series is &start=61475. The R code to generate the entire sequence of pages (2460, which matches the number of pages listed in the search results on the website) is:

# generate the entire sequence of page start values
pages <- c(0, seq(from = 25, to = 61475, by = 25))

...and the output:

> head(pages)
[1] 0 25 50 75 100 125
> tail(pages)
[1] 61350 61375 61400 61425 61450 61475
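
In principle the same lapply() call used earlier could be run over this full vector; a hedged sketch (roughly 2,460 requests, so a short pause is added between requests out of politeness; the object names are illustrative):

all_pages <- lapply(pages, function(x) {
  url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=",
                start_date, "&EndDate=", end_date,
                "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=", x)
  Sys.sleep(1)  # pause between requests
  tbl <- xml2::read_html(url) %>%
    rvest::html_node(css = css_selector) %>%
    rvest::html_table() %>%
    as_tibble()
  colnames(tbl) <- tbl[1, ]   # promote the header row, as before
  tbl[-1, ]
})
all_data <- dplyr::bind_rows(all_pages)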

rvest -- how to scrape a table in a tab?

The data is pulled dynamically via an API call. You can pick up the share id from the initial page and pass it into the API call to get the data back as JSON. I altered the PageSize parameter of the API call to be large enough to return all likely results.

library(rvest)
library(stringr)
library(jsonlite)

r <- read_html('https://fundkis.com/en/funds/bnppf-privatesustainable-balanced/BE6294262298#navs')

share_id <- stringr::str_match(r %>% html_node('[name=ReactRiskPart]') %>% html_attr('props'), '"shareId": "(.*?)"')[,2]

api_url <- sprintf('https://fundkis.com/api/fkdb/navs/%s?PageIndex=0&PageSize=2000', share_id)

data <- jsonlite::read_json(api_url)

As dataframe:

library(rvest)
library(stringr)
library(jsonlite)
library(tidyverse)
library(purrr)

r <- read_html("https://fundkis.com/en/funds/bnppf-privatesustainable-balanced/BE6294262298#navs")

share_id <- stringr::str_match(r %>% html_node("[name=ReactRiskPart]") %>% html_attr("props"), '"shareId": "(.*?)"')[, 2]

api_url <- sprintf("https://fundkis.com/api/fkdb/navs/%s?PageIndex=0&PageSize=2000", share_id)

data <- jsonlite::read_json(api_url)

df <- map_dfr(data, data.frame) %>%
  mutate(Date = format.Date(NavDate)) %>%
  select(-c("FundShareId", "NavDate")) %>%
  rename(Currency = NavCurrencyISO, `Net Asset` = TotalAum, VL = Nav, `Nb Shares` = NbShares) %>%
  relocate(Date, Currency, VL, `Nb Shares`, `Net Asset`)

scraping an interactive table in R with rvest

So the problem you're facing is that rvest will read the source of a page, but it won't execute the javascript on the page. When I inspect the interactive table, I see

<textarea id="aw52-box-focus" class="aw-control-focus " tabindex="0" 
onbeforedeactivate="AW(this,event)" onselectstart="AW(this,event)"
onbeforecopy="AW(this,event)" oncut="AW(this,event)" oncopy="AW(this,event)"
onpaste="AW(this,event)" style="z-index: 1; width: 100%; height: 100%;">
</textarea>

but when I look at the page source, "aw52-box-focus" doesn't exist. This is because it's created as the page loads via javascript.

You have a couple of options to deal with this. The 'easy' one is to use RSelenium to load the page in an actual browser and then grab the element after it has loaded. The other option is to read through the javascript, see where it gets its data from, and tap into that directly rather than scraping the table.
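
A minimal sketch of the RSelenium route (hedged: it assumes a working local Selenium/driver setup, and the wait time is arbitrary):

library(RSelenium)
library(rvest)

# start a browser session (assumes the necessary driver is installed locally)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client
remDr$navigate("http://proximityone.com/cd114_2013_2014.htm")
Sys.sleep(5)  # give the javascript time to build the table

# hand the rendered page source to rvest once the table exists
rendered <- read_html(remDr$getPageSource()[[1]])
# rendered %>% html_node("#aw52-box-focus") ... would now find the element

remDr$close()
driver$server$stop()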

UPDATE

Turns out it's really easy to read the javascript - it's just loading a CSV file. The address is in plain text, http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv

The .csv doesn't have column headers, but those are in the <script> as well

var columns = [
"FirstNnme",
"LastName",
"Party",
"Feature",
"St",
"CD",
"State<br>CD",
"State<br>CD",
"Population<br>2013",
"Population<br>2014",
"PopCh<br>2013-14",
"%PopCh<br>2013-14",
"MHI<br>2013",
"MHI<br>2014",
"MHI<br>Change<br>2013-14",
"%MHI<br>Change<br>2013-14",
"MFI<br>2013",
"MFI<br>2014",
"MFI<br>Change<br>2013-14",
"%MFI<br>Change<br>2013-14",
"MHV<br>2013",
"MHV<br>2014",
"MHV<br>Change<br>2013-14",
"%MHV<br>Change<br>2013-14",

]

Programmatic Solution

Instead of digging through the javascript by hand (useful in case there are several such pages on this website you want), you can attempt this programmatically too. We read the page, get the <script> nodes, get the "text" (the script itself) and look for references to a csv file. Then we expand out the relative URL and read it in. This doesn't help with the column names, but those shouldn't be too hard to extract either (see the sketch after the code below).

library(rvest)

page <- read_html("http://proximityone.com/cd114_2013_2014.htm")

# keep only the <script> contents that mention a .csv file
scripts <- page %>%
  html_nodes("script") %>%
  html_text() %>%
  grep("\\.csv", ., value = TRUE)

# pull out the relative path ("../...csv") and expand it to a full URL
relCSV <- stringr::str_extract(scripts, "\\.\\./.*?csv")
fullCSV <- gsub("\\.\\.", "http://proximityone.com", relCSV)
data <- read.csv(fullCSV, header = FALSE)
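
And a hedged sketch of pulling the column names out of the same script text (it assumes the csv reference and the var columns array sit in the same <script>, and that the labels line up with the csv columns):

# extract the quoted labels from the "var columns = [...]" block
cols_block <- stringr::str_extract(scripts, "var columns = \\[[^\\]]*\\]")
col_names <- stringr::str_match_all(cols_block, '"(.*?)"')[[1]][, 2]
col_names <- gsub("<br>", " ", col_names)  # drop the HTML line breaks

# apply however many labels we have to the leading columns
n <- min(length(col_names), ncol(data))
names(data)[seq_len(n)] <- col_names[seq_len(n)]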

