Scraping JavaScript Websites in R

Scraping a JavaScript website in R

So, RSelenium is not the only answer (anymore). If you can install the PhantomJS binary (grab it from http://phantomjs.org/), you can use it to render the HTML and scrape the result with rvest (similar to the RSelenium approach, but it doesn't require Java):

library(rvest)

# write a small PhantomJS script that renders the page and prints its source
url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"

writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
  console.log(page.content); // print the rendered page source
  phantom.exit();
});", url), con = "scrape.js")

# run the script and capture the rendered HTML in a file
system("phantomjs scrape.js > scrape.html", intern = TRUE)

# extract the content you need
pg <- read_html("scrape.html")
pg %>% html_nodes("#utime") %>% html_text()

## [1] "10:20 AM, October 28, 2014"

Web scraping with R and rvest when there is JavaScript-rendered content in the web page

Here's what it would look like using RSelenium to get the page to load.

library(rvest)
library(RSelenium)

# start a Selenium server and a Chrome client
remDr <- rsDriver(browser = 'chrome', port = 4444L)
brow <- remDr[["client"]]

# the client returned by rsDriver is already open, so navigate straight away
brow$navigate("https://www.filmweb.no/kinotoppen/")

# pull the rendered page source and parse it
h <- brow$getPageSource()
h <- read_html(h[[1]])
h %>%
  html_nodes(".Kinotoppen_MovieTitle__2MFbT") %>%
  html_text()
# [1] "Spider-Man: No Way Home" "Clifford: Den store røde hunden" "Lise & Snøpels - Venner for alltid"
# [4] "Familien Voff - alle trenger en venn" "Nightmare Alley" "Snødronningen"
# [7] "Scream" "Bergman Island" "Trøffeljegerne fra Piemonte"
# [10] "Encanto"

Reading the content of a Javascript-rendered webpage into R

The data comes from an API call that returns JSON, so you can make the same GET request and then extract the usernames. Swap x$UserName for x$CustomerId if you want ids instead (see the sketch after the code).

library(jsonlite)

data <- jsonlite::read_json('https://www.etoro.com/sapi/rankings/rankings/?activeweeksmin=24&blocked=false&bonusonly=false&copiersmax=5000&copyblock=false&copyinvestmentpctmax=0&copytradespctmax=0&dailyddmin=-10&displayfullname=true&gainmax=100&gainmin=5&hasavatar=true&highleveragepctmax=10&isfund=false&istestaccount=false&lastactivitymax=14&longpospctmax=80&lowleveragepctmin=50&maxdailyriskscoremax=5&maxmonthlyriskscoremax=5&maxmonthlyriskscoremin=1&optin=true&page=1&pagesize=20&period=OneYearAgo&profitableweekspctmin=50&sort=-gain&tradesmin=20&verified=true&weeklyddmin=-20&winratiomax=85')

users <- lapply(data$Items, function(x) {x$UserName})
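
The same pattern works for the ids mentioned above. A minimal sketch, using vapply to get a flat character vector instead of a list:

# extract CustomerId instead of UserName, flattened to a character vector
ids <- vapply(data$Items, function(x) as.character(x$CustomerId), character(1))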

Scraping Javascript-Rendered Content in R from a Webpage without Unique URL

You are right: the contents of the page are updated by JavaScript via an AJAX request, and the server returns a JSON string in response to an HTTP POST request. With POST requests, the server's response is determined not only by the URL you request, but also by the body of the message you send. In this case, the body is a simple form with three fields: gameName, which is always "LOTTO"; isAjax, which is always "true"; and drawNumber, which is the field you want to vary.

If you are using httr, you specify these fields as a named list in the body parameter of the POST function.

Once you have the response for each draw, you will want to parse the JSON into an R-friendly format such as a list or data frame, using a library such as jsonlite. From the structure of this particular JSON, it makes the most sense to extract the component $data$drawDetails and turn it into a one-row data frame. This allows you to bind several draws together into a single data frame.
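
For a single draw, the request looks like this (a minimal sketch; the URL and form fields are the same ones used in the full function below):

# POST the form for one draw and inspect the parsed JSON
res <- httr::POST(
  paste0("https://www.nationallottery.co.za/index.php",
         "?task=results.redirectPageURL&Itemid=265&",
         "option=com_weaver&controller=lotto-history"),
  body = list(gameName = "LOTTO", drawNumber = 2009, isAjax = "true")
)
str(jsonlite::fromJSON(httr::content(res, "text"))$data$drawDetails)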

Here is a function that does all that for you:

lotto_details <- function(draw_numbers) {
  do.call("rbind", lapply(draw_numbers, function(x) {
    res <- httr::POST(paste0("https://www.nationallottery.co.za/index.php",
                             "?task=results.redirectPageURL&",
                             "Itemid=265&option=com_weaver&",
                             "controller=lotto-history"),
                      body = list(gameName = "LOTTO", drawNumber = x, isAjax = "true"))
    as.data.frame(jsonlite::fromJSON(httr::content(res, "text"))$data$drawDetails)
  }))
}

Which you use like this:

lotto_details(2009:2012)
#> drawNumber drawDate nextDrawDate ball1 ball2 ball3 ball4 ball5 ball6
#> 1 2009 2020/04/01 2020/04/04 51 15 7 32 42 45
#> 2 2010 2020/04/04 2020/04/08 43 4 21 24 10 3
#> 3 2011 2020/04/08 2020/04/11 42 43 8 18 2 29
#> 4 2012 2020/04/11 2020/04/15 48 6 43 41 25 45
#> bonusBall div1Winners div1Payout div2Winners div2Payout div3Winners
#> 1 1 0 0 0 0 21
#> 2 22 0 0 0 0 31
#> 3 34 0 0 0 0 21
#> 4 38 1 10546013.8 0 0 28
#> div3Payout div4Winners div4Payout div5Winners div5Payout div6Winners
#> 1 8455.3 60 2348.7 1252 189 1786
#> 2 6004.3 71 2080.6 1808 137.3 2352
#> 3 8584.5 60 2384.6 1405 171.1 2079
#> 4 7676.4 62 2751.4 1389 206.3 1872
#> div6Payout div7Winners div7Payout div8Winners div8Payout rolloverAmount
#> 1 115.2 24664 50 19711 20 3809758.17
#> 2 91.7 35790 50 25981 20 5966533.86
#> 3 100.5 27674 50 21895 20 8055430.87
#> 4 133 28003 50 20651 20 0
#> rolloverNumber totalPrizePool totalSales estimatedJackpot
#> 1 2 6198036.67 9879655 6000000
#> 2 3 9073426.56 11696905 8000000
#> 3 4 10649716.37 10406895 10000000
#> 4 0 13280236.5 11610950 2000000
#> guaranteedJackpot drawMachine ballSet status winners millionairs
#> 1 0 RNG2 RNG published 47494 0
#> 2 0 RNG2 RNG published 66033 0
#> 3 0 RNG2 RNG published 53134 0
#> 4 0 RNG2 RNG published 52006 1
#> gpwinners wcwinners ncwinners ecwinners mpwinners lpwinners fswinners
#> 1 47494 0 0 0 0 0 0
#> 2 66033 0 0 0 0 0 0
#> 3 53134 0 0 0 0 0 0
#> 4 52006 0 0 0 0 0 0
#> kznwinners nwwinners
#> 1 0 0
#> 2 0 0
#> 3 0 0
#> 4 0 0

Created on 2020-04-13 by the reprex package (v0.3.0)

R / Rvest / RSelenium: scrape data from JS Sites

I've had no problems using RSelenium with the help of the wdman package, which allowed me to not bother with Docker at all. wdman also fetches all the binaries you need if they aren't already available. It's nice magic.

Here's a simple script to spin up a Selenium instance with Chrome, open a site, get the contents as xml and then close it all down again.

library(wdman)
library(RSelenium)
library(xml2)

# start a selenium server with wdman, running the latest chrome version
selServ <- wdman::selenium(
  port = 4444L,
  version = 'latest',
  chromever = 'latest'
)

# start your chrome driver on the selenium server
remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4444L,
  browserName = 'chrome'
)

# open a selenium browser tab
remDr$open()

# navigate to your site (replace the placeholder with the page you want)
some_url <- "https://example.com"
remDr$navigate(some_url)

# get the html contents of that site as an xml tree
page_xml <- xml2::read_html(remDr$getPageSource()[[1]])

# do your magic
# ... check the docs at `?remoteDriver` to see what your remDr object can do.

# clean up after yourself
remDr$close()
selServ$stop()
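
Once you have page_xml, you can query it like any other xml2 document. A minimal sketch, assuming (hypothetically) you want the text of every <h1> on the page:

# find nodes by XPath and extract their text (the selector is just an example)
headings <- xml2::xml_find_all(page_xml, "//h1")
xml2::xml_text(headings)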

How to scrape this website in R using rvest?

You need to pick up a session cookie (ASP.NET_SessionId) from the initial URL. You can use rvest's session() for this, for example:

library(rvest)
library(magrittr)

# open a session on the landing page to pick up the cookie,
# then jump to the page that actually serves the tables
r <- session('https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx') %>%
  session_jump_to('https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx')

tables <- r %>% read_html() %>% html_table()
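
html_table() returns one data frame per <table> on the page, so inspect the list to find the one you need:

# how many tables were parsed, and a peek at the first
length(tables)
head(tables[[1]])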

web scraping Javascript heavy site using rvest

There's an API you can use:

library(jsonlite)

# the map markers come straight from this endpoint as JSON
df <- fromJSON('https://www.forest-trends.org/wp-content/themes/foresttrends/map_tools/project_fetch.php?ids=')
head(df$markers)
lat lng type
1 -11.78449871 -70.73347813 Forest and land-use carbon
2 17.067346 94.459977 Forest and land-use carbon
3 3.054216 -72.333984 Forest and land-use carbon
4 20.98685 -89.03344 Forest and land-use carbon
5 -0.886093 30.5798 Forest and land-use carbon
6 -1.809978 31.131299 Forest and land-use carbon
title location pid size
1 Reforestadores REDD Project Madre de Dios, Peru 1 85000
2 Reforestation and Restoration of degraded mangrove lands, sustainable livelihood and community development in Myanmar Myanmar 2 2575
3 San Nicolas Carbon Sequestration Project San Nicholas, Colombia 3 7300
4 Amigos de Calakmul Mexico Selva Maya, Mexico 4 56700
5 Uganda Nile Basin Reforestation Project No 4 Uganda 5 347
6 Emiti Nibwo Bulora Nyaishozi, Tanzania 6 130
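
df$markers is an ordinary data frame, so you can summarise it directly; for example, counting projects by type:

# tally the markers by project type
table(df$markers$type)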

Scraping Javascript rendered content using R

Although the page uses JavaScript, the data itself arrives as JSON, so you can avoid running any JavaScript by calling the hidden API directly:

library(rvest)
library(jsonlite)
library(dplyr) # for glimpse()

my_url <- "https://www.kroger.com/cl/api/coupons?couponsCountPerLoad=418&sortType=relevance&newCoupons=false" # hidden api

# the endpoint returns JSON; read_html wraps it in a <p> node, so pull the text back out
pagesource <- read_html(my_url)
content <- pagesource %>% html_node("p") %>% html_text()
data <- fromJSON(content)
mydata <- data$data$coupons

> glimpse(mydata)
Observations: 418
Variables: 19
$ id <int> 2149194, 2149191, 2127870, 2129277, 2128587, 2126349, 2121480, 2128278, 2157633, 2169615, 2159613, 2140047, 2159769, 2167485, 2141526...
$ brandName <chr> "Other", "Other", "Store Brand", "Store Brand", "Store Brand", "Store Brand", "Sargento", "Hallmark", "Colgate", "Oscar Mayer", "Kett...
$ longDescription <chr> "Selling or purchasing fuel points is prohibited. Fuel redemption offer cannot be combined with any other discounts. No discounts to ...
$ shortDescription <chr> "Get 4x FUEL Points on FRI - SAT - SUN Only", "Get 4x FUEL Points on FRI - SAT - SUN Only", "2x Fuel Points", "Save $0.50 on 2 Kroger...
$ requirementDescription <chr> "when you buy a participating gift card. *Restrictions apply, see store for details.", "when you buy a $25, $50 or $100 Mastercard® o...
$ categories <list> ["Gift Cards", "Gift Cards", "General", "Snacks", <"Promotions", "Frozen">, "General", "Dairy", "General", <"Baking Goods", "Health ...
$ expirationDate <chr> "2018-05-13T04:00:00Z", "2018-05-13T04:00:00Z", "2018-07-29T04:00:00Z", "2018-05-26T04:00:00Z", "2018-05-26T04:00:00Z", "2018-05-29T0...
$ lastRedemptionDate <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "...
$ displayStartDate <chr> "2018-05-07T04:00:00Z", "2018-05-07T04:00:00Z", "2018-04-30T04:00:00Z", "2018-04-18T04:00:00Z", "2018-04-18T04:00:00Z", "2018-05-02T0...
$ imageUrl <chr> "https://cdnws.softcoin.com/mediaCache/ecoupon_1585374.png", "https://cdnws.softcoin.com/mediaCache/ecoupon_1585365.png", "https://cd...
$ krogerCouponNumber <chr> "800000013010", "800000013711", "10000008220", "800000012111", "800000012554", "800000014782", "800000015150", "800000022503", "80000...
$ addedToCard <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ canBeAddedToCard <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
$ canBeRemoved <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
$ filterTags <list> [<"FT4XGRAD", "FTBL4XGRADFM", "FTBL4XGRAD", "FTBL4XMOMGC", "FTBL4XMOM1", "4XGCWEEKEND", "FTBL4XGRAD2", "KPF">, <"FTBL4XGRAD1", "4XGC...
$ title <chr> "Get 4x FUEL Points on FRI - SAT - SUN Only", "Get 4x FUEL Points on FRI - SAT - SUN Only", "2x Fuel Points", "Save 50¢", "Save 50¢",...
$ displayDescription <chr> "", "", "", "on 2 Kroger Potato Chips", "on 2 Kroger Deluxe Ice Cream", "", "on Sargento® Blends™ Slices", "on 2 Hallmark Cards", "on...
$ redemptionsAllowed <int> -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 5, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ value <dbl> 1.00, 1.00, 1.00, 0.50, 0.50, 20.00, 0.75, 1.00, 0.50, 1.25, 1.00, 0.50, 1.49, 1.00, 1.00, 1.00, 0.75, 2.00, 0.50, 0.50, 1.00, 1.00, ...
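
Since the endpoint itself serves JSON, you can usually skip the read_html round trip and let jsonlite fetch the URL directly (assuming the server doesn't require browser-style headers):

# fetch and parse the JSON in one step
data <- fromJSON(my_url)
mydata <- data$data$coupons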

