Scraping JavaScript Websites in R

Scraping a JavaScript website in R

So, RSelenium is not the only answer (anymore). If you can install the PhantomJS binary (grab it from http://phantomjs.org/), you can use it to render the HTML and scrape the result with rvest (similar to the RSelenium approach, but it doesn't require Java):

library(rvest)

# write a small PhantomJS script that renders the page and prints its source
url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"

writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
  console.log(page.content); // print the rendered page source
  phantom.exit();
});", url), con = "scrape.js")

# run the script and capture the rendered HTML in a file
system("phantomjs scrape.js > scrape.html", intern = TRUE)

# extract the content you need
pg <- read_html("scrape.html")
pg %>% html_nodes("#utime") %>% html_text()

## [1] "10:20 AM, October 28, 2014"

Web scraping with R and rvest when there is JavaScript-rendered content in the web page

Here's what it would look like using RSelenium to get the page to load.

library(rvest)
library(RSelenium)

# start a Selenium server and a Chrome client
remDr <- rsDriver(browser = 'chrome', port = 4444L)
brow <- remDr[["client"]]

# the client returned by rsDriver is already open, so navigate straight away
brow$navigate("https://www.filmweb.no/kinotoppen/")

# pull the rendered page source and parse it
h <- brow$getPageSource()
h <- read_html(h[[1]])
h %>%
  html_nodes(".Kinotoppen_MovieTitle__2MFbT") %>%
  html_text()
# [1] "Spider-Man: No Way Home" "Clifford: Den store røde hunden" "Lise & Snøpels - Venner for alltid"
# [4] "Familien Voff - alle trenger en venn" "Nightmare Alley" "Snødronningen"
# [7] "Scream" "Bergman Island" "Trøffeljegerne fra Piemonte"
# [10] "Encanto"

Reading the content of a Javascript-rendered webpage into R

The data comes from an API call that returns JSON, so you can make the same GET request and then extract the usernames. Swap x$UserName for x$CustomerId if you want ids instead (see the sketch after the code).

library(jsonlite)

data <- jsonlite::read_json('https://www.etoro.com/sapi/rankings/rankings/?activeweeksmin=24&blocked=false&bonusonly=false&copiersmax=5000&copyblock=false&copyinvestmentpctmax=0&copytradespctmax=0&dailyddmin=-10&displayfullname=true&gainmax=100&gainmin=5&hasavatar=true&highleveragepctmax=10&isfund=false&istestaccount=false&lastactivitymax=14&longpospctmax=80&lowleveragepctmin=50&maxdailyriskscoremax=5&maxmonthlyriskscoremax=5&maxmonthlyriskscoremin=1&optin=true&page=1&pagesize=20&period=OneYearAgo&profitableweekspctmin=50&sort=-gain&tradesmin=20&verified=true&weeklyddmin=-20&winratiomax=85')

users <- lapply(data$Items, function(x) {x$UserName})
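
The same pattern works for the ids mentioned above. A minimal sketch, using vapply to get a flat character vector instead of a list:

# extract CustomerId instead of UserName, flattened to a character vector
ids <- vapply(data$Items, function(x) as.character(x$CustomerId), character(1))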

Scraping Javascript-Rendered Content in R from a Webpage without Unique URL

You are right: the contents of the page are updated by JavaScript via an AJAX request, and the server returns a JSON string in response to an HTTP POST request. With POST requests, the server's response is determined not only by the URL you request, but also by the body of the message you send. In this case, the body is a simple form with three fields: gameName, which is always "LOTTO"; isAjax, which is always "true"; and drawNumber, which is the field you want to vary.

If you are using httr, you specify these fields as a named list in the body parameter of the POST function.

Once you have the response for each draw, you will want to parse the JSON into an R-friendly format such as a list or data frame, using a library such as jsonlite. From the structure of this particular JSON, it makes the most sense to extract the component $data$drawDetails and turn it into a one-row data frame. This allows you to bind several draws together into a single data frame.
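
For a single draw, the request looks like this (a minimal sketch; the URL and form fields are the same ones used in the full function below):

# POST the form for one draw and inspect the parsed JSON
res <- httr::POST(
  paste0("https://www.nationallottery.co.za/index.php",
         "?task=results.redirectPageURL&Itemid=265&",
         "option=com_weaver&controller=lotto-history"),
  body = list(gameName = "LOTTO", drawNumber = 2009, isAjax = "true")
)
str(jsonlite::fromJSON(httr::content(res, "text"))$data$drawDetails)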

Here is a function that does all that for you:

lotto_details <- function(draw_numbers) {
  do.call("rbind", lapply(draw_numbers, function(x) {
    res <- httr::POST(paste0("https://www.nationallottery.co.za/index.php",
                             "?task=results.redirectPageURL&",
                             "Itemid=265&option=com_weaver&",
                             "controller=lotto-history"),
                      body = list(gameName = "LOTTO", drawNumber = x, isAjax = "true"))
    as.data.frame(jsonlite::fromJSON(httr::content(res, "text"))$data$drawDetails)
  }))
}

Which you use like this:

lotto_details(2009:2012)
#> drawNumber drawDate nextDrawDate ball1 ball2 ball3 ball4 ball5 ball6
#> 1 2009 2020/04/01 2020/04/04 51 15 7 32 42 45
#> 2 2010 2020/04/04 2020/04/08 43 4 21 24 10 3
#> 3 2011 2020/04/08 2020/04/11 42 43 8 18 2 29
#> 4 2012 2020/04/11 2020/04/15 48 6 43 41 25 45
#> bonusBall div1Winners div1Payout div2Winners div2Payout div3Winners
#> 1 1 0 0 0 0 21
#> 2 22 0 0 0 0 31
#> 3 34 0 0 0 0 21
#> 4 38 1 10546013.8 0 0 28
#> div3Payout div4Winners div4Payout div5Winners div5Payout div6Winners
#> 1 8455.3 60 2348.7 1252 189 1786
#> 2 6004.3 71 2080.6 1808 137.3 2352
#> 3 8584.5 60 2384.6 1405 171.1 2079
#> 4 7676.4 62 2751.4 1389 206.3 1872
#> div6Payout div7Winners div7Payout div8Winners div8Payout rolloverAmount
#> 1 115.2 24664 50 19711 20 3809758.17
#> 2 91.7 35790 50 25981 20 5966533.86
#> 3 100.5 27674 50 21895 20 8055430.87
#> 4 133 28003 50 20651 20 0
#> rolloverNumber totalPrizePool totalSales estimatedJackpot
#> 1 2 6198036.67 9879655 6000000
#> 2 3 9073426.56 11696905 8000000
#> 3 4 10649716.37 10406895 10000000
#> 4 0 13280236.5 11610950 2000000
#> guaranteedJackpot drawMachine ballSet status winners millionairs
#> 1 0 RNG2 RNG published 47494 0
#> 2 0 RNG2 RNG published 66033 0
#> 3 0 RNG2 RNG published 53134 0
#> 4 0 RNG2 RNG published 52006 1
#> gpwinners wcwinners ncwinners ecwinners mpwinners lpwinners fswinners
#> 1 47494 0 0 0 0 0 0
#> 2 66033 0 0 0 0 0 0
#> 3 53134 0 0 0 0 0 0
#> 4 52006 0 0 0 0 0 0
#> kznwinners nwwinners
#> 1 0 0
#> 2 0 0
#> 3 0 0
#> 4 0 0

Created on 2020-04-13 by the reprex package (v0.3.0)

R / Rvest / RSelenium: scrape data from JS Sites

I've had no problems using RSelenium with the help of the wdman package, which allowed me to not bother with Docker at all. wdman also fetches all the binaries you need if they aren't already available. It's nice magic.

Here's a simple script to spin up a Selenium instance with Chrome, open a site, get the contents as xml and then close it all down again.

library(wdman)
library(RSelenium)
library(xml2)

# start a selenium server with wdman, running the latest chrome version
selServ <- wdman::selenium(
  port = 4444L,
  version = 'latest',
  chromever = 'latest'
)

# start your chrome driver on the selenium server
remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4444L,
  browserName = 'chrome'
)

# open a selenium browser tab
remDr$open()

# navigate to your site (replace the placeholder with the page you want)
some_url <- "https://example.com"
remDr$navigate(some_url)

# get the html contents of that site as an xml tree
page_xml <- xml2::read_html(remDr$getPageSource()[[1]])

# do your magic
# ... check the docs at `?remoteDriver` to see what your remDr object can do.

# clean up after yourself
remDr$close()
selServ$stop()
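
Once you have page_xml, you can query it like any other xml2 document. A minimal sketch, assuming (hypothetically) you want the text of every <h1> on the page:

# find nodes by XPath and extract their text (the selector is just an example)
headings <- xml2::xml_find_all(page_xml, "//h1")
xml2::xml_text(headings)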

How to scrape this website in R using rvest?

You need to pick up a session cookie (ASP.NET_SessionId) from the initial URL. You can use rvest's session() for this, for example:

library(rvest)
library(magrittr)

# open a session on the landing page to pick up the cookie,
# then jump to the page that actually serves the tables
r <- session('https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx') %>%
  session_jump_to('https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx')

tables <- r %>% read_html() %>% html_table()
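
html_table() returns one data frame per <table> on the page, so inspect the list to find the one you need:

# how many tables were parsed, and a peek at the first
length(tables)
head(tables[[1]])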

web scraping Javascript heavy site using rvest

There's an API you can use:

library(jsonlite)

# the map markers come straight from this endpoint as JSON
df <- fromJSON('https://www.forest-trends.org/wp-content/themes/foresttrends/map_tools/project_fetch.php?ids=')
head(df$markers)
lat lng type
1 -11.78449871 -70.73347813 Forest and land-use carbon
2 17.067346 94.459977 Forest and land-use carbon
3 3.054216 -72.333984 Forest and land-use carbon
4 20.98685 -89.03344 Forest and land-use carbon
5 -0.886093 30.5798 Forest and land-use carbon
6 -1.809978 31.131299 Forest and land-use carbon
title location pid size
1 Reforestadores REDD Project Madre de Dios, Peru 1 85000
2 Reforestation and Restoration of degraded mangrove lands, sustainable livelihood and community development in Myanmar Myanmar 2 2575
3 San Nicolas Carbon Sequestration Project San Nicholas, Colombia 3 7300
4 Amigos de Calakmul Mexico Selva Maya, Mexico 4 56700
5 Uganda Nile Basin Reforestation Project No 4 Uganda 5 347
6 Emiti Nibwo Bulora Nyaishozi, Tanzania 6 130
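
df$markers is an ordinary data frame, so you can summarise it directly; for example, counting projects by type:

# tally the markers by project type
table(df$markers$type)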

Scraping Javascript rendered content using R

Although the page uses JavaScript, the data itself arrives as JSON, so you can avoid running any JavaScript by calling the hidden API directly:

library(rvest)
library(jsonlite)
library(dplyr) # for glimpse()

my_url <- "https://www.kroger.com/cl/api/coupons?couponsCountPerLoad=418&sortType=relevance&newCoupons=false" # hidden api

# the endpoint returns JSON; read_html wraps it in a <p> node, so pull the text back out
pagesource <- read_html(my_url)
content <- pagesource %>% html_node("p") %>% html_text()
data <- fromJSON(content)
mydata <- data$data$coupons

> glimpse(mydata)
Observations: 418
Variables: 19
$ id <int> 2149194, 2149191, 2127870, 2129277, 2128587, 2126349, 2121480, 2128278, 2157633, 2169615, 2159613, 2140047, 2159769, 2167485, 2141526...
$ brandName <chr> "Other", "Other", "Store Brand", "Store Brand", "Store Brand", "Store Brand", "Sargento", "Hallmark", "Colgate", "Oscar Mayer", "Kett...
$ longDescription <chr> "Selling or purchasing fuel points is prohibited. Fuel redemption offer cannot be combined with any other discounts. No discounts to ...
$ shortDescription <chr> "Get 4x FUEL Points on FRI - SAT - SUN Only", "Get 4x FUEL Points on FRI - SAT - SUN Only", "2x Fuel Points", "Save $0.50 on 2 Kroger...
$ requirementDescription <chr> "when you buy a participating gift card. *Restrictions apply, see store for details.", "when you buy a $25, $50 or $100 Mastercard® o...
$ categories <list> ["Gift Cards", "Gift Cards", "General", "Snacks", <"Promotions", "Frozen">, "General", "Dairy", "General", <"Baking Goods", "Health ...
$ expirationDate <chr> "2018-05-13T04:00:00Z", "2018-05-13T04:00:00Z", "2018-07-29T04:00:00Z", "2018-05-26T04:00:00Z", "2018-05-26T04:00:00Z", "2018-05-29T0...
$ lastRedemptionDate <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "...
$ displayStartDate <chr> "2018-05-07T04:00:00Z", "2018-05-07T04:00:00Z", "2018-04-30T04:00:00Z", "2018-04-18T04:00:00Z", "2018-04-18T04:00:00Z", "2018-05-02T0...
$ imageUrl <chr> "https://cdnws.softcoin.com/mediaCache/ecoupon_1585374.png", "https://cdnws.softcoin.com/mediaCache/ecoupon_1585365.png", "https://cd...
$ krogerCouponNumber <chr> "800000013010", "800000013711", "10000008220", "800000012111", "800000012554", "800000014782", "800000015150", "800000022503", "80000...
$ addedToCard <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ canBeAddedToCard <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
$ canBeRemoved <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
$ filterTags <list> [<"FT4XGRAD", "FTBL4XGRADFM", "FTBL4XGRAD", "FTBL4XMOMGC", "FTBL4XMOM1", "4XGCWEEKEND", "FTBL4XGRAD2", "KPF">, <"FTBL4XGRAD1", "4XGC...
$ title <chr> "Get 4x FUEL Points on FRI - SAT - SUN Only", "Get 4x FUEL Points on FRI - SAT - SUN Only", "2x Fuel Points", "Save 50¢", "Save 50¢",...
$ displayDescription <chr> "", "", "", "on 2 Kroger Potato Chips", "on 2 Kroger Deluxe Ice Cream", "", "on Sargento® Blends™ Slices", "on 2 Hallmark Cards", "on...
$ redemptionsAllowed <int> -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 5, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ value <dbl> 1.00, 1.00, 1.00, 0.50, 0.50, 20.00, 0.75, 1.00, 0.50, 1.25, 1.00, 0.50, 1.49, 1.00, 1.00, 1.00, 0.75, 2.00, 0.50, 0.50, 1.00, 1.00, ...
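
Since the endpoint itself serves JSON, you can usually skip the read_html round trip and let jsonlite fetch the URL directly (assuming the server doesn't require browser-style headers):

# fetch and parse the JSON in one step
data <- fromJSON(my_url)
mydata <- data$data$coupons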

