Scraping leaderboard table on golf website in R
As already mentioned, this page is dynamically generated by some JavaScript.
Even the JSON file address seems to be dynamic, and the address you're trying to open isn't valid anymore:
https://lbdata.pgatour.com/2021/r/003/leaderboard.json?userTrackingId=exp=1612495792~acl=*~hmac=722f704283f795e8121198427386ee075ce41e93d90f8979fd772b223ea11ab9
An error occurred while processing your request.
Reference #199.cf05d517.1613439313.4ed8cf21
To get the data, you could use RSelenium after installing a Docker Selenium server.
The installation is straightforward, and Docker is designed to make images work out of the box.
After installing Docker, running the Selenium server is as simple as:
docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0
Note that the whole setup requires over 2 GB of disk space.
Selenium emulates a web browser and allows you, among other things, to get the final HTML content of the page after the JavaScript has been rendered:
library(RSelenium)
library(rvest)
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
# Open the connection to the Selenium server
remDr$open()
remDr$getStatus()
remDr$navigate("https://www.pgatour.com/leaderboard.html")
# Parse the rendered page once and reuse it for both columns
page <- xml2::read_html(remDr$getPageSource()[[1]])
players <- page %>%
  html_nodes(".player-name-col") %>%
  html_text()
total <- page %>%
  html_nodes(".total") %>%
  html_text()
# total[-1] drops the first ".total" match so both columns have the same length
data.frame(players = players, total = total[-1])
players total
1 Daniel Berger (PB) -18
2 Maverick McNealy (PB) -16
3 Patrick Cantlay (PB) -15
4 Jordan Spieth (PB) -15
5 Paul Casey (PB) -14
6 Nate Lashley (PB) -14
7 Charley Hoffman (PB) -13
8 Cameron Tringale (PB) -13
...
As the table doesn't use the table tag, html_table doesn't work and columns need to be extracted individually.
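One caveat with extracting each column separately is that the vectors can fall out of step if a row lacks a cell. A minimal sketch of a row-wise alternative follows, run on hypothetical inline markup rather than the real page (the .row class and the player names are assumptions made up for illustration). It relies on html_node() returning one result per input row, with NA where nothing matches.

```r
library(rvest)

# Hypothetical div-based "table"; the second row has no .total cell
html <- '
<div class="row"><span class="player-name-col">Player A</span><span class="total">-5</span></div>
<div class="row"><span class="player-name-col">Player B</span></div>
<div class="row"><span class="player-name-col">Player C</span><span class="total">-3</span></div>'

page <- read_html(html)
rows <- html_nodes(page, ".row")

# html_node() (singular) yields exactly one result per row, NA when missing
leaderboard <- data.frame(
  players = rows %>% html_node(".player-name-col") %>% html_text(),
  total   = rows %>% html_node(".total") %>% html_text()
)
print(leaderboard)
```

With this shape, a missing score shows up as NA in its own row instead of shifting the remaining scores up.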
Scraping a Dynamic Webpage in R
If you open the browser dev tools and click on each tab of the source webpage (player stats and course stats), you will see the following API calls, which return JSON.
library(jsonlite)
stats <- jsonlite::read_json('https://site.web.api.espn.com/apis/site/v2/sports/golf/pga/leaderboard/players?region=uk&lang=en&event=401056558')
course <- jsonlite::read_json('https://site.web.api.espn.com/apis/site/v2/sports/golf/pga/leaderboard/course?region=uk&lang=en&event=401056558')
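read_json() returns a nested list, so the next step is usually to flatten the part you need into a data frame. A minimal sketch on a hypothetical payload follows; the field names players, name and score are assumptions for illustration and not the real ESPN schema, which you would need to inspect with str(stats).

```r
library(jsonlite)

# Hypothetical JSON standing in for an API response
txt <- '{"players":[{"name":"Player A","score":-5},{"name":"Player B","score":-3}]}'

# simplifyVector = FALSE gives a plain nested list, like read_json() does
parsed <- fromJSON(txt, simplifyVector = FALSE)

# Build one small data frame per player, then stack them
rows <- lapply(parsed$players, function(p) {
  data.frame(name = p$name, score = p$score)
})
players_df <- do.call(rbind, rows)
print(players_df)
```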
In R, use rvest and xml2 to extract JSON object from a script element on website
This site generates the values of the hmac and expire URL parameters with a JS function that uses a specific algorithm. The arguments of this algorithm depend on the epoch time, which is passed as a URL parameter to the JS file hosting that function. This way, the hmac value is different each time, because it is computed from this file, whose URL changes constantly.
This algorithm consists of bitwise AND and XOR operations, like this (pseudocode):
step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step
step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step
....
....
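To make the pseudocode concrete, here is a hedged R sketch that traces its first two steps offline. The constants (start value 101, fixed value 1798339286, and the XOR keys -1798328965 and -1798324966) are the ones quoted in the snippets of this answer. The helper names to_int32 and next_step are made up for illustration. The keys change with the epoch time, so this only demonstrates the arithmetic, not a usable token.

```r
# Emulate JavaScript's signed 32-bit wrap-around (the effect of `& 4294967295` in JS)
to_int32 <- function(x) {
  x <- x %% 2^32
  if (x >= 2^31) x - 2^32 else x
}

# One step of the chain: ((value * value - fixedValue) & bitMask) ^ xorKey
next_step <- function(value, fixed_value, xor_key) {
  masked <- to_int32(value * value - fixed_value)
  bitwXor(as.integer(masked), as.integer(xor_key))
}

fixed_value <- 1798339286
xor_keys <- c(-1798328965, -1798324966)  # sample keys; they depend on the epoch time

value <- 101
result <- intToUtf8(value)  # first character comes from the start value
for (key in xor_keys) {
  value <- next_step(value, fixed_value, key)
  result <- paste0(result, intToUtf8(value))
}
print(result)  # "exp"
```

The three characters are the start of the exp=...~acl=*~hmac=... string that the full chain produces.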
The xorKey numbers are generated dynamically on https://microservice.pgatour.com/js based on the epoch time. You just need to request this JS file with the current epoch time as a URL parameter and use a regex to extract all the stepValues that are required in the above algorithm (they start with -1). You will also need to reproduce the above algorithm in R.
The following script generates the url parameters and makes the API call:
library(httr)
library(stringr)
library(bitops)
# fixed values
init <- 4294967295
value <- 101
encodedId <- 1798339286
result <- rawToChar(as.raw(value))
# epoch time is dynamic
time <- as.numeric(as.POSIXct(Sys.time()))*1000
output <- content(GET("https://microservice.pgatour.com/js", query = list(
"_" = format(time, digits=13)
)), as = "text", encoding = "UTF-8")
steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl=TRUE)) #extract steps
stepsNum <- as.numeric(unlist(steps)) #convert to num
for (t in stepsNum) {
  step <- bitXor(bitAnd(value * value - encodedId, init), t)
  result <- paste0(result, rawToChar(as.raw(step)))
  value <- step
}
print(result)
# extract leaderboard config url
output <- content(GET("https://www.pgatour.com/leaderboard.html"), as = "text", encoding = "UTF-8")
configUrl <- gsub("\\\\/", "/", str_match(output, "leaderboardUrl:\\s*'(.*)'")[2])
url <- paste0(configUrl, "?userTrackingId=", result)
data <- content(GET(url), as = "parsed", type = "application/json")
print(data)
kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract
How to find this algorithm?
I searched through the JavaScript code and reversed the obfuscated code into something understandable. It is quite a long way to go, so let's take it step by step.
Mission n°1 - search for leaderboardUrl
You gave the first hint in your question: the location of the config that contains a leaderboardUrl.
There is a JS file named stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js that has occurrences of leaderboardUrl in config.leaderboardUrl:
{
key: "getLeaderboardData",
value: function (t, r, n) {
var o = this,
e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()), <===================== HERE
a = [this.performFetch(e)].concat(
g(
"initial" === n && this.config.translationsUrl
? [y.default.load(this.config.translationsUrl)]
: []
)
),
..........
}
Let's look at the performFetch function, which seems to send the request:
{
key: "performFetch",
value: function (t) {
var r = this,
e =
1 < arguments.length && void 0 !== arguments[1]
? arguments[1]
: {};
return t
? ((0, a.isProtectedUrl)(t) &&
(t = this.getUrlWithAuth(t)), <===================== HERE
(0, o.default)(t, e)
.then(function (e) {
return r.checkFetchResponseStatus(e, t);
.................
We've spotted the getUrlWithAuth function:
{
key: "getUrlWithAuth",
value: function (e) {
var t = u.setTrackingUserId,
r = u.UserIdTracker,
n = r && r.getTrackingUserIdParam && r.getUserId; <===================== HERE
if (t && n) {
var o = r.getTrackingUserIdParam(), <===================== HERE
a = t(r.getUserId());
return u.setUrlParameter(e, o, a);
}
return e;
},
},
Now we have getUserId and getTrackingUserIdParam, which look like the function and variable that add the authorization parameters to the URL. The problem is that we have to find where these functions are defined.
Mission n°2 - Deobfuscation challenge: substitutions
I've spotted this file named main.c03ddfd249437fcce43410c35a21c6f8.js where there is an occurrence of getUserId and getTrackingUserIdParam:
var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
return t[g -= 398]
},
(function(g, e) {
for (var t = A; ; )
try {
if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
break;
g.push(g.shift())
} catch (e) {
g.push(g.shift())
}
}
)(t)
.................
function(g, e) {
var t = A
, C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
, I = t(428) + t(423) + t(407)
, o = t(483) + "rTr" + t(446) + t(477) + "Id";
C[t(489) + t(463) + t(469) + "cker"] = {
........................
getTrackingUserIdParam: function() {
return o
},
getUserId: function() {
return I
},
......................
}
}(jQuery, window)
},
I've skipped a lot of code in the above snippet to make it clearer.
You can see that there are substitutions here: using the t array as a base, the A function offsets the index to look up strings, and there is an init function that rotates the initial t array so that it decodes to the right strings.
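The rotate-until-checksum-matches trick is easier to see on a toy example. The sketch below is an illustration in R under made-up assumptions: the four-entry string table, the offset of 398 reused from the snippet, the checksum value 42, and the helper names lookup and parse_int are all hypothetical and much simpler than the real script.

```r
# Toy string table, stored in a rotated (wrong) order like the obfuscated `t` array
strings <- c("rId", "42abc", "get", "Use")
offset <- 398

# Equivalent of A(g): subtract the offset, then index (+1 because R is 1-indexed)
lookup <- function(i) strings[i - offset + 1]

# Rough stand-in for JS parseInt(): leading digits only, NA if there are none
parse_int <- function(s) {
  digits <- regmatches(s, regexpr("^-?[0-9]+", s))
  if (length(digits) == 0) NA_integer_ else as.integer(digits)
}

# Rotate the table until the checksum entry parses to the expected number
repeat {
  check <- parse_int(lookup(401))
  if (!is.na(check) && check == 42) break
  strings <- c(strings[-1], strings[1])  # same effect as g.push(g.shift())
}

decoded <- paste0(lookup(398), lookup(399), lookup(400))
print(decoded)  # "getUserId"
```

Once the table is in its final order, every lookup concatenation in the obfuscated code can be replaced by the plain string it produces.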
You can paste this snippet into a nodejs script, modify it a little and then you can use something like:
var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
return t[g -= 398]
};
console.log(t);
(function(g, e) {
for (var t = A; ; )
try {
if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
break;
g.push(g.shift())
} catch (e) {
g.push(g.shift())
}
}
)(t)
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);
// prints e[pgatour] = e[pgatour] || {};
Here e is window, so you "just" have to substitute all the A(XXX) calls in order to better understand what is going on.
You would spot this:
onBeforeSendRequest: function(g, e) {
var A = t;
if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
var I = this["getUse" + A(463)]()
, o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
, n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
}
},
which when decoded gives something like:
onBeforeSendRequest: function(g, e) {
if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
var I = this["getUserId"]()
, o = window["pgatour"]["setTrackingUserId"](I)
, n = this["getTrackingUserIdParam"]();
e.url = C["setUrlParameter"](e["url"], n, o)
}
},
The function we are looking for is window["pgatour"]["setTrackingUserId"]. But we could have known this since mission n°1. Remember in the first JS file:
var t = u.setTrackingUserId
with u being window.pgatour.
But here, the input parameter I is hard-coded:
var I = A(428) + A(423) + A(407);
which is equivalent to var I = "id8730931"
Now let's look at the window["pgatour"]["setTrackingUserId"] function.
Mission n°3 - Crypto/reverse
Open the Chrome developer console on the website and paste window["pgatour"]["setTrackingUserId"]. You will get something like this:
function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................
Yes :( again more obfuscated code to deal with
Looking through the application scripts, you will find that this function is defined in the following JS file:
https://microservice.pgatour.com/js?_=1618868625306
There is a URL parameter specifying an epoch time, and the code changes depending on this parameter.
Looking at the code itself, we get something like this after substituting the input parameters, which are String.fromCharCode and Math.abs:
((function($__$, _, $_$) {
var $$_ = 4294967295; <===================== doesn't change when the epoch time is updated
function _$__($) {
var $$__ = 42;
for (var _ = 0; _ < $.length; _++) {
$$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
}
return Math.abs($$__);
}
......
_$_ = (__ * __ - $$) & $$_ ^ -30086, <===================== doesn't change when the epoch time is updated
___ += _(_$_),
__ = _$_,
_$_ = (__ * __ - $$) & $$_ ^ -33221,
___ += _(_$_),
.....
$__$[__$_] = (function(_$_$ = "id8730931") { <===================== this is window["pgatour"]["setTrackingUserId"] function / input is id8730931
var $$ = _$__(_$_$);
var _$_, ___, __;
var __ = (__ = 101,
___ = String.fromCharCode(__),
_$_ = (__ * __ - $$) & $$_ ^ -1798328965, <===================== this change when epoch time is updated
___ += String.fromCharCode(_$_),
__ = _$_,
_$_ = (__ * __ - $$) & $$_ ^ -1798324966,
___ += String.fromCharCode(_$_),
__ = _$_,
....
__ = _$_,
___);
return __
}
);
}
)((window.pgatour || (window.pgatour = {})), String.fromCharCode, Math.abs));
We can make a nodejs script to reproduce this algorithm in a simpler way by extracting the step value (in the xor stage):
const axios = require("axios");
const init = 4294967295;
var value = 101;
var encodedId = 1798339286;
var result = String.fromCharCode(value);
(async function () {
const response = await axios.get(
"https://microservice.pgatour.com/js?_=1618868625506"
);
const data = response.data.match(/-17\d+/g).map((it) => parseInt(it));
for (const t of data) {
var step = ((value * value - encodedId) & init) ^ t;
result += String.fromCharCode(step);
value = step;
}
console.log(result);
})();
output:
exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd
If you change the epoch time, it will give a different result.
repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt
Then you just need to convert this nodejs script to R and make your HTTP call with the URL parameters.
Note that encodedId comes from the input id id8730931, converted using this function (those values don't seem to change with the epoch time):
var $$_ = 4294967295;
function _$__($) {
var $$__ = 42;
for (var _ = 0; _ < $.length; _++) {
$$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
}
return Math.abs($$__);
}
My guess is that the server checks that the hmac correctly refers to the initial id string id8730931, so it's safe to hardcode it (since it's also hardcoded on the server).
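For completeness, here is a hedged R translation of that hashing function, so that encodedId can be derived rather than copied. The names to_int32 and hash_id are made up for this sketch. JavaScript's `& 4294967295` wraps the running value to a signed 32-bit integer, which R doubles do not do on their own, so the wrap is emulated explicitly.

```r
# Emulate JavaScript's signed 32-bit wrap-around (the effect of `& 4294967295` in JS)
to_int32 <- function(x) {
  x <- x %% 2^32
  if (x >= 2^31) x - 2^32 else x
}

# R version of the obfuscated _$__ function: a 31-based rolling hash seeded with 42
hash_id <- function(id) {
  h <- 42
  for (code in utf8ToInt(id)) {
    h <- to_int32(h * 31 + code)
  }
  abs(h)
}

encodedId <- hash_id("id8730931")
print(format(encodedId, scientific = FALSE))  # "1798339286"
```

The result matches the encodedId constant used in the scripts above, which supports the reading that the constant is simply the hash of the hard-coded id.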
Confusion Regarding HTML Code For Web Scraping With R
The tricky part is determining the correct set of attributes to select only this one HTML node.
In this case, it is the span tag with the classes Trsdu(0.3s) and Fz(36px).
library(rvest)
url="https://finance.yahoo.com/quote/SPY"
#read page once
page <- read_html(url)
#now extract information from the page
price <- page %>% html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
html_text()
price
Note: "(", ")", and "." are all special characters in a CSS selector, hence the need to escape each of them with a double backslash "\\".
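If the escaping feels error-prone, an XPath query is one possible alternative, since parentheses and dots need no escaping inside an XPath string literal. A minimal sketch on a hypothetical inline snippet follows (the markup and the price are made up, and the real Yahoo page may use different classes):

```r
library(rvest)

# Hypothetical markup imitating the price element; not fetched from Yahoo
html <- '<div><span class="Trsdu(0.3s) Fz(36px)">123.45</span><span class="Fz(12px)">other</span></div>'
page <- read_html(html)

# contains() matches against the raw class attribute, so nothing needs escaping
price <- page %>%
  html_nodes(xpath = "//span[contains(@class, 'Trsdu(0.3s)') and contains(@class, 'Fz(36px)')]") %>%
  html_text()
print(price)  # "123.45"
```

Note that contains() does substring matching, so it is looser than a CSS class selector and could also match a class such as Fz(36px)!.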
Scraping html header with R
Unless you use Selenium, it will be very hard.
NOAA encourages you to access their free RESTful JSON APIs. It also goes to great lengths to discourage HTML scraping.
That said, the following code will get what you want from a NOAA json in a data frame.
library(tidyverse)
library(jsonlite)
j1 <- fromJSON(txt = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/8467150.json', simplifyDataFrame = TRUE, flatten = TRUE)
j1$stations %>% as_tibble() %>% select(name, state, id)
Results
# A tibble: 1 x 3
name state id
<chr> <chr> <chr>
1 Bridgeport CT 8467150
How can I scrape this data?
I'd probably use the GET request that the page is making to get the raw data from their API and work on parsing that...
content(a) gives you a list representation (basically the output from fromJSON()), or as(a, "character") gives you the raw JSON.
library("httr")
a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
content(a)
as(a, "character")