Scraping Leaderboard Table on Golf Website in R

As already mentioned, this page is generated dynamically by JavaScript.

Even the address of the JSON file seems to be dynamic, and the address you're trying to open is no longer valid:

https://lbdata.pgatour.com/2021/r/003/leaderboard.json?userTrackingId=exp=1612495792~acl=*~hmac=722f704283f795e8121198427386ee075ce41e93d90f8979fd772b223ea11ab9

An error occurred while processing your request.

Reference #199.cf05d517.1613439313.4ed8cf21

To get the data, you could use RSelenium after installing a Docker Selenium server.

The installation is straightforward, and Docker is designed to make images work out of the box.

After installing Docker, running the Selenium server is as simple as:

docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0

Note that the whole setup requires over 2 GB of disk space.

Selenium emulates a web browser and, among other things, lets you retrieve the final HTML content of the page after the JavaScript has been rendered:

library(RSelenium)
library(rvest)

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
# Open the connection to the Selenium server
remDr$open()
remDr$getStatus()

remDr$navigate("https://www.pgatour.com/leaderboard.html")

# Parse the rendered page once and reuse it for both columns
page <- xml2::read_html(remDr$getPageSource()[[1]])

players <- page %>%
  html_nodes(".player-name-col") %>%
  html_text()

total <- page %>%
  html_nodes(".total") %>%
  html_text()

# total[-1] drops the first ".total" match so the two columns line up
data.frame(players = players, total = total[-1])

                players total
1    Daniel Berger (PB)   -18
2 Maverick McNealy (PB)   -16
3  Patrick Cantlay (PB)   -15
4    Jordan Spieth (PB)   -15
5       Paul Casey (PB)   -14
6     Nate Lashley (PB)   -14
7  Charley Hoffman (PB)   -13
8 Cameron Tringale (PB)   -13
...

As the table doesn't use the table tag, html_table doesn't work and columns need to be extracted individually.

Scraping a Dynamic Webpage in R

If you open the browser's dev tools and click on each tab of the source web page (player stats and course stats), you will see the following API calls, which return JSON:

library(jsonlite)

stats <- jsonlite::read_json('https://site.web.api.espn.com/apis/site/v2/sports/golf/pga/leaderboard/players?region=uk&lang=en&event=401056558')
course <- jsonlite::read_json('https://site.web.api.espn.com/apis/site/v2/sports/golf/pga/leaderboard/course?region=uk&lang=en&event=401056558')

In R, use rvest and xml2 to extract JSON object from a script element on website

This site generates the values of the hmac and expire URL parameters with a JS function that uses a specific algorithm. The arguments of this algorithm depend on the epoch time, which is passed as a URL parameter to the JS file hosting that function. As a result, the hmac value is different each time, because it is computed from a file whose URL changes constantly.

The algorithm consists of bitwise AND and XOR operations, like this (pseudocode):

step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step

step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step

....
....

The xorKey numbers are generated dynamically at https://microservice.pgatour.com/js based on the epoch time. You just need to request this JS file with the current epoch time as a URL parameter and use a regex to extract all the step values required by the above algorithm (they start with -1). You will also need to reproduce the algorithm above in R.
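As a minimal offline sketch of that recurrence (in JavaScript, the language of the site's own code), the two XOR keys quoted further down in this answer from one snapshot of the JS file are enough to reproduce the first characters of the token. The function name deriveToken and the constant names are only illustrative; the numeric values are the ones used in this answer.

```javascript
// Constants taken from this answer.
const BIT_MASK = 4294967295;   // 0xFFFFFFFF
const ENCODED_ID = 1798339286; // hash of the hard-coded id "id8730931"
const SEED = 101;              // char code of "e", the first character

// Illustrative helper (not part of the site's code): applies one
// square / subtract / AND / XOR round per key and collects the characters.
function deriveToken(xorKeys) {
  let value = SEED;
  let result = String.fromCharCode(value);
  for (const key of xorKeys) {
    const step = ((value * value - ENCODED_ID) & BIT_MASK) ^ key;
    result += String.fromCharCode(step);
    value = step;
  }
  return result;
}

// Two keys from one snapshot of the JS file; they change with the epoch time.
console.log(deriveToken([-1798328965, -1798324966])); // prints "exp"
```

Each key is chosen by the server so that the XOR cancels everything except a small character code, which is why the output starts with readable text.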

The following script generates the URL parameters and makes the API call:

library(httr)
library(stringr)
library(bitops)

# fixed values
init <- 4294967295      # 32-bit mask (0xFFFFFFFF)
value <- 101            # char code of "e", the first character of the result
encodedId <- 1798339286 # hash of the hard-coded id "id8730931"
result <- rawToChar(as.raw(value))

# epoch time (in milliseconds) is dynamic
time <- as.numeric(as.POSIXct(Sys.time())) * 1000

output <- content(GET("https://microservice.pgatour.com/js", query = list(
  "_" = format(time, digits = 13)
)), as = "text", encoding = "UTF-8")

# extract the XOR steps (negative numbers starting with -1) and convert them to numeric
steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl = TRUE))
stepsNum <- as.numeric(unlist(steps))

for (t in stepsNum) {
  step <- bitXor(bitAnd(value * value - encodedId, init), t)
  result <- paste0(result, rawToChar(as.raw(step)))
  value <- step
}
print(result)

# extract leaderboard config url
output <- content(GET("https://www.pgatour.com/leaderboard.html"), as = "text", encoding = "UTF-8")
configUrl <- gsub("\\\\/", "/", str_match(output, "leaderboardUrl:\\s*'(.*)'")[2])

url <- paste0(configUrl, "?userTrackingId=", result)
data <- content(GET(url), as = "parsed", type = "application/json")

print(data)

Kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract

How to find this algorithm?

I searched through the JavaScript code and deobfuscated it into something understandable. This is quite a long way to go, so let's take it step by step.

Mission n°1 - search for leaderboardUrl

You gave the first hint in your question: the location of the config that contains leaderboardUrl.

There is a JS file named stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js that has occurrences of leaderboardUrl in config.leaderboardUrl:

{
  key: "getLeaderboardData",
  value: function (t, r, n) {
    var o = this,
      e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()), // <===== HERE
      a = [this.performFetch(e)].concat(
        g(
          "initial" === n && this.config.translationsUrl
            ? [y.default.load(this.config.translationsUrl)]
            : []
        )
      ),
    ..........
}

Let's look at the performFetch function, which seems to send the request:

{
  key: "performFetch",
  value: function (t) {
    var r = this,
      e =
        1 < arguments.length && void 0 !== arguments[1]
          ? arguments[1]
          : {};
    return t
      ? ((0, a.isProtectedUrl)(t) &&
          (t = this.getUrlWithAuth(t)), // <===== HERE
        (0, o.default)(t, e)
          .then(function (e) {
            return r.checkFetchResponseStatus(e, t);
    .................

We've spotted the getUrlWithAuth function:

{
  key: "getUrlWithAuth",
  value: function (e) {
    var t = u.setTrackingUserId,
      r = u.UserIdTracker,
      n = r && r.getTrackingUserIdParam && r.getUserId; // <===== HERE
    if (t && n) {
      var o = r.getTrackingUserIdParam(), // <===== HERE
        a = t(r.getUserId());
      return u.setUrlParameter(e, o, a);
    }
    return e;
  },
},

Now we have getUserId and getTrackingUserIdParam, which appear to supply the authorization parameter that is added to the URL. The problem is that we still have to find where they are defined.

Mission n°2 - Deobfuscation challenge: substitutions

I spotted a file named main.c03ddfd249437fcce43410c35a21c6f8.js that contains an occurrence of getUserId and getTrackingUserIdParam:

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
    return t[g -= 398]
},
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
.................
function(g, e) {
    var t = A
      , C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
      , I = t(428) + t(423) + t(407)
      , o = t(483) + "rTr" + t(446) + t(477) + "Id";
    C[t(489) + t(463) + t(469) + "cker"] = {
        ........................
        getTrackingUserIdParam: function() {
            return o
        },
        getUserId: function() {
            return I
        },
        ......................
    }
}(jQuery, window)
},

I've skipped a lot of code in the above snippet to make it clearer.

You can see that substitutions are used here: the strings are looked up in the t array through the A function, which applies an offset, and an init function rotates the initial t array so that the lookups decode to the right strings.

You can paste this snippet into a Node.js script, modify it a little, and then use something like:

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];

var A = function(g, e) {
    return t[g -= 398]
};
console.log(t);
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);

// prints e[pgatour] = e[pgatour] || {};

Here e is window, so you "just" have to substitute all the A(XXX) calls to better understand what is going on.

You would spot this:

onBeforeSendRequest: function(g, e) {
    var A = t;
    if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
        var I = this["getUse" + A(463)]()
          , o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
          , n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
        e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
    }
},

which when decoded gives something like:

onBeforeSendRequest: function(g, e) {
    if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
        var I = this["getUserId"]()
          , o = window["pgatour"]["setTrackingUserId"](I)
          , n = this["getTrackingUserIdParam"]();
        e.url = C["setUrlParameter"](e["url"], n, o)
    }
},

The function we are looking for is window["pgatour"]["setTrackingUserId"]. But we could have known this since mission n°1. Remember in the first JS file:

var t = u.setTrackingUserId

where u is window.pgatour.

But here the input parameter I is hard-coded:

var I = A(428) + A(423) + A(407);

which is equivalent to var I = "id8730931"
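The string-array substitution used throughout this mission can be illustrated with a toy version. This is only a sketch of the pattern: the fragments, the stop condition of the rotation and the function names are made up, and only the offset of 398 mirrors the real A function.

```javascript
// Toy version of the pattern; the fragments and the stop condition are made up.
const fragments = ["tour", "pga", "get", "Id", "User"];
const OFFSET = 398;

// Lookup helper: callers pass index + OFFSET instead of the plain index.
function lookup(n) {
  return fragments[n - OFFSET];
}

// Init step: rotate the array until a condition holds. The real code compares
// a sum of parseInt() calls on several entries with a constant; here the
// condition is simply "the first entry is 'pga'".
function rotateUntilReady(arr) {
  while (arr[0] !== "pga") {
    arr.push(arr.shift());
  }
}

rotateUntilReady(fragments);
// fragments is now ["pga", "get", "Id", "User", "tour"]
console.log(lookup(398) + lookup(402));               // prints "pgatour"
console.log(lookup(399) + lookup(401) + lookup(400)); // prints "getUserId"
```

Until the rotation has run, the same lookups return the wrong fragments, which is why the init function has to be executed before substituting any A(XXX) call.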

Now let's look at the window["pgatour"]["setTrackingUserId"] function.

Mission n°3 - Crypto/reverse

Open the Chrome developer console on the website and paste window["pgatour"]["setTrackingUserId"]; you will get something like this:

function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................

Yes :( again more obfuscated code to deal with

By looking through the application scripts, you may find that it's located in the following JS file:

https://microservice.pgatour.com/js?_=1618868625306

There is a URL parameter specifying an epoch time, and the code changes depending on this parameter.

Looking at the code itself, we get something like this after substituting the input parameters, which are String.fromCharCode and Math.abs:

((function($__$, _, $_$) {
    var $$_ = 4294967295; // <===== doesn't change when the epoch time is updated
    function _$__($) {
        var $$__ = 42;
        for (var _ = 0; _ < $.length; _++) {
            $$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
        }
        return Math.abs($$__);
    }
    ......
    _$_ = (__ * __ - $$) & $$_ ^ -30086, // <===== doesn't change when the epoch time is updated
    ___ += _(_$_),
    __ = _$_,
    _$_ = (__ * __ - $$) & $$_ ^ -33221,
    ___ += _(_$_),
    .....
    $__$[__$_] = (function(_$_$ = "id8730931") { // <===== this is the window["pgatour"]["setTrackingUserId"] function; its input is id8730931
        var $$ = _$__(_$_$);
        var _$_, ___, __;
        var __ = (__ = 101,
            ___ = String.fromCharCode(__),
            _$_ = (__ * __ - $$) & $$_ ^ -1798328965, // <===== this changes when the epoch time is updated
            ___ += String.fromCharCode(_$_),
            __ = _$_,
            _$_ = (__ * __ - $$) & $$_ ^ -1798324966,
            ___ += String.fromCharCode(_$_),
            __ = _$_,
            ....
            __ = _$_,
            ___);
        return __
    }
    );
}
)((window.pgatour || (window.pgatour = {})), String.fromCharCode, Math.abs));

We can write a Node.js script that reproduces this algorithm in a simpler way by extracting the step values (used in the XOR stage):

const axios = require("axios");

const init = 4294967295;
let value = 101;
const encodedId = 1798339286;
let result = String.fromCharCode(value);

(async function () {
  const response = await axios.get(
    "https://microservice.pgatour.com/js?_=1618868625506"
  );
  // extract the step values (XOR keys) embedded in the JS file
  const data = response.data.match(/-17\d+/g).map((it) => parseInt(it, 10));

  for (const t of data) {
    const step = ((value * value - encodedId) & init) ^ t;
    result += String.fromCharCode(step);
    value = step;
  }
  console.log(result);
})();

output:

exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd

If you change the epoch time, it will give a different result.

repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt

Then you just need to convert this Node.js script to R and make your HTTP call with the URL parameters.
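The generated value is a "~"-separated list of key=value pairs, which can be split apart for inspection. The following is a small sketch: parseToken is an illustrative helper that does not exist in the site's code, and reading exp as a Unix timestamp in seconds is an assumption based on its magnitude.

```javascript
// Illustrative helper (not from the site's code): split "k=v~k=v" into an object.
function parseToken(token) {
  const fields = {};
  for (const part of token.split("~")) {
    const eq = part.indexOf("=");
    fields[part.slice(0, eq)] = part.slice(eq + 1);
  }
  return fields;
}

// Token printed by the Node.js script in this answer.
const token =
  "exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd";
const parsed = parseToken(token);
console.log(parsed.exp); // prints 1618882930
console.log(parsed.acl); // prints *

// Assumption: exp is a Unix timestamp in seconds. Compared with the epoch time
// used in the request (1618868625506 ms), it lies 14305 seconds ahead.
const requestSeconds = Math.floor(1618868625506 / 1000);
console.log(Number(parsed.exp) - requestSeconds); // prints 14305
```

If that reading is right, a token would stay usable for roughly four hours after the epoch time it was generated with.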

Note that encodedId comes from the input id id8730931, converted using this function (these values don't seem to change with the epoch time):

var $$_ = 4294967295;
function _$__($) {
    var $$__ = 42;
    for (var _ = 0; _ < $.length; _++) {
        $$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
    }
    return Math.abs($$__);
}

My guess is that the server checks that the hmac correctly refers to the initial id string id8730931, so it's safe to hard-code it (since it's also hard-coded on the server).
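The hash quoted above can be rewritten with readable names to check that the hard-coded encodedId really is derived from id8730931. This is a sketch: hashId is an illustrative name, and the logic is copied from the obfuscated function.

```javascript
// Readable rewrite of the obfuscated hash; hashId is an illustrative name.
function hashId(id) {
  let hash = 42;
  for (let i = 0; i < id.length; i++) {
    // "& 4294967295" makes JavaScript truncate the value to a signed 32-bit integer
    hash = (hash * 31 + id.charCodeAt(i)) & 4294967295;
  }
  return Math.abs(hash);
}

console.log(hashId("id8730931")); // prints 1798339286
```

The result matches the encodedId constant used in the scripts above, so that constant only needs to be recomputed if the site changes the id string.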

Confusion Regarding HTML Code For Web Scraping With R

The tricky part is determining the correct set of attributes to select only this one HTML node.

In this case, it is the span tag with the classes Trsdu(0.3s) and Fz(36px):

library(rvest)
url <- "https://finance.yahoo.com/quote/SPY"

# read page once
page <- read_html(url)

# now extract information from the page
price <- page %>%
  html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
  html_text()

price

Note: "(", ")", and "." are special characters in CSS selectors, so each one must be escaped with a double backslash ("\\") in the R string.
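The escaping can also be done programmatically instead of by hand. The sketch below, in JavaScript, uses a hypothetical helper named escapeClassName that handles only the three characters mentioned above; it is not a general CSS escaping routine.

```javascript
// Illustrative helper: backslash-escape "(", ")" and "." in a class name
// so that it can be used inside a CSS selector.
function escapeClassName(name) {
  return name.replace(/[().]/g, "\\$&");
}

const selector =
  "span." + escapeClassName("Trsdu(0.3s)") + "." + escapeClassName("Fz(36px)");
console.log(selector); // prints span.Trsdu\(0\.3s\).Fz\(36px\)
```

The printed selector contains single backslashes; the doubled backslashes only appear when the selector is written as a string literal in source code.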

Scraping html header with R

Unless you use Selenium, it will be very hard.
NOAA encourages you to access their free RESTful JSON APIs. It also goes to great lengths to discourage HTML scraping.

That said, the following code will get what you want from a NOAA JSON endpoint into a data frame.

library(tidyverse)
library(jsonlite)

j1 <- fromJSON(txt = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/8467150.json', simplifyDataFrame = TRUE, flatten = TRUE)

j1$stations %>% as_tibble() %>% select(name, state, id)

Results

# A tibble: 1 x 3
  name       state id
  <chr>      <chr> <chr>
1 Bridgeport CT    8467150

How can I scrape this data?

I'd probably use the GET request that the page itself makes to fetch the raw data from their API, and then work on parsing that.

In the code below, content(a) gives you a list representation, basically the output from fromJSON(), while as(a, "character") gives you the raw JSON.

library("httr")
a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
content(a)
as(a, "character")

