What's My User Agent When I Parse Website with Rvest Package in R

What's my user agent when I parse a website with the rvest package in R?

I used https://httpbin.org/user-agent to find out:

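library(rvest)
# html_session() was renamed to session() in rvest 1.0.0
se <- html_session("https://httpbin.org/user-agent")
# The user agent string that was sent with the request
se$response$request$options$useragent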

Answer:

[1] "libcurl/7.37.1 r-curl/0.9.1 httr/1.0.0"

See this bug report for a way to override it.

Change user agent when using rvest::read_html

Note: rvest and xml2 use httr under the hood, so I'll introduce httr in my answer here.

As you note in your post, dynamically setting the User Agent is very straightforward when using the httr package. As an example I'll use the link you listed above:

library(httr)

# Let's set user agent to a super common one
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"

# Query webpage
bbc <- GET("https://www.bbc.com/",
           user_agent(ua))

# Confirm it's actually used the desired user agent
bbc$request$options$useragent
#> [1] "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"

Now you can compare the User Agent value when using the httr defaults:

library(httr)

# Query webpage with default user agent
bbc <- GET("https://www.bbc.com/")

# Print default user agent value
bbc$request$options$useragent
#> [1] "libcurl/7.64.1 r-curl/4.3 httr/1.4.2"

Obviously, you can set the User Agent to whatever you want. Here is a list of common User Agents.
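
If the same User Agent is needed for several requests, one option is to build the httr config object once and reuse it. The following is a minimal sketch, not part of the original answer; the variable names ua and ua_cfg are illustrative. Inspecting the object needs no network access.

```r
library(httr)

# Illustrative user agent string; substitute whichever one you need
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"

# user_agent() returns a reusable request config object
ua_cfg <- user_agent(ua)

# The string is stored under options$useragent, the same field
# inspected on the response objects above
print(ua_cfg$options$useragent)
```

The object can then be passed to each call, for example GET("https://www.bbc.com/", ua_cfg), or to an rvest session, which accepts httr config objects as additional arguments.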

Pass user_agent() parameter in read_html()

You're almost there. Pass the string as a named argument, user_agent = "user agent", rather than calling user_agent("user-agent").

Here's a reprex to demonstrate:

library(httr)
library(rvest)
#> Loading required package: xml2

parse_rvest <- read_html("http://testing-ground.scraping.pro/",
                         user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0")
parse_rvest
#> {html_document}
#> <html class="no-js">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n\t\t<script type="text/javascript">\n\t\t\n\t\t var _gaq = _gaq ...

Created on 2020-03-02 by the reprex package (v0.3.0)

Which selector to write in rvest package in R?

  1. Extract the JSON object from the element's text (tidy the selector up while you're at it)

  2. Parse it as a list using jsonlite's fromJSON() function.

  3. Access the tags directly with json$ctags.

    library(rvest)
    library(jsonlite)

    # html() is deprecated in favor of read_html()
    json <- read_html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
      html_nodes("script:contains('var wp_dot_addparams')") %>%
      html_text() %>%  # work on the script's text, as in step 1
      gsub(x = ., pattern = ".*var wp_dot_addparams = (\\{.*\\});.*", replacement = "\\1") %>%
      fromJSON()

    json$ctags

    [1] "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions"
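
The three steps above can be tried offline on an inline page. This is a minimal sketch under stated assumptions: the HTML string and its values are made up for illustration, and the object literal is assumed to be valid JSON (fromJSON() will fail on JavaScript-only syntax such as unquoted keys).

```r
library(rvest)
library(jsonlite)

# Hypothetical page: a script that assigns an object literal to a variable
page <- '<html><body><script>var wp_dot_addparams = {"ctags": "a,b,c", "cid": 1};</script></body></html>'

# Step 1: pull the script text and keep only the {...} part
script_text <- read_html(page) %>%
  html_node("script") %>%
  html_text()
json_text <- sub(".*var wp_dot_addparams = (\\{.*\\});.*", "\\1", script_text)

# Step 2: parse the JSON into an R list
params <- fromJSON(json_text)

# Step 3: access the field, here split into individual tags
tags <- strsplit(params$ctags, ",")[[1]]
print(tags)
```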

403 Error When Using Rvest to Log Into Website For Scraping

Using R.S.'s suggestion, I used RSelenium to log in successfully.

A quick note for fellow Mac users on using either Chrome or PhantomJS: I am running El Capitan and had some issues getting the Mac to recognize the paths to both of the bin files. Instead, I moved the bin files to /usr/local/bin and they ran without an issue.

Below is the code to do so:

library(RSelenium)
# Note: startServer() is defunct in current RSelenium releases; rsDriver() replaces it
RSelenium::startServer()
remDr <- remoteDriver(browserName = "chrome")
remDr$open()
appURL <- 'https://www.optionslam.com/accounts/login/'
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))

appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)

This can also be done with PhantomJS:

library(RSelenium)

pJS <- phantom() # start phantomjs

appURL <- 'https://www.optionslam.com/accounts/login/'
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))

appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)

Rvest scraping child nodes but filling missing values with NA

With httr you can send a valid User-Agent header. I also rewrote your code to build a data.frame per node instead of a list, so that a missing value comes back as NA.

Inside the loop, swap html_elements for html_element: the singular form returns exactly one result per parent node, with NA when nothing matches.

You also need to make your XPaths relative (start them with .//) to avoid getting the first node's value repeated for each row.

library(tidyverse)
library(rvest) # html_elements(), html_element(), html_text(); not attached by tidyverse
library(httr)

headers <- c("User-Agent" = "Safari/537.36")

r <- httr::GET(
  url = "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml",
  httr::add_headers(.headers = headers)
)

r %>%
  content() %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(
    ~ data.frame(
      name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>%
        html_text(),
      title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>%
        html_text(),
      put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>%
        html_text()
    )
  )
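
The difference between the plural and singular functions can be seen offline on a tiny document. This is a minimal sketch; the XML and its element names (rows, row, name) are made up for illustration and are not taken from the SEC file.

```r
library(rvest)
library(xml2)

# Hypothetical two-row document: the second row has no <putCall> child
doc <- read_xml(
  "<rows>
     <row><name>A</name><putCall>Put</putCall></row>
     <row><name>B</name></row>
   </rows>"
)
rows <- html_elements(doc, xpath = "//row")

# Plural: collects every match across all rows, so the missing one just vanishes
plural <- html_text(html_elements(rows, xpath = ".//putCall"))

# Singular: exactly one result per row, NA where nothing matched
singular <- html_text(html_element(rows, xpath = ".//putCall"))

print(plural)
print(singular)
```

Because the singular form always returns one value per parent node, its results line up row by row and can go straight into data.frame columns.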

In R extract a declared variable from html

You can evaluate this via the excellent V8 package as follows:

require(rvest)
require(V8)
txt <- "<!DOCTYPE html>
<html>
<body>

<script>
var global_tmp_status = 0;
var global_goal_scored_overtime = [ ['x', 'Headed', 'Left foot', 'Right foot', 'Other', 'Overall'], ['14/8/2016', 1, 0, 2, 0, 3]];
</script>

</body>
</html>"
# probably you need another selector to "find" your script...
script <- read_html(txt) %>% html_node("script") %>% html_text(trim=TRUE)
ctx <- v8()
ctx$eval(script)
ctx$get("global_tmp_status")
ctx$get("global_goal_scored_overtime")

Resulting in:

> ctx$get("global_tmp_status")
[1] 0

and

> ctx$get("global_goal_scored_overtime")
     [,1]        [,2]     [,3]        [,4]         [,5]    [,6]
[1,] "x"         "Headed" "Left foot" "Right foot" "Other" "Overall"
[2,] "14/8/2016" "1"      "0"         "2"          "0"     "3"

