What's my user agent when I parse a website with the rvest package in R?
I used https://httpbin.org/user-agent to find out:
library(rvest)
se <- html_session( "https://httpbin.org/user-agent" )
se$response$request$options$useragent
Answer:
[1] "libcurl/7.37.1 r-curl/0.9.1 httr/1.0.0"
See this bug report for a way to override it.
Change user agent when using rvest::read_html
Note: rvest and xml2 use httr under the hood, so I'll introduce httr in my answer here.
As you note in your post, dynamically setting the User Agent is very straightforward when using the httr package. As an example I'll use the link you listed above:
library(httr)
# Let's set user agent to a super common one
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
# Query webpage
bbc <- GET("https://www.bbc.com/",
user_agent(ua))
# Confirm it's actually used the desired user agent
bbc$request$options$useragent
#> [1] "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
Now you can compare the User Agent value when using the httr defaults:
library(httr)
# Query webpage with default user agent
bbc <- GET("https://www.bbc.com/")
# Print default user agent value
bbc$request$options$useragent
#> [1] "libcurl/7.64.1 r-curl/4.3 httr/1.4.2"
Obviously, you can set the User Agent to whatever you want. Here is a list of common User Agents.
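Because user_agent() only builds a config object, the value can be checked before any request is made. A minimal sketch, assuming a made-up agent string (my_ua is hypothetical, not taken from the answer above):

```r
library(httr)

# Hypothetical agent string, used only for illustration
my_ua <- "my-scraper/0.1"

# user_agent() returns an httr config object; no request is sent here
cfg <- user_agent(my_ua)

# The agent is stored as a curl option on that object
cfg$options$useragent

# The same object can then be passed to GET() (needs network access):
# resp <- GET("https://httpbin.org/user-agent", cfg)
```

Building the config once and reusing it keeps the agent string in a single place when several requests are made.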
Pass user_agent() parameter in read_html()
You're almost there. You need to do user_agent = "user agent", not user_agent("user-agent").
Here's a reprex to demonstrate:
library(httr)
library(rvest)
#> Loading required package: xml2
parse_rvest <- read_html("http://testing-ground.scraping.pro/",
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0")
parse_rvest
#> {html_document}
#> <html class="no-js">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n\t\t<script type="text/javascript">\n\t\t\n\t\t var _gaq = _gaq ...
Created on 2020-03-02 by the reprex package (v0.3.0)
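If you want to confirm which agent was actually sent, one option is to route the request through httr and hand the response to read_html(). This is a sketch only: read_html_with_ua is a hypothetical helper name, and the commented-out call reuses the test URL and agent string from the reprex.

```r
library(httr)
library(xml2)

# Hypothetical helper: fetch with an explicit User-Agent, then parse the response
read_html_with_ua <- function(url, ua) {
  resp <- GET(url, user_agent(ua))
  stop_for_status(resp)  # fail early on 4xx/5xx responses
  # The agent that was sent is available at resp$request$options$useragent
  read_html(resp)
}

# Usage (needs network access):
# page <- read_html_with_ua(
#   "http://testing-ground.scraping.pro/",
#   "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
# )
```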
Which selector to write in rvest package in R?
Extract the JSON object from the element's text (tidy the selector up while you're at it)
Parse it as a list using jsonlite's fromJSON() function.
You can then access the tags directly with json$ctags.
library(rvest)
library(jsonlite)
# html() is deprecated in rvest; read_html() replaces it
json <- read_html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
html_nodes("script:contains('var wp_dot_addparams')") %>%
html_text() %>% # work on the script's text rather than the node object
gsub(x=., pattern=".*var wp_dot_addparams = (\\{.*\\});.*",replacement="\\1") %>%
fromJSON()
json$ctags
[1] "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions"
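The same steps can be tried offline on an inline page. A minimal sketch, assuming a made-up script body with a flat JSON object (no nested braces); the page content and the "a,b,c" tags are invented for illustration:

```r
library(rvest)
library(jsonlite)

# Made-up page standing in for the real one
page <- read_html('<html><body><script>
var wp_dot_addparams = {"ctags": "a,b,c", "cid": 1};
var other = 1;
</script></body></html>')

# 1. Pull the script text out of the document
js <- page %>% html_element("script") %>% html_text()

# 2. Extract the first {...} block (only valid for an object without nested braces)
json_txt <- regmatches(js, regexpr("\\{[^}]*\\}", js))

# 3. Parse it as a list and split the comma-separated tags
params <- fromJSON(json_txt)
tags <- strsplit(params$ctags, ",")[[1]]
tags
```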
403 Error When Using Rvest to Log Into Website For Scraping
Using R.S.'s suggestion, I used RSelenium to log in successfully.
A quick note for fellow Mac users on using either Chrome or PhantomJS: I am running El Capitan and had some issues getting the Mac to recognize the paths to both of the bin files. I moved the bin files to /usr/local/bin instead, and they ran without an issue.
Below is the code to do so:
library(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver(browserName = "chrome")
remDr$open()
appURL <- 'https://www.optionslam.com/accounts/login/'
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))
appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)
This can also be done with PhantomJS:
library(RSelenium)
pJS <- phantom() # start phantomjs
appURL <- 'https://www.optionslam.com/accounts/login/'
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))
appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)
Rvest scraping child nodes but filling missing values with NA
If you use httr, you can pass in a valid User-Agent header. Rewrite your code to build a data.frame rather than a list for each node, so that NA is returned where a value is not present.
Swap out html_elements for html_element inside the loop: html_element always returns one result per input node, with NA where nothing matches.
You also need to make your XPaths relative (starting with .//) to avoid getting the first node's value repeated for every row.
library(tidyverse)
library(rvest) # not attached by library(tidyverse); provides html_element(s) and html_text
library(httr)
headers <- c("User-Agent" = "Safari/537.36")
r <- httr::GET(url = "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml", httr::add_headers(.headers = headers))
r %>%
content() %>%
html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
# iterate over each parent node, pulling out desired parts and coerce to data.frame
# not the complete list
map_df(
~ data.frame(
name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>%
html_text(),
title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>%
html_text(),
put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>%
html_text()
)
)
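The difference between the two functions is easiest to see on a tiny inline document. A minimal sketch, assuming a made-up HTML snippet in which the second item has no <i> child:

```r
library(rvest)

# Made-up document: the second <li> lacks an <i> element
doc <- read_html("<ul>
  <li><b>A</b><i>x</i></li>
  <li><b>B</b></li>
</ul>")

items <- html_elements(doc, "li")

# html_elements() flattens its matches, so the missing value silently disappears
flat <- html_text(html_elements(items, "i"))
length(flat)

# html_element() returns exactly one result per parent, NA where nothing matches
per_row <- html_text(html_element(items, "i"))
per_row
```

Because per_row always has one entry per parent node, it lines up with the other columns when bound into a data.frame.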
In R extract a declared variable from html
You can evaluate this via the excellent V8 package as follows:
require(rvest)
require(V8)
txt <- "<!DOCTYPE html>
<html>
<body>
<script>
var global_tmp_status = 0;
var global_goal_scored_overtime = [ ['x', 'Headed', 'Left foot', 'Right foot', 'Other', 'Overall'], ['14/8/2016', 1, 0, 2, 0, 3]];
</script>
</body>
</html>"
# probably you need another selector to "find" your script...
script <- read_html(txt) %>% html_node("script") %>% html_text(trim=TRUE)
ctx <- v8()
ctx$eval(script)
ctx$get("global_tmp_status")
ctx$get("global_goal_scored_overtime")
Resulting in:
> ctx$get("global_tmp_status")
[1] 0
and
> ctx$get("global_goal_scored_overtime")
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "x" "Headed" "Left foot" "Right foot" "Other" "Overall"
[2,] "14/8/2016" "1" "0" "2" "0" "3"
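ctx$get() hands the array back as a character matrix whose first row holds the column names. If a data frame is more convenient, a base R sketch along these lines should work; the matrix m below is typed in by hand to mirror the output above rather than fetched from V8:

```r
# Hand-typed copy of the matrix printed above (assumption: same shape as ctx$get() returns)
m <- rbind(
  c("x", "Headed", "Left foot", "Right foot", "Other", "Overall"),
  c("14/8/2016", "1", "0", "2", "0", "3")
)

# Drop the header row, keep the result as a matrix, then convert
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]

# Every column except the date label holds counts
df[-1] <- lapply(df[-1], as.integer)
df
```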