Using tryCatch and rvest to deal with 404 and other crawling errors
You're looking for try or tryCatch, which are how R handles error catching. With try, you just need to wrap the thing that might fail in try(), and it will return the error and keep running:
library(rvest)
sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text()
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] "Error in open.connection(x, \"rb\") : HTTP error 404.\n"
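The error string ends up in the results because try() returns an object of class "try-error" that carries the message. A minimal sketch of filtering those out afterwards, assuming a stand-in function fetch_title() (hypothetical, used here so the example runs without network access):

```r
# Hypothetical stand-in for a scraping step: fails for URLs containing "missing"
fetch_title <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  paste("Title of", url)
}

urls <- c("site/a", "site/missing", "site/b")

# silent = TRUE keeps try() from printing the error message to the console
results <- lapply(urls, function(u) try(fetch_title(u), silent = TRUE))

# Failed calls come back as objects of class "try-error"
failed <- vapply(results, inherits, logical(1), what = "try-error")

titles <- unlist(results[!failed])  # successful results only
failed_urls <- urls[failed]         # inputs that raised an error
```

Using lapply instead of sapply keeps each result as its own object, so the "try-error" class is still there to test for.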
However, while that will get everything, it will also insert bad data into our results. tryCatch allows you to configure what happens when an error is raised by passing it a function to run when that condition arises:
sapply(Data$Pages, function(url){
  tryCatch(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text(),
    error = function(e){NA} # a function that returns NA regardless of what it's passed
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] NA
There we go; much better.
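The handler receives the condition object, so it can do more than return NA. As a rough sketch (parse_count() is a made-up helper, not part of rvest), separate handlers can be given for warnings and errors, and conditionMessage() extracts the message text:

```r
# Hypothetical helper: convert a string to a non-negative integer, or NA on any problem
parse_count <- function(x) {
  tryCatch(
    {
      n <- as.integer(x)                       # warns on non-numeric strings
      if (n < 0) stop("negative count: ", n)   # raise our own error
      n
    },
    warning = function(w) {
      message("Warning caught: ", conditionMessage(w))
      NA_integer_
    },
    error = function(e) {
      message("Error caught: ", conditionMessage(e))
      NA_integer_
    }
  )
}

counts <- vapply(c("12", "abc", "-3"), parse_count, integer(1), USE.NAMES = FALSE)
```

Here "abc" triggers the warning handler and "-3" triggers the error handler, so both become NA while "12" is parsed normally.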
Update
In the tidyverse, the purrr package offers two functions, safely and possibly, which work like try and tryCatch. They are adverbs, not verbs, meaning they take a function, modify it so as to handle errors, and return a new function (not a data object) which can then be called. Example:
library(tidyverse)
library(rvest)
df <- Data %>% rowwise() %>% # Evaluate each row (URL) separately
mutate(Pages = as.character(Pages), # Convert factors to character for read_html
title = possibly(~.x %>% read_html() %>% # Try to take a URL, read it,
html_nodes('h1') %>% # select header nodes,
html_text(), # and collect text inside.
NA)(Pages)) # If error, return NA. Call modified function on URLs.
df %>% select(title)
## Source: local data frame [4 x 1]
## Groups: <by row>
##
## # A tibble: 4 × 1
## title
## <chr>
## 1 'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages
## 2 OMG, this Japanese Trump Commercial is everything
## 3 Omar Mateen posted to Facebook during Orlando mass shooting
## 4 <NA>
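To see the difference between the two adverbs without touching the network, here is a small sketch using a made-up function risky_sqrt() as a stand-in for the scraping step. It assumes purrr is installed:

```r
library(purrr)

# Hypothetical stand-in for a step that can fail
risky_sqrt <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

inputs <- c(4, -1, 9)

# possibly(): on error, return the value given in `otherwise`
roots <- map_dbl(inputs, possibly(risky_sqrt, otherwise = NA_real_))

# safely(): always return list(result = , error = ), so errors can be inspected later
outcomes <- map(inputs, safely(risky_sqrt))
had_error <- map_lgl(outcomes, ~ !is.null(.x$error))
```

possibly is the closer match to the tryCatch example with a fixed fallback, while safely is useful when you want to keep the error objects around.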
How do I correctly use tryCatch() and ignore 404 errors in this rvest function?
As mentioned by User @27ϕ9, tryCatch() was not used correctly. The error handler needs to go outside the closing brace of the expression, as a separate argument to tryCatch():
tryCatch({
  # open connection to url
  for_html_code <- read_html(paste_url)
  for_lyrics <- html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[5]")
  for_albums <- html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[11]/div[1]/b")
}, error = function(e){NA}
)
For more information, refer to this answer here.
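One thing to watch with assignments inside the braces: if the first call fails, the later assignments never run, so variables may be missing or hold values from a previous iteration. A possible alternative, sketched here with a hypothetical fetch_song() standing in for the read_html()/html_node() calls, is to capture the value that tryCatch() returns:

```r
# Hypothetical stand-in for read_html() + html_node(); no network access needed
fetch_song <- function(url) {
  if (grepl("404", url)) stop("HTTP error 404.")
  list(lyrics = paste("lyrics from", url), album = paste("album from", url))
}

get_song <- function(url) {
  tryCatch(
    fetch_song(url),
    # on error, return the same structure filled with NA
    error = function(e) list(lyrics = NA_character_, album = NA_character_)
  )
}

ok  <- get_song("site/song-1")
bad <- get_song("site/404")
```

Both results have the same shape, so downstream code does not need to check whether a variable exists.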
Webscraping from list of URLs in dataframe in R
You can achieve that easily by the following approach:
- Create a function of your scraping part.
- Within this function you try the first XPath; if the result is empty, you try the second XPath.
- You use any form of loop to repeat this task for all URLs. (I used purrr::map_chr, but any loop would do.)
library(rvest)
get_channel <- function(url) {
  ## some elements do not contain any url
  if (!nchar(url)) return(NA_character_)
  page <- url %>%
    read_html()
  ## try to read channel
  channel <- page %>%
    html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
    html_text()
  ## if it's empty we are most likely on an episode page -> try the other xpath
  if (!length(channel)) {
    channel <- page %>%
      html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
      html_text()
  }
  ifelse(length(channel), channel, NA_character_)
}
## loop through all urls in the df
purrr::map_chr(as.character(df$Programme_Synopsis_url), get_channel)
# [1] "BBC Two" "BBC Three" "BBC Three" "BBC Three" NA NA "BBC Three" "BBC Three" "BBC Three" "BBC Two"
To your other questions:
- It could be that the BBC tries to prevent you from scraping their pages. There are some tricks to get around this, like adding delays between consecutive requests. Sometimes websites look at the User-Agent, and you need to change it every n requests so that the website does not block you. Websites protect themselves from web scraping in several ways, and what you need to do depends on the case. Having said that, I do not believe that 44k requests come anywhere close to killing their service, but I am not an expert here. It definitely makes sense to avoid requesting duplicated URLs, and this can easily be achieved by [untested]:
new_df <- df[!duplicated(df$Programme_Synopsis_url), ]
new_df$channel <- purrr::map_chr(as.character(new_df$Programme_Synopsis_url),
get_channel)
dplyr::left_join(df,
new_df[, c("Programme_Synopsis_url", "channel")],
by = "Programme_Synopsis_url")
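The two ideas above (skip duplicated URLs, pause between requests) can be combined in one helper. This is only a sketch: polite_map() and fake_fetch() are made-up names, and the delay value is arbitrary.

```r
# Call `fetch` once per unique URL, pausing between calls, then map results back
polite_map <- function(urls, fetch, delay = 1) {
  unique_urls <- unique(urls)
  results <- character(length(unique_urls))
  for (i in seq_along(unique_urls)) {
    results[i] <- fetch(unique_urls[i])
    if (i < length(unique_urls)) Sys.sleep(delay)  # wait before the next request
  }
  results[match(urls, unique_urls)]  # restore original order, duplicates included
}

# Stand-in fetcher that counts how often it is called (no network access)
calls <- 0
fake_fetch <- function(u) {
  calls <<- calls + 1
  toupper(u)
}

out <- polite_map(c("a", "b", "a", "c", "b"), fake_fetch, delay = 0.1)
```

In real use, fetch would be something like get_channel from above, and delay would be chosen to suit the site being scraped.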
loop across multiple urls in r with rvest
You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() requires one page per call, since it reads web data one URL at a time and expects a scalar value. Consider looping through the site vector with lapply, then binding all the data frames together:
library(rvest)
jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
'&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
'&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
'&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
'&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
'&order_by_asc=&offset=', jump, sep="")
dfList <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, 'table')
  html_table(draft_table)[[1]]  # return the first table on the page
})
finaldf <- do.call(rbind, dfList) # ASSUMING ALL DFs MAINTAIN SAME COLS
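If one of the pages fails to load, the whole lapply call stops. A hedged sketch of making the loop tolerant of failures, using a hypothetical read_page() in place of the read_html()/html_table() steps so it runs offline:

```r
# Hypothetical stand-in for read_html() + html_table(): fails for offset 200
read_page <- function(offset) {
  if (offset == 200) stop("HTTP error 404.")
  data.frame(offset = offset, player = paste0("player_", offset + 1:2))
}

offsets <- seq(0, 300, by = 100)

pages <- lapply(offsets, function(o) {
  tryCatch(read_page(o), error = function(e) NULL)  # NULL marks a failed page
})

# Drop the failed pages, then stack the remaining data frames
combined <- do.call(rbind, Filter(Negate(is.null), pages))
```

As with the original, this still assumes that every successfully read table has the same columns.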
tryCatch function works on most non-existent URLs, but it does not work in (at least) one case
Yes, you need to wrap a tryCatch around the read_html call. This is where R tries to connect to the website, so it will throw an error there (as opposed to returning an empty object) if it fails to connect. You can catch that error and then use next to tell R to skip to the next iteration of the loop.
library(rvest)
##Valid URL, works fine
URL <- "https://news.bbc.co.uk"
read_html(URL)
##Invalid URL, error raised
URL <- "https://news.bbc.co.uk/not_exist"
read_html(URL)
##Leads to error
## Error in open.connection(x, "rb") : HTTP error 404.
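##Invalid URL inside a loop: catch the error, then skip to the next iteration
##next has to be called from the loop body; inside the handler function it fails with "no loop for break/next"
for (URL in c("https://news.bbc.co.uk/not_exist", "https://news.bbc.co.uk")) {
  page <- tryCatch(read_html(URL),
                   error = function(e) {
                     print("URL Not Found, skipping")
                     NULL
                   })
  if (is.null(page)) next
}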
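For a version that can be run without network access, the sketch below uses a made-up fake_read() in place of read_html() and also records which URLs were skipped:

```r
# Hypothetical stand-in for read_html(); fails for URLs containing "not_exist"
fake_read <- function(url) {
  if (grepl("not_exist", url)) stop("HTTP error 404.")
  paste("page at", url)
}

urls <- c("site/a", "site/not_exist", "site/b")
pages <- character(0)
skipped <- character(0)

for (u in urls) {
  page <- tryCatch(fake_read(u), error = function(e) NULL)
  if (is.null(page)) {
    skipped <- c(skipped, u)  # remember the failure for later inspection
    next                      # called from the loop body, so it is valid here
  }
  pages <- c(pages, page)
}
```

Keeping the list of skipped URLs makes it possible to retry them later or report them.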