Using tryCatch and rvest to deal with 404 and other crawling errors

You're looking for try or tryCatch, which are how R handles error catching.

With try, you just need to wrap the thing that might fail in try(); if the call fails, try() returns the error (an object of class "try-error" holding the message) and keeps running:

library(rvest)

sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text()
  )
})

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] "Error in open.connection(x, \"rb\") : HTTP error 404.\n"

However, while try() will get everything, it also inserts the error messages into our results as if they were data. tryCatch skips that clean-up step entirely: you configure what happens when an error is raised by passing it a handler function to run when that condition arises:

sapply(Data$Pages, function(url){
  tryCatch(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text(),
    error = function(e){NA}    # a function that returns NA regardless of what it's passed
  )
})

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] NA

There we go; much better.


Update

In the tidyverse, the purrr package offers two functions, safely and possibly, which work like try and tryCatch. They are adverbs, not verbs, meaning they take a function, modify it so as to handle errors, and return a new function (not a data object) which can then be called. Example:

library(tidyverse)
library(rvest)

df <- Data %>% rowwise() %>%                          # Evaluate each row (URL) separately
  mutate(Pages = as.character(Pages),                 # Convert factors to character for read_html
         title = possibly(~.x %>% read_html() %>%     # Try to take a URL, read it,
                            html_nodes('h1') %>%      # select header nodes,
                            html_text(),              # and collect text inside.
                          NA)(Pages))                 # If error, return NA. Call modified function on URLs.

df %>% select(title)
## Source: local data frame [4 x 1]
## Groups: <by row>
##
## # A tibble: 4 × 1
## title
## <chr>
## 1 'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages
## 2 OMG, this Japanese Trump Commercial is everything
## 3 Omar Mateen posted to Facebook during Orlando mass shooting
## 4 <NA>
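
The example above only uses possibly(); safely() is the same kind of adverb, but instead of a single fallback value it returns a list with $result and $error components, which is handy when you also want to keep the error messages. A quick sketch (safe_read is just an illustrative name; the URLs are test pages only):

safe_read <- safely(read_html)    # safely() returns a new, error-tolerant function

ok  <- safe_read("https://news.bbc.co.uk")
bad <- safe_read("https://news.bbc.co.uk/not_exist")

ok$error     # NULL: the page parsed fine, so the document sits in ok$result
bad$result   # NULL: the request 404ed, so the condition object sits in bad$error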

How do I correctly use tryCatch() and ignore 404 errors in this rvest function?

As mentioned by User @27ϕ9, the tryCatch() was not used correctly: the error handler has to come after the closing brace of the expression block, as a separate argument:

tryCatch({
  # open connection to url
  for_html_code <- read_html(paste_url)
  for_lyrics <- html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[5]")
  for_albums <- html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[11]/div[1]/b")
}, error = function(e){NA}
)
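
An alternative that makes the failure explicit (a sketch under the same assumptions, with paste_url built elsewhere in the surrounding loop): wrap only the read_html() call, let the handler return NULL, and test the result before extracting any nodes:

for_html_code <- tryCatch(read_html(paste_url), error = function(e) NULL)

if (!is.null(for_html_code)) {
  for_lyrics <- html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[5]")
  for_albums <- html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[11]/div[1]/b")
}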


Webscraping from list of URLs in dataframe in R

You can achieve that easily with the following approach:

  1. Create a function for your scraping part.
  2. Within this function, try the first XPath; if the result is empty, try the second XPath.
  3. Use any form of loop to repeat this task for all URLs (I used purrr::map, but any loop would do).

library(rvest)

get_channel <- function(url) {
  ## some elements do not contain any url
  if (!nchar(url)) return(NA_character_)
  page <- url %>%
    read_html()
  ## try to read channel
  channel <- page %>%
    html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
    html_text()
  ## if it's empty we are most likely on an episode page -> try the other xpath
  if (!length(channel)) {
    channel <- page %>%
      html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
      html_text()
  }
  ifelse(length(channel), channel, NA_character_)
}

## loop through all urls in the df

purrr::map_chr(as.character(df$Programme_Synopsis_url), get_channel)
# [1] "BBC Two" "BBC Three" "BBC Three" "BBC Three" NA NA "BBC Three" "BBC Three" "BBC Three" "BBC Two"

To your other questions:

  1. It could be that the BBC tries to prevent you from scraping their pages. There are some tricks to get around this, like adding delays between consecutive requests (see the sketch after this list). Sometimes the website checks the user agent, and you need to change it every n requests so that the website does not block you. There are several ways websites try to protect themselves from web scraping, and what you need to do depends from case to case. Having said that, I do not believe that 44k requests come even close to killing their service, but I am not an expert here.
  2. It definitely makes sense to avoid requesting the duplicated URLs, and this can easily be achieved like this [untested]:

    new_df <- df[!duplicated(df$Programme_Synopsis_url), ]
    new_df$channel <- purrr::map_chr(as.character(new_df$Programme_Synopsis_url),
                                     get_channel)
    dplyr::left_join(df,
                     new_df[, c("Programme_Synopsis_url", "channel")],
                     by = "Programme_Synopsis_url")

Loop across multiple URLs in R with rvest

You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() requires one page per call, since it reads in web data one page at a time and expects a scalar value. Consider looping through the site list with lapply and then binding all the data frames together:

jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep="")

dfList <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, 'table')
  draft <- html_table(draft_table)[[1]]
})

finaldf <- do.call(rbind, dfList) # ASSUMING ALL DFs MAINTAIN SAME COLS
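
If the scraped tables do not all share exactly the same columns, one alternative worth trying in place of the do.call(rbind, ...) line is dplyr::bind_rows(), which matches columns by name and fills any that a page is missing with NA:

finaldf <- dplyr::bind_rows(dfList)  # tolerates differing column sets across pages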

tryCatch function works on most non-existent URLs, but it does not work in (at least) one case

Yes, you need to wrap a tryCatch around the read_html call. That is where R tries to connect to the website, so that is where it will throw an error (as opposed to returning an empty object) if it fails to connect. You can catch that error with a handler that returns NULL, and then use next in the loop body to skip to the next iteration (next cannot be called from inside the handler itself, since the handler is a function).

library(rvest)
##Valid URL, works fine
URL <- "https://news.bbc.co.uk"
read_html(URL)

##Invalid URL, error raised
URL <- "https://news.bbc.co.uk/not_exist"
read_html(URL)
##Leads to error
Error in open.connection(x, "rb") : HTTP error 404.

##Invalid URL, catch the error and skip to the next iteration of the loop
URLs <- c("https://news.bbc.co.uk/not_exist", "https://news.bbc.co.uk")
for (URL in URLs) {
  page <- tryCatch(read_html(URL),
                   error = function(e) {print("URL Not Found, skipping"); NULL})
  if (is.null(page)) next  # `next` must run in the loop body, not inside the handler
  ## ... process `page` here ...
}

