How to Set Timeout in Rvest

Iterating rvest scrape function gives: Error in open.connection(x, rb) : Timeout was reached

With large scraping tasks I would usually do a for-loop, which helps with troubleshooting. Create an empty list for your output:

d <- vector("list", length(links))

The loop below wraps each call in a tryCatch block, so that if the call throws an error we wait two seconds and try again. A counter moves on to the next link if we are still getting an error after five attempts. In addition, the check if (!(links[i] %in% names(d))) means that if we have to interrupt the loop, we can skip the links we have already scraped when we restart it.

for (i in seq_along(links)) {
  if (!(links[i] %in% names(d))) {
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    while (!ok && counter < 5) {   # at most five attempts per link
      counter <- counter + 1
      out <- tryCatch({                  
                  scrape_test(links[i])
                },
                error = function(e) {
                  Sys.sleep(2)
                  e
                }
              )
      if (inherits(out, "error")) {
        cat(".")
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    d[[i]] <- out
    names(d)[i] <- links[i]
  }
}
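
Note that when all five attempts fail, the loop stores the error object itself in d. A minimal sketch of how you might separate failures from results afterwards is shown below. The list d here is a hypothetical stand-in built by hand (the example.com URLs and data frames are made up for illustration), not real scraper output.

```r
# Hypothetical stand-in for the list `d` produced by the loop above:
# two successful results and one stored error object.
d <- list(
  "https://example.com/a" = data.frame(x = 1),
  "https://example.com/b" = simpleError("Timeout was reached"),
  "https://example.com/c" = data.frame(x = 3)
)

# TRUE for every element that holds an error condition instead of a result.
failed <- vapply(d, inherits, logical(1), what = "error")

failed_links <- names(d)[failed]  # links worth retrying later
results <- d[!failed]             # successfully scraped results only
```

Because failed links also get a name in d, the restart check in the loop will skip them; failed_links gives you the set to run again by hand.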

Web Scraping in R Timeout

A User-Agent header is required. The download links are also given in a JSON file. You could regex out all of the links (or indeed parse them out of the JSON); or, as I do here, regex out one link and then substitute the state code within it to get the additional download URLs (the given URLs only vary in this respect).

library(magrittr)
library(httr)
library(stringr)

data  <- httr::GET('https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json', add_headers("User-Agent" = "Mozilla/5.0")) %>% 
         content(as = "text")

ca <- data %>%  stringr::str_match(': "(.*?CA\\.xls)"') %>% .[2] %>% paste0('https://studentaid.gov', .)
ma <- gsub('CA\\.xls', 'MA.xls', ca)  # the replacement is a literal string, so no escaping is needed
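
The same substitution can be applied to several states at once. The sketch below assumes the URLs really do differ only in a two-letter state code before .xls, as described above. The value of ca is a hypothetical placeholder (an example.com URL), not the real link extracted from the JSON, and the states vector is just an illustration.

```r
# Hypothetical placeholder for the `ca` URL extracted above.
ca <- "https://example.com/files/CA.xls"

# Illustrative set of state codes to substitute in.
states <- c("MA", "NY", "TX")

# Swap the trailing "CA.xls" for each state's file name.
urls <- vapply(
  states,
  function(s) sub("CA\\.xls$", paste0(s, ".xls"), ca),
  character(1)
)
```

The result is a character vector named by state code, so urls[["MA"]] gives the Massachusetts link.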

Error: Port 443 Timed Out - Scraping Data

You may need to set an explicit timeout (httr's timeout() takes a number of seconds) on slower connections:

library(httr)
library(rvest)

pg <- GET("https://inciweb.nwcg.gov/", timeout(60))

incidents <- html_table(content(pg))[[1]]

str(incidents)
## 'data.frame': 10 obs. of  7 variables:
##  $ Incident: chr  "Highline Fire" "Cottonwood Fire" "Rattlesnake Point Fire" "Coolwater Complex" ...
##  $ Type    : chr  "Wildfire" "Wildfire" "Wildfire" "Wildfire" ...
##  $ Unit    : chr  "Payette National Forest" "Elko District Office" "Nez Perce - Clearwater National Forests" "Nez Perce - Clearwater National Forests" ...
##  $ State   : chr  "Idaho, USA" "Nevada, USA" "Idaho, USA" "Idaho, USA" ...
##  $ Status  : chr  "Active" "Active" "Active" "Active" ...
##  $ Acres   : chr  "83,630" "1,500" "4,843" "2,969" ...
##  $ Updated : chr  "1 min. ago" "1 min. ago" "3 min. ago" "5 min. ago" ...

Temporary Workaround

l <- charToRaw(paste0(readLines("https://inciweb.nwcg.gov/"), collapse="\n"))

pg <- read_html(l)

html_table(pg)[[1]]

Loop to wait for result or timeout in r

I think you may find this answer about the use of tryCatch useful.

As for the 'keep trying until timeout' part, I imagine you can build on this other answer about a tryCatch loop on error.

Hope it helps.
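
For illustration, here is a minimal sketch of the 'keep trying until timeout' idea, combining tryCatch with a time-based deadline. The helper retry_until and its arguments timeout_secs and wait_secs are hypothetical names made up for this example, not part of any package, and flaky is a made-up function that fails twice before succeeding.

```r
# Hypothetical helper: call f() repeatedly until it succeeds
# or until timeout_secs have elapsed, pausing wait_secs between tries.
retry_until <- function(f, timeout_secs = 30, wait_secs = 1) {
  deadline <- Sys.time() + timeout_secs
  repeat {
    # Turn an error into a value so we can inspect it.
    out <- tryCatch(f(), error = function(e) e)
    if (!inherits(out, "error")) {
      return(out)
    }
    # Give up if another wait would take us past the deadline.
    if (Sys.time() + wait_secs > deadline) {
      stop("Gave up after ", timeout_secs, " seconds: ", conditionMessage(out))
    }
    Sys.sleep(wait_secs)
  }
}

# Made-up example: a function that fails on its first two calls.
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("not ready")
  "ready"
}

result <- retry_until(flaky, timeout_secs = 5, wait_secs = 0.1)
```

In a scraper, f would be something like function() scrape_test(links[i]); the deadline then caps the total time spent on one link rather than the number of attempts.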
