How to Set Timeout in Rvest

Iterating rvest scrape function gives: Error in open.connection(x, rb) : Timeout was reached

With large scraping tasks I would usually do a for-loop, which helps with troubleshooting. Create an empty list for your output:

d <- vector("list", length(links))

Here I use a for-loop with a tryCatch block: if the output is an error, we wait a couple of seconds and try again. A counter moves on to the next link if we're still getting an error after five attempts. In addition, the check if (!(links[i] %in% names(d))) means that if we have to break the loop, we can skip the links we've already scraped when we restart it.

for (i in seq_along(links)) {
  if (!(links[i] %in% names(d))) {
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    while (!ok && counter < 5) {
      counter <- counter + 1
      out <- tryCatch(
        scrape_test(links[i]),
        error = function(e) {
          Sys.sleep(2)
          e
        }
      )
      if (inherits(out, "error")) {
        cat(".")
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    d[[i]] <- out
    names(d)[i] <- links[i]
  }
}
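Once the loop finishes, any link that failed all five attempts still holds a condition object rather than scraped data. A minimal sketch of separating the two afterwards (the stand-in list here is illustrative; in practice d is the list filled by the loop above):

```r
# Stand-in example data: failed links hold condition objects ("error"
# class), successful links hold whatever scrape_test() returned.
d <- list(a = simpleError("boom"), b = 42)

failed  <- vapply(d, inherits, logical(1), what = "error")
results <- d[!failed]        # successfully scraped output
retry   <- names(d)[failed]  # links to try again later
```

This lets you re-run the loop on just the retry links rather than the whole vector.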

Web Scraping in R Timeout

A user-agent header is required. The download links are also given in a JSON file. You could regex the links out (or properly parse them); or, as I do here, regex out one link and then substitute the state code within it to get the other download URL, since the URLs only vary in that respect.

library(magrittr)
library(httr)
library(stringr)

data <- httr::GET('https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json', add_headers("User-Agent" = "Mozilla/5.0")) %>%
content(as = "text")

ca <- data %>%
  stringr::str_match(': "(.*?CA\\.xls)"') %>%
  .[2] %>%
  paste0('https://studentaid.gov', .)

ma <- gsub('CA\\.xls', 'MA.xls', ca)
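The same substitution extends to any number of states. A minimal sketch, assuming (as above) that the URLs differ only in the two-letter state code; the ca value here is a stand-in, since the real one comes out of the JSON:

```r
# Stand-in for the URL extracted above; the real value comes from the JSON.
ca <- "https://studentaid.gov/example/CA.xls"

states <- c("MA", "NY", "TX")
urls <- vapply(states,
               function(s) gsub("CA\\.xls", paste0(s, ".xls"), ca),
               character(1))
```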

Error: Port 443 Timed Out - Scraping Data

You may need to set an explicit timeout on slower connections:

library(httr)
library(rvest)

pg <- GET("https://inciweb.nwcg.gov/", timeout(60))

incidents <- html_table(content(pg))[[1]]

str(incidents)
## 'data.frame': 10 obs. of 7 variables:
## $ Incident: chr "Highline Fire" "Cottonwood Fire" "Rattlesnake Point Fire" "Coolwater Complex" ...
## $ Type : chr "Wildfire" "Wildfire" "Wildfire" "Wildfire" ...
## $ Unit : chr "Payette National Forest" "Elko District Office" "Nez Perce - Clearwater National Forests" "Nez Perce - Clearwater National Forests" ...
## $ State : chr "Idaho, USA" "Nevada, USA" "Idaho, USA" "Idaho, USA" ...
## $ Status : chr "Active" "Active" "Active" "Active" ...
## $ Acres : chr "83,630" "1,500" "4,843" "2,969" ...
## $ Updated : chr "1 min. ago" "1 min. ago" "3 min. ago" "5 min. ago" ...

Temporary Workaround

l <- charToRaw(paste0(readLines("https://inciweb.nwcg.gov/"), collapse="\n"))

pg <- read_html(l)

html_table(pg)[[1]]
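If the readLines() workaround itself times out, note that base R has its own connection timeout, separate from httr's: the "timeout" option (default 60 seconds) governs url(), readLines() on URLs, and download.file(). A short sketch of raising it:

```r
# Raise base R's internet-operation timeout (in seconds); the default is 60.
old <- options(timeout = 120)

# ... readLines("https://inciweb.nwcg.gov/") etc. ...

options(old)  # restore the previous setting
```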

Loop to Wait for Result or Timeout in R

I think you may find this answer about the use of tryCatch useful.

Regarding the "keep trying until timeout" part, you can build on this other answer about a tryCatch loop on error.

Hope it helps.
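Combining those two ideas, a minimal sketch of "keep trying until timeout" (not from the linked answers); try_once stands for any function of yours that either returns a result or throws an error:

```r
# Retry try_once() until it succeeds or the deadline passes.
try_until <- function(try_once, timeout = 30, wait = 2) {
  deadline <- Sys.time() + timeout
  repeat {
    out <- tryCatch(try_once(), error = function(e) e)
    if (!inherits(out, "error")) return(out)            # success
    if (Sys.time() >= deadline) stop("timed out without a result")
    Sys.sleep(wait)                                     # pause, then retry
  }
}
```

You could wrap a scrape call in this directly, e.g. try_until(function() scrape_test(links[i])).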


