Iterating rvest scrape function gives: Error in open.connection(x, rb) : Timeout was reached
With large scraping tasks I would usually do a for-loop, which helps with troubleshooting. Create an empty list for your output:
d <- vector("list", length(links))
Here I use a for-loop with a tryCatch block so that if the output is an error, we wait a couple of seconds and try again. We also include a counter that moves on to the next link if we're still getting an error after five attempts. In addition, we have if (!(links[i] %in% names(d))) so that if we have to break the loop, we can skip the links we've already scraped when we restart the loop.
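for (i in seq_along(links)) {
  # Skip links already stored in d, so the loop can be restarted after an interruption
  if (!(links[i] %in% names(d))) {
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    # Retry until the scrape succeeds or five attempts have been made
    while (!ok && counter < 5) {
      counter <- counter + 1
      out <- tryCatch({
        scrape_test(links[i])
      },
      error = function(e) {
        Sys.sleep(2)  # wait a couple of seconds before the next attempt
        e             # return the error object so it can be detected below
      }
      )
      if (inherits(out, "error")) {
        cat(".")
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    # Store the result (or the last error object) under the link's name
    d[[i]] <- out
    names(d)[i] <- links[i]
  }
}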
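Because the loop stores the last error object when every attempt fails, it can help to check afterwards which entries are errors rather than scraped data. A minimal sketch, assuming a small made-up list d with placeholder example.com links standing in for the real output:

```r
# Hypothetical stand-in for the list `d` built by the loop above
d <- list(
  "https://example.com/a" = data.frame(x = 1:2),
  "https://example.com/b" = simpleError("Timeout was reached")
)

# TRUE for entries that hold an error object instead of scraped data
failed <- vapply(d, inherits, logical(1), what = "error")

failed_links <- names(d)[failed]  # links worth retrying later
results <- d[!failed]             # successfully scraped entries only
```

Note that failed links are also named in d, so the restart check would skip them; one option is to drop the failed entries from names(d) before rerunning.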
Web Scraping in R Timeout
A User-Agent header is required. The download links are also listed in a JSON file. You could extract the links with a regex (or parse the JSON properly); or, as I do here, extract one link with a regex and then substitute the state code in it to get the other download URLs, since the given URLs differ only in that respect.
library(magrittr)
library(httr)
library(stringr)
data <- httr::GET('https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json', add_headers("User-Agent" = "Mozilla/5.0")) %>%
  content(as = "text")
ca <- data %>% stringr::str_match(': "(.*?CA\\.xls)"') %>% .[2] %>% paste0('https://studentaid.gov', .)
ma <- gsub('CA\\.xls', 'MA.xls', ca)
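The same substitution can be applied to several state codes at once. This is only a sketch: the value of ca below is a made-up placeholder (the real one comes from the JSON response), and the states vector is an arbitrary example.

```r
# Placeholder URL for illustration only; the real value is extracted from the JSON
ca <- "https://studentaid.gov/example/path/report_CA.xls"

# Arbitrary example state codes
states <- c("MA", "NY", "TX")

# Swap the state code in the file name; fixed = TRUE treats the pattern literally
urls <- vapply(
  states,
  function(s) sub("CA.xls", paste0(s, ".xls"), ca, fixed = TRUE),
  character(1)
)
```

The result is a character vector named by state code, so urls[["MA"]] gives the Massachusetts link.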
Error: Port 443 Timed Out - Scraping Data
You may need to set an explicit timeout on slower connections:
library(httr)
library(rvest)
pg <- GET("https://inciweb.nwcg.gov/", timeout(60))
incidents <- html_table(content(pg))[[1]]
str(incidents)
## 'data.frame': 10 obs. of 7 variables:
## $ Incident: chr "Highline Fire" "Cottonwood Fire" "Rattlesnake Point Fire" "Coolwater Complex" ...
## $ Type : chr "Wildfire" "Wildfire" "Wildfire" "Wildfire" ...
## $ Unit : chr "Payette National Forest" "Elko District Office" "Nez Perce - Clearwater National Forests" "Nez Perce - Clearwater National Forests" ...
## $ State : chr "Idaho, USA" "Nevada, USA" "Idaho, USA" "Idaho, USA" ...
## $ Status : chr "Active" "Active" "Active" "Active" ...
## $ Acres : chr "83,630" "1,500" "4,843" "2,969" ...
## $ Updated : chr "1 min. ago" "1 min. ago" "3 min. ago" "5 min. ago" ...
Temporary Workaround
l <- charToRaw(paste0(readLines("https://inciweb.nwcg.gov/"), collapse="\n"))
pg <- read_html(l)
html_table(pg)[[1]]
Loop to wait for result or timeout in r
I think you may find this answer about the use of tryCatch useful.
As for the 'keep trying until timeout' part, I imagine you can build on this other answer about a tryCatch loop on error.
Hope it helps.
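As a rough illustration of the 'keep trying until timeout' idea, here is a minimal base R sketch. The helper retry_until and the toy function flaky are hypothetical names made up for this example, and the timeout and wait values are arbitrary.

```r
# Hypothetical helper: call f() repeatedly until it succeeds or `timeout` seconds pass
retry_until <- function(f, timeout = 10, wait = 1) {
  deadline <- Sys.time() + timeout
  repeat {
    out <- tryCatch(f(), error = function(e) e)
    if (!inherits(out, "error")) return(out)     # success: hand back the result
    if (Sys.time() + wait > deadline) stop(out)  # out of time: re-raise the last error
    Sys.sleep(wait)                              # pause before the next attempt
  }
}

# Toy function that fails twice before succeeding, standing in for a real request
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("not ready yet")
  "ready"
}

result <- retry_until(flaky, timeout = 5, wait = 0.1)
```

In real use, f would wrap the actual request, and the final error is re-raised so the caller can decide what to do once the deadline passes.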