Iterating rvest scrape function gives: Error in open.connection(x, "rb") : Timeout was reached
For large scraping tasks I would usually use a for loop, which helps with troubleshooting. Create an empty list for your output:
d <- vector("list", length(links))
Here I use a for loop with a tryCatch block, so that if the output is an error we wait a couple of seconds and try again. A counter moves on to the next link if we are still getting an error after several attempts. The check if (!(links[i] %in% names(d))) means that if we have to break the loop, we can skip the links we've already scraped when we restart it.
for (i in seq_along(links)) {
  if (!(links[i] %in% names(d))) {
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    while (!ok && counter <= 5) {
      counter <- counter + 1
      out <- tryCatch(
        scrape_test(links[i]),
        error = function(e) {
          Sys.sleep(2)
          e
        }
      )
      if (inherits(out, "error")) {
        cat(".")
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    d[[i]] <- out
    names(d)[i] <- links[i]
  }
}
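The loop above assumes a scrape_test() function that fetches and parses one link. The original doesn't define it, so here is a minimal hypothetical sketch — the "title" selector is a placeholder to swap for whatever your task actually extracts:

```r
library(rvest)

# Hypothetical one-URL scraper: parse the page and return its <title> text.
# Any error (timeout, HTTP failure) propagates up to the tryCatch in the loop.
scrape_test <- function(url) {
  page <- read_html(url)
  html_text(html_node(page, "title"))
}
```

Because read_html() also accepts a literal HTML string, you can sanity-check the function without a network connection.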
Error in open.connection(x, "rb") : HTTP error 500 when using map_df
You can use purrr::possibly:
library(tidyverse)
library(rvest)

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

noticias_silla <- map_df(0:573, possibly(~{
  pagina <- read_html(sprintf(url_silla, .x))
  print(.x)
  data.frame(titles = html_text(html_nodes(pagina, ".col-sm-12 h3")),
             date = html_text(html_nodes(pagina, ".date.col-sm-3")),
             category = html_text(html_nodes(pagina, ".category.col-sm-9")),
             tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
             link = paste0("https://www.lasillavacia.com",
                           str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
             stringsAsFactors = FALSE)
}, NULL))
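If possibly() is new to you, its behavior is easy to see in isolation: it wraps a function so that an error returns a fallback value (NULL above, NA here) instead of aborting the whole iteration.

```r
library(purrr)

# Wrap log() so invalid input yields NA instead of stopping with an error.
safe_log <- possibly(log, otherwise = NA_real_)

safe_log(10)   # 2.302585
safe_log("a")  # NA -- the error is swallowed
```

With map_df, pages that return NULL are simply dropped from the bound result, which is why the 500 errors stop killing the run.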
Error in open.connection(x, "rb") : HTTP error 404 in R
TL;DR
Add a slash:
details_url <- paste0("https://www.sgcarmart.com/new_cars/",html_attr(h,"href"))
# ---> ---> ---> ---> ^
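To see the difference the slash makes, compare the two concatenations directly (using the CarCode from this page as the example):

```r
base <- "https://www.sgcarmart.com/new_cars"
href <- "newcars_overview.php?CarCode=12618"

paste0(base, href)       # ".../new_carsnewcars_overview.php..."  -> 404
paste0(base, "/", href)  # ".../new_cars/newcars_overview.php..." -> resolves
```

paste0() joins strings with no separator at all, so any path separator has to be supplied explicitly.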
The Journey
I ran your source far enough to get popularcars, and looked at the first one:

h <- popularcars[[1]]
h
# {html_node}
# <a href="newcars_overview.php?CarCode=12618" class="link">
# [1] <div style="position:relative; padding-bottom:6px;">\r\n < ...
# [2] <div style="padding-bottom:3px;" class="limittwolines">Toyota Corolla Altis</div>
details_url <- paste0("https://www.sgcarmart.com/new_cars",html_attr(h,"href"))
details_url
# [1] "https://www.sgcarmart.com/new_carsnewcars_overview.php?CarCode=12618"

Like you, for me that URL returned 404.
I navigated (in a boring, normal browser) to the main URL, looked at the page source, and searched for 12618:

<div style="padding:10px 10px 5px 10px;" id="nc_popular_car">
<div class="floatleft" style="text-align:center;width:136px;padding-right:22px;">
<a href="newcars_overview.php?CarCode=12618" class="link">
<div style="position:relative; padding-bottom:6px;">
<div style="position:absolute; border:1px solid #B9B9B9; width:134px; height:88px;"><img src="https://i.i-sgcm.com/images/spacer.gif" width="1" height="1" alt="spacer" /></div>
<img src="https://i.i-sgcm.com/new_cars/cars/12618/12618_m.jpg" width="136" height="90" border="0" alt="Toyota Corolla Altis" />
</div>
<div style="padding-bottom:3px;" class="limittwolines">Toyota Corolla Altis</div>
</a>
<div style="padding-bottom:14px;" class="font_black">$91,888</div>
</div>

I right-clicked on the <a href="newcars_overview.php?CarCode=12618" class="link"> portion and copied the "link location". I found that it was:

https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12618 <-- from the source
https://www.sgcarmart.com/new_carsnewcars_overview.php?CarCode=12618 <-- from your code
BTW, you might find this a little easier to manage than a for loop. Iteratively building a frame is horribly inefficient, and while it may not matter for the 18 entries I found, it's not good in the long run (avoid it if you can).
info <- lapply(popularcars, function(h) {
  details_url <- paste0("https://www.sgcarmart.com/new_cars/", html_attr(h, "href"))
  details <- read_html(details_url)
  html_text(html_node(details, ".link_redbanner"))
})
str(info)
# List of 18
# $ : chr "Toyota Corolla Altis"
# $ : chr "Hyundai Venue"
# $ : chr "Hyundai Avante"
# $ : chr "SKODA Octavia"
# $ : chr "Honda Civic"
# $ : chr "Mazda 3 Sedan Mild Hybrid"
# $ : chr "Honda Jazz"
# $ : chr "Kia Cerato"
# $ : chr "Mazda CX-5"
# $ : chr "Mercedes-Benz GLA-Class(Parallel Imported)"
# $ : chr "Toyota Raize(Parallel Imported)"
# $ : chr "Toyota Camry Hybrid(Parallel Imported)"
# $ : chr "Mercedes-Benz A-Class Hatchback(Parallel Imported)"
# $ : chr "Mercedes-Benz A-Class Saloon(Parallel Imported)"
# $ : chr "Honda Fit(Parallel Imported)"
# $ : chr "Mercedes-Benz C-Class Saloon(Parallel Imported)"
# $ : chr "Mercedes-Benz CLA-Class(Parallel Imported)"
# $ : chr "Honda Freed Hybrid(Parallel Imported)"
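If you eventually scrape several fields per car rather than just the name, the same lapply pattern returns a list of one-row data frames, which collapses into a single frame in one step with do.call(rbind, ...) — avoiding the grow-a-frame-in-a-loop antipattern entirely. A small self-contained sketch with toy data:

```r
# Each iteration returns one row; bind them all at once at the end.
rows <- lapply(1:3, function(i) {
  data.frame(id = i, sq = i^2, stringsAsFactors = FALSE)
})
out <- do.call(rbind, rows)
out
#   id sq
# 1  1  1
# 2  2  4
# 3  3  9
```

This does one allocation for the final frame, instead of copying the accumulated frame on every rbind inside a loop.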
Last point: while this is a worthwhile learning endeavor, that website's Terms of Service clearly state: "You agree that you will not: ... engage in mass automated, systematic or any form of extraction of the material ("the Content") on our Website". I assume your efforts are under that limit.