rvest Error in open.connection(x, "rb"): Timeout was reached

Iterating an rvest scrape function gives: Error in open.connection(x, "rb") : Timeout was reached

For large scraping tasks I would usually use a for-loop, which helps with troubleshooting. Start by creating an empty list for your output:

d <- vector("list", length(links))
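
scrape_test() below is the asker's own function, whose body wasn't shown. As a stand-in so the loop can be run end-to-end, something like the following would do; the httr timeout and the "title" selector are illustrative assumptions, not the original code:

# Hypothetical stand-in for the asker's scrape_test(). httr::timeout()
# raises the per-request limit, which is relevant to the timeout error above.
scrape_test <- function(url) {
  resp <- httr::GET(url, httr::timeout(30))          # allow up to 30 seconds
  page <- xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
  data.frame(url   = url,
             title = rvest::html_text(rvest::html_node(page, "title")),
             stringsAsFactors = FALSE)
}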

Then run a for-loop with a tryCatch block: if the call errors, we wait a couple of seconds and try again. A counter moves us on to the next link if we're still getting an error after five attempts. The check if (!(links[i] %in% names(d))) means that if we have to break the loop, we can skip the links we've already scraped when we restart it.

for (i in seq_along(links)) {
  if (!(links[i] %in% names(d))) {   # skip links scraped on a previous run
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    while (!ok && counter < 5) {     # give up after five attempts
      counter <- counter + 1
      out <- tryCatch(
        scrape_test(links[i]),
        error = function(e) {
          Sys.sleep(2)               # wait a couple of seconds before retrying
          e
        }
      )
      if (inherits(out, "error")) {
        cat(".")                     # one dot per failed attempt
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    d[[i]] <- out                    # store the result (or the last error)
    names(d)[i] <- links[i]
  }
}
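
Once the loop finishes, you can separate the failures from the successes. A short sketch, assuming scrape_test() returns one data frame per link:

failed <- vapply(d, inherits, logical(1), what = "error")
results <- dplyr::bind_rows(d[!failed])  # stack the successful scrapes
links[failed]                            # links that never succeeded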

Error in open.connection(x, "rb") : HTTP error 500 when using map_df

You can use purrr::possibly(): it wraps a function so that instead of throwing an error it returns a default value (here NULL), and map_df() simply drops the NULL results:

library(tidyverse)
library(rvest)

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

noticias_silla <- map_df(0:573, possibly(~{

  pagina <- read_html(sprintf(url_silla, .x))

  print(.x)  # progress marker so you can see which page is being fetched

  data.frame(titles   = html_text(html_nodes(pagina, ".col-sm-12 h3")),
             date     = html_text(html_nodes(pagina, ".date.col-sm-3")),
             category = html_text(html_nodes(pagina, ".category.col-sm-9")),
             tags     = html_text(html_nodes(pagina, ".tags.col-sm-12")),
             link     = paste0("https://www.lasillavacia.com",
                               str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
             stringsAsFactors = FALSE)

}, NULL))
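
A related helper worth knowing is purrr::insistently(), which retries a function on failure with a backoff schedule; that suits transient errors (like the timeout in the first question) better than silently skipping. A minimal sketch, assuming five attempts with exponential backoff is acceptable:

library(purrr)

# Wrap read_html so it retries up to 5 times, pausing roughly 2s, 4s, 8s...
# between attempts before giving up with the original error.
read_html_retry <- insistently(rvest::read_html,
                               rate = rate_backoff(pause_base = 2, max_times = 5),
                               quiet = FALSE)

Inside the mapped function you could then call read_html_retry() in place of read_html(), keeping possibly() as the outer safety net for pages that fail even after retries.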

Error in open.connection(x, "rb") : HTTP error 404 in R


TL;DR

Add a slash:

  details_url <- paste0("https://www.sgcarmart.com/new_cars/", html_attr(h, "href"))
# ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> --->   ^

The Journey

  1. I ran your source far enough to get the popularcars, and looked at the first one:

    h <- popularcars[[1]]
    h
    # {html_node}
    # <a href="newcars_overview.php?CarCode=12618" class="link">
    # [1] <div style="position:relative; padding-bottom:6px;">\r\n < ...
    # [2] <div style="padding-bottom:3px;" class="limittwolines">Toyota Corolla Altis</div>
    details_url <- paste0("https://www.sgcarmart.com/new_cars",html_attr(h,"href"))
    details_url
    # [1] "https://www.sgcarmart.com/new_carsnewcars_overview.php?CarCode=12618"

    Like you, for me that URL returned 404.

  2. I navigated (in a boring, normal browser) to the main URL, looked at the page source, and searched for 12618:

    <div style="padding:10px 10px 5px 10px;" id="nc_popular_car">
    <div class="floatleft" style="text-align:center;width:136px;padding-right:22px;">
    <a href="newcars_overview.php?CarCode=12618" class="link">
    <div style="position:relative; padding-bottom:6px;">
    <div style="position:absolute; border:1px solid #B9B9B9; width:134px; height:88px;"><img src="https://i.i-sgcm.com/images/spacer.gif" width="1" height="1" alt="spacer" /></div>
    <img src="https://i.i-sgcm.com/new_cars/cars/12618/12618_m.jpg" width="136" height="90" border="0" alt="Toyota Corolla Altis" />
    </div>
    <div style="padding-bottom:3px;" class="limittwolines">Toyota Corolla Altis</div>
    </a>
    <div style="padding-bottom:14px;" class="font_black">$91,888</div>
    </div>
  3. I right-clicked on the <a href="newcars_overview.php?CarCode=12618" class="link"> portion and copied the link location. I found that it was:

    https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12618 <-- from the source
    https://www.sgcarmart.com/new_carsnewcars_overview.php?CarCode=12618 <-- from your code
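
Rather than hand-concatenating, you can also let xml2 resolve the relative href against the page's base URL, which makes the missing-slash mistake impossible; a small sketch using xml2::url_absolute():

library(xml2)
# Resolve the relative href against the directory the page lives in.
details_url <- url_absolute(html_attr(h, "href"), "https://www.sgcarmart.com/new_cars/")
details_url
# [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12618"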

BTW, you might find the lapply below a little easier to manage than a for loop. Iteratively building a frame is horribly inefficient; while it may not be bad for the 18 entries I found, it's not good in the long run (avoid it if you can).

info <- lapply(popularcars, function(h) {
  details_url <- paste0("https://www.sgcarmart.com/new_cars/", html_attr(h, "href"))
  details <- read_html(details_url)
  html_text(html_node(details, ".link_redbanner"))
})

str(info)
# List of 18
# $ : chr "Toyota Corolla Altis"
# $ : chr "Hyundai Venue"
# $ : chr "Hyundai Avante"
# $ : chr "SKODA Octavia"
# $ : chr "Honda Civic"
# $ : chr "Mazda 3 Sedan Mild Hybrid"
# $ : chr "Honda Jazz"
# $ : chr "Kia Cerato"
# $ : chr "Mazda CX-5"
# $ : chr "Mercedes-Benz GLA-Class(Parallel Imported)"
# $ : chr "Toyota Raize(Parallel Imported)"
# $ : chr "Toyota Camry Hybrid(Parallel Imported)"
# $ : chr "Mercedes-Benz A-Class Hatchback(Parallel Imported)"
# $ : chr "Mercedes-Benz A-Class Saloon(Parallel Imported)"
# $ : chr "Honda Fit(Parallel Imported)"
# $ : chr "Mercedes-Benz C-Class Saloon(Parallel Imported)"
# $ : chr "Mercedes-Benz CLA-Class(Parallel Imported)"
# $ : chr "Honda Freed Hybrid(Parallel Imported)"

Last point: while this is a worthwhile learning exercise, that website's Terms of Service clearly state: "You agree that you will not: ... engage in mass automated, systematic or any form of extraction of the material ("the Content") on our Website". I assume your efforts stay under that limit.


