RCurl: url.exists Returns FALSE When URL Does Exist

Why does url.exists return FALSE when the URL does exist, using RCurl?

Thanks a lot for the help, everyone. I think I just figured out how to do it. The important thing is the proxy. If I use:

> opts <- list(
+     proxy = "http://*******",
+     proxyusername = "*****",
+     proxypassword = "*****",
+     proxyport = 8080
+ )
> url.exists("http://www.google.com", .opts = opts)
[1] TRUE

Then all done! You can find your proxy settings under Settings --> Network & Internet --> Proxy if you use Windows 10. At the same time:

> site <- getForm("http://www.google.com.au", hl = "en",
+                 lr = "", q = "r-project", btnG = "Search", .opts = opts)
> htmlTreeParse(site)
$file
[1] "<buffer>"
.........

The .opts argument needs to be passed to getForm as well. There are two other posts here (RCurl default proxy settings and Proxy setting for R) answering the same question. I have not yet tried extracting information from the result.
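
As an aside, if you are on Windows, the curl package ships helpers that can read the system (Internet Explorer) proxy settings for you, which may save digging through the settings dialog. A minimal sketch, assuming a Windows machine with a configured proxy:

# Windows-only helpers from the curl package:
# ie_proxy_info() reports the system-wide proxy configuration,
# ie_get_proxy_for_url() resolves the proxy to use for a specific URL.
curl::ie_proxy_info()
curl::ie_get_proxy_for_url("http://www.google.com")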

RCurl: url.exists returns FALSE when URL does exist

That web server appears to return a 403 Forbidden error when your HTTP request does not include a user-agent string. RCurl does not pass a user-agent by default. You can set one with the useragent= parameter.

myurl<-"http://www.transfermarkt.es/liga-mx-apertura/startseite/wettbewerb/MEXA"
url.exists(myurl, useragent="curl/7.39.0 Rcurl/1.95.4.5")
# [1] TRUE
htmlTreeParse(getURL(myurl, useragent="curl/7.39.0 Rcurl/1.95.4.5"))

In my opinion, the httr package is a bit nicer than RCurl for making HTTP requests (and it sets a user-agent string by default). Here's the corresponding code:

library(httr)
GET(myurl)
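
To turn this into an existence check comparable to url.exists, inspect the status code of the response; a minimal sketch using httr's documented helpers:

res <- GET(myurl)
status_code(res)   # e.g. 200 on success
!http_error(res)   # TRUE when the status code is below 400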

Checking if URLs exist in R

Both solutions are based on libcurl.
The default user agent of httr includes the versions of libcurl, the r-curl package, and httr itself.
You can check it with verbose mode:

> httr::HEAD(some_urls[1], httr::verbose())
-> HEAD /us/therapists/new-york/a?page=10 HTTP/2
-> Host: www.psychologytoday.com
-> user-agent: libcurl/7.68.0 r-curl/4.3.2 httr/1.4.3 <<<<<<<<< Here is the problem. I think the site disallows web scraping; you should check the related robots.txt file(s).
-> accept-encoding: deflate, gzip, br
-> cookie: summary_id=62e1a40279e4c
-> accept: application/json, text/xml, application/xml, */*
->
<- HTTP/2 403
<- date: Wed, 27 Jul 2022 20:56:28 GMT
<- content-type: text/html; charset=iso-8859-1
<- server: Apache/2.4.53 (Amazon)
<-
Response [https://www.psychologytoday.com/us/therapists/new-york/a?page=10]
Date: 2022-07-27 20:56
Status: 403
Content-Type: text/html; charset=iso-8859-1
<EMPTY BODY>

You can set the user-agent header per function call. I do not know of a global-option way in this case (though see the note after the output below):

> user_agent <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
> httr::HEAD(some_urls[1], user_agent, httr::verbose())

-> HEAD /us/therapists/new-york/a?page=10 HTTP/2
-> Host: www.psychologytoday.com
-> user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36
-> accept-encoding: deflate, gzip, br
-> cookie: summary_id=62e1a40279e4c
-> accept: application/json, text/xml, application/xml, */*
->
<- HTTP/2 200
<- date: Wed, 27 Jul 2022 21:01:07 GMT
<- content-type: text/html; charset=utf-8
<- server: Apache/2.4.54 (Amazon)
<- x-powered-by: PHP/7.0.33
<- content-language: en-US
<- x-frame-options: SAMEORIGIN
<- expires: Wed, 27 Jul 2022 22:01:07 GMT
<- cache-control: private, max-age=3600
<- last-modified: Wed, 27 Jul 2022 21:01:07 GMT
<- set-cookie: search-language=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly

NOTE: a bunch of set-cookie headers were deleted here

<- set-cookie: search-language=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly
<- via: 1.1 ZZ
<-
Response [https://www.psychologytoday.com/us/therapists/new-york/a?page=10]
Date: 2022-07-27 21:01
Status: 200
Content-Type: text/html; charset=utf-8
<EMPTY BODY>
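
For a global alternative, httr does provide set_config(): per the httr documentation, it applies a setting such as user_agent() to every subsequent request, and reset_config() clears it again. A minimal sketch:

> httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
> # all httr requests from here on will send this user-agent string
> httr::reset_config()  # back to the default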

NOTE: I did not investigate RCurl's url.exists. You need to ensure, one way or another, that it uses the right user-agent string.

In a nutshell, with no verbose output:

> user_agent <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
> # status %/% 200 == 1 is TRUE for any status code in the 200-399 range
> (httr::status_code(httr::HEAD(some_urls[1], user_agent)) %/% 200) == 1
[1] TRUE

I think you can write your own solution from here.
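
For example, here is a minimal sketch of such a solution; the helper name url_ok and the five-second timeout are my own choices, and it reuses the user_agent object defined above:

url_ok <- function(url) {
  # NULL on connection errors, otherwise the HEAD response
  res <- tryCatch(httr::HEAD(url, user_agent, httr::timeout(5)),
                  error = function(e) NULL)
  # TRUE only for status codes in the 200-399 range
  !is.null(res) && (httr::status_code(res) %/% 200) == 1
}

vapply(some_urls, url_ok, logical(1))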

Recognize forwarding when checking if URL exists

That's because this server returns 200 OK when you send a HEAD request (which is what url.exists() and http_error() do). When you send a GET request, you receive the 404 Not Found.

So you can do

httr::http_error(httr::GET(url))
#> TRUE

Even better, you can save the result of the GET request and process its content. This way you only need one request in any case: if there is an error you skip it, otherwise you process the result (e.g. with xml2 or whatever you use), as sketched below.
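
A minimal sketch of that pattern, assuming url holds a single URL and you want a parsed xml2 document back (NULL when the request failed):

library(httr)
library(xml2)

res <- GET(url)
if (http_error(res)) {
  doc <- NULL  # skip: the server reported an error status
} else {
  # parse the body once; read_html() is from xml2
  doc <- read_html(content(res, as = "text", encoding = "UTF-8"))
}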

Check if URL exists in R

pingr::ping() only uses ICMP, which is blocked on sane organizational networks since attackers have used ICMP as a way to exfiltrate data and communicate with command-and-control servers.

pingr::ping_port() doesn't use the HTTP Host: header, so the IP address may be responding while the target virtual web host may not be running on it, and it definitely doesn't validate that the path exists at the target URL.

You should clarify what you want to happen when there are only non-200:299 range HTTP status codes. The following makes an assumption.

NOTE: You used Amazon as an example and I'm hoping that's the first site that just "came to mind" since it's unethical and a crime to scrape Amazon and I would appreciate my code not being brought into your universe if you are in fact just a brazen content thief. If you are stealing content, it's unlikely you'd be up front here about that, but on the outside chance you are both stealing and have a conscience, please let me know so I can delete this answer so at least other content thieves can't use it.

Here's a self-contained function for checking URLs:

#' @param x a single URL
#' @param non_2xx_return_value what to do if the site exists but the
#'        HTTP status code is not in the `2xx` range. Default is to return `FALSE`.
#' @param quiet if not `FALSE`, then every time the `non_2xx_return_value` condition
#'        arises a warning message will be displayed. Default is `FALSE`.
#' @param ... other params (`timeout()` would be a good one) passed directly
#'        to `httr::HEAD()` and/or `httr::GET()`
url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE, ...) {

  suppressPackageStartupMessages({
    require("httr", quietly = FALSE, warn.conflicts = FALSE)
  })

  # you don't need these two functions if you're already using `purrr`,
  # but `purrr` is a heavyweight compiled package that introduces
  # many other "tidyverse" dependencies and this doesn't.

  capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
    tryCatch(
      list(result = code, error = NULL),
      error = function(e) {
        if (!quiet) message("Error: ", e$message)
        list(result = otherwise, error = e)
      },
      interrupt = function(e) {
        stop("Terminated by user", call. = FALSE)
      }
    )
  }

  safely <- function(.f, otherwise = NULL, quiet = TRUE) {
    function(...) capture_error(.f(...), otherwise, quiet)
  }

  sHEAD <- safely(httr::HEAD)
  sGET <- safely(httr::GET)

  # Try HEAD first since it's lightweight
  res <- sHEAD(x, ...)

  # `status %/% 200 == 1` is TRUE for any status code in 200-399
  if (is.null(res$result) ||
      ((httr::status_code(res$result) %/% 200) != 1)) {

    res <- sGET(x, ...)

    if (is.null(res$result)) return(NA) # or whatever you want to return on "hard" errors

    if (((httr::status_code(res$result) %/% 200) != 1)) {
      if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
      return(non_2xx_return_value)
    }

    return(TRUE)

  } else {
    return(TRUE)
  }

}

Give it a go:

c(
  "http://content.thief/",
  "http://rud.is/this/path/does/not_exist",
  "https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=content+theft",
  "https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t+be+a+content+thief&btnK=Google+Search&oq=don%27t+be+a+content+thief&gs_l=psy-ab.3...934.6243..7114...2.0..0.134.2747.26j6....2..0....1..gws-wiz.....0..0j35i39j0i131j0i20i264j0i131i20i264j0i22i30j0i22i10i30j33i22i29i30j33i160.mY7wCTYy-v0",
  "https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-with-example-usage-in-r/"
) -> some_urls

library(dplyr)

data.frame(
  exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
  some_urls,
  stringsAsFactors = FALSE
) %>% as_tibble() %>% print()
## A tibble: 5 x 2
## exists some_urls
## <lgl> <chr>
## 1 NA http://content.thief/
## 2 FALSE http://rud.is/this/path/does/not_exist
## 3 TRUE https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=con…
## 4 TRUE https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t…
## 5 TRUE https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-wi…
## Warning message:
## In FUN(X[[i]], ...) :
## Requests for [http://rud.is/this/path/does/not_exist] responded but without an HTTP status code in the 200-299 range

Does this URL exist? RCurl says no

My guess would be that url.exists is using an HTTP HEAD request, which the server seems unable to handle:

$ telnet patft.uspto.gov 80
Trying 151.207.240.26...
Connected to patft.uspto.gov.
Escape character is '^]'.
HEAD /netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=/netahtml/PTO/search-adv.htm&r=10&f=G&l=50&d=PTXT&OS=AN/(nortel)&RS=AN/nortel&Query=AN/(nortel)&Srch1=nortel.ASNM.&NextList1=Next+50+Hits HTTP/1.1
Host: patft.uspto.gov
Connection: close

Connection closed by foreign host.

So it's the server that's broken, not RCurl.
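
If you need a check that works against such a server anyway, one workaround is to skip HEAD entirely and issue a GET; a sketch (the helper name exists_via_get and the ten-second timeout are my own):

exists_via_get <- function(url) {
  # NULL on connection failures, otherwise the GET response
  res <- tryCatch(httr::GET(url, httr::timeout(10)), error = function(e) NULL)
  !is.null(res) && !httr::http_error(res)
}

exists_via_get("http://patft.uspto.gov/")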


