Ping a Website in R

We can use a system2 call to get the return status of the ping command in the shell. On Windows (and probably Linux) the following will work:

ping <- function(x, stderr = FALSE, stdout = FALSE, ...) {
  # pass the stderr/stdout arguments through instead of hard-coding FALSE
  pingvec <- system2("ping", x,
                     stderr = stderr,
                     stdout = stdout, ...)
  pingvec == 0
}

# example
> ping("google.com")
[1] FALSE
> ping("ugent.be")
[1] TRUE
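
One caveat: on Linux, ping keeps sending packets until it is interrupted unless you cap the count, so the call above can hang there. Since x is handed to the binary as its arguments, you can pass the count flag yourself; a sketch (the flag is platform-specific):

# -c caps the packet count on Linux/macOS; Windows uses -n instead
ping(c("-c", "1", "google.com"))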

If you want to capture the output of ping, you can either set stdout = "", or use a system call:

> X <- system("ping ugent.be", intern = TRUE)
> X
[1] "" "Pinging ugent.be [157.193.43.50] with 32 bytes of data:"
[3] "Reply from 157.193.43.50: bytes=32 time<1ms TTL=62" "Reply from 157.193.43.50: bytes=32 time<1ms TTL=62"
[5] "Reply from 157.193.43.50: bytes=32 time<1ms TTL=62" "Reply from 157.193.43.50: bytes=32 time<1ms TTL=62"
[7] "" "Ping statistics for 157.193.43.50:"
[9] " Packets: Sent = 4, Received = 4, Lost = 0 (0% loss)," "Approximate round trip times in milli-seconds:"
[11] " Minimum = 0ms, Maximum = 0ms, Average = 0ms"

Using the option intern = TRUE makes it possible to save the output in a vector. I leave it to the reader as an exercise to rearrange this in order to get some decent output.
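
For example, here is a minimal sketch that pulls the millisecond round-trip times out of the vector captured above (assuming the Windows "time=" / "time<" reply format shown):

# keep only the "time=93ms" / "time<1ms" fragments, then strip the non-digits
times <- regmatches(X, regexpr("time[=<][0-9]+ms", X))
as.numeric(gsub("[^0-9]", "", times))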

Ping a Website or an IP Address (or Check if a Website is Online) using Swift 4?

I guess the problem is simple: you enabled App Sandbox but didn't check Outgoing Connections.

As for your pingHost method: it's completely correct. So I think the only problem is in the App Sandbox settings.


Web-scraping paginated website with difficult node

It is bad practice to ping the same page more than once in order to extract the requested information. You should read the page, extract all of the desired information and then move to the next page.

In this case the individual nodes are all stored in one master table. rvest's html_table() function is handy for converting an HTML table into a data frame.

library(rvest)
library(dplyr)

years <- seq(2010, 2015, by = 1)

pages <- c("http://aviation-safety.net/database/dblist.php?Year=") %>%
  paste0(years)

# Leaving out the category, location, operator, etc. nodes for the sake of brevity
read_table <- function(url) {
  # add a delay so that one is not attacking the host server (be polite)
  Sys.sleep(0.5)
  # read the page
  page <- read_html(url)
  # extract the table (the data frame is stored in the first element of the list)
  answer <- (page %>% html_nodes("table") %>% html_table())[[1]]
  # convert the fatalities column to character to standardize the column type
  answer$fat. <- as.character(answer$fat.)
  answer
}

# Combine the per-year tables into one data frame
aviation_df <- bind_rows(lapply(pages, read_table))

There are a few extra columns which will need clean-up.
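
As a hedged starting point for that clean-up, the sketch below drops any column that scraped in completely empty (assuming the spacer columns contain only NA or empty strings):

# keep only columns with at least one non-missing, non-empty value
keep <- sapply(aviation_df, function(col) any(!is.na(col) & col != ""))
aviation_df <- aviation_df[, keep]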

Check if URL exists in R

pingr::ping() only uses ICMP, which is blocked on sane organizational networks since attackers have used ICMP as a way to exfiltrate data and communicate with command-and-control servers.

pingr::ping_port() doesn't use the HTTP Host: header, so the IP address may be responding but the target virtual web host may not be running on it, and it definitely doesn't validate that the path exists at the target URL.

You should clarify what you want to happen when there are only non-200:299 range HTTP status codes. The following makes an assumption.

NOTE: You used Amazon as an example and I'm hoping that's the first site that just "came to mind" since it's unethical and a crime to scrape Amazon and I would appreciate my code not being brought into your universe if you are in fact just a brazen content thief. If you are stealing content, it's unlikely you'd be up front here about that, but on the outside chance you are both stealing and have a conscience, please let me know so I can delete this answer so at least other content thieves can't use it.

Here's a self-contained function for checking URLs:

#' @param x a single URL
#' @param non_2xx_return_value what to do if the site exists but the
#' HTTP status code is not in the `2xx` range. Default is to return `FALSE`.
#' @param quiet if not `FALSE`, then every time the `non_2xx_return_value` condition
#' arises a warning message will be displayed. Default is `FALSE`.
#' @param ... other params (`timeout()` would be a good one) passed directly
#' to `httr::HEAD()` and/or `httr::GET()`
url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE, ...) {

  suppressPackageStartupMessages({
    require("httr", quietly = FALSE, warn.conflicts = FALSE)
  })

  # you don't need these two functions if you're already using `purrr`
  # but `purrr` is a heavyweight compiled package that introduces
  # many other "tidyverse" dependencies and this doesn't.

  capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
    tryCatch(
      list(result = code, error = NULL),
      error = function(e) {
        if (!quiet)
          message("Error: ", e$message)

        list(result = otherwise, error = e)
      },
      interrupt = function(e) {
        stop("Terminated by user", call. = FALSE)
      }
    )
  }

  safely <- function(.f, otherwise = NULL, quiet = TRUE) {
    function(...) capture_error(.f(...), otherwise, quiet)
  }

  sHEAD <- safely(httr::HEAD)
  sGET <- safely(httr::GET)

  # Try HEAD first since it's lightweight
  res <- sHEAD(x, ...)

  if (is.null(res$result) ||
      ((httr::status_code(res$result) %/% 200) != 1)) {

    res <- sGET(x, ...)

    if (is.null(res$result)) return(NA) # or whatever you want to return on "hard" errors

    if (((httr::status_code(res$result) %/% 200) != 1)) {
      if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
      return(non_2xx_return_value)
    }

    return(TRUE)

  } else {
    return(TRUE)
  }

}

Give it a go:

c(
  "http://content.thief/",
  "http://rud.is/this/path/does/not_exist",
  "https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=content+theft",
  "https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t+be+a+content+thief&btnK=Google+Search&oq=don%27t+be+a+content+thief&gs_l=psy-ab.3...934.6243..7114...2.0..0.134.2747.26j6....2..0....1..gws-wiz.....0..0j35i39j0i131j0i20i264j0i131i20i264j0i22i30j0i22i10i30j33i22i29i30j33i160.mY7wCTYy-v0",
  "https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-with-example-usage-in-r/"
) -> some_urls

data.frame(
  exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
  some_urls,
  stringsAsFactors = FALSE
) %>% dplyr::tbl_df() %>% print()
## A tibble: 5 x 2
## exists some_urls
## <lgl> <chr>
## 1 NA http://content.thief/
## 2 FALSE http://rud.is/this/path/does/not_exist
## 3 TRUE https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=con…
## 4 TRUE https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t…
## 5 TRUE https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-wi…
## Warning message:
## In FUN(X[[i]], ...) :
## Requests for [http://rud.is/this/path/does/not_exist] responded but without an HTTP status code in the 200-299 range
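
Since everything in ... is passed straight through to httr::HEAD() and httr::GET(), you can bound how long each check is allowed to take, as the docstring hints; the 5-second value below is purely illustrative:

# fail fast on unresponsive hosts instead of waiting for the default timeout
url_exists("https://rud.is/", httr::timeout(5))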

How to access links using a different proxy server in R?

To use a proxy, you need to be able to connect to it. Are you sure you can connect to the proxy server 202.40.185.107:8080? You can test that easily by e.g. putting 202.40.185.107:8080 in your browser, or by checking from the command line whether the host answers (note that plain ping only reaches the host, not a specific port).
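
From R itself, a TCP connection test is a reasonable stand-in; here is a minimal sketch, assuming the pingr package is installed (the address is the proxy from the question):

library(pingr)

# "ping" the port by opening a TCP connection rather than sending ICMP;
# NA entries mean the port did not answer within the timeout
ping_port("202.40.185.107", port = 8080, count = 3)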

You could try a different proxy. I found this one online and it is free. Just a word of caution - if you are using a proxy in order not to be blocked by the website owner, the proxy you would be using can be blocked by the website owner as well.

GET("http://had.co.nz", use_proxy("35.169.156.54", 3128), verbose())

How to determine if you have an internet connection in R

Here is an attempt at parsing the output from ipconfig/ifconfig, as suggested by Spacedman.

havingIP <- function() {
  if (.Platform$OS.type == "windows") {
    ipmessage <- system("ipconfig", intern = TRUE)
  } else {
    ipmessage <- system("ifconfig", intern = TRUE)
  }
  validIP <- "((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
  any(grep(validIP, ipmessage))
}

With a simple TRUE/FALSE output:

> havingIP()
[1] TRUE
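
Note that the pattern also matches the loopback address 127.0.0.1, which shows up in ifconfig output on Linux, so the function can return TRUE there even with no outside connectivity. A stricter cross-check is to ping an external host, reusing the system2 idea from the first answer; this is only a sketch, and 8.8.8.8 (Google's public DNS) is just an illustrative target:

havingInternet <- function() {
  # the packet-count flag is -n on Windows and -c elsewhere
  flag <- if (.Platform$OS.type == "windows") "-n" else "-c"
  system2("ping", c(flag, "1", "8.8.8.8"),
          stdout = FALSE, stderr = FALSE) == 0
}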

Don't understand what ping website list is

The > operator is a redirection of the output of the command on the left to the file on the right. The basic syntax is:

command > file

This means that in your example, ping www.example.com > list, a ping command is run on www.example.com and, instead of the output being written to the command window, it is written to the file named "list".

The ping command output looks like this:

C:\>ping www.example.com

Pinging www.example.com [93.184.216.34] with 32 bytes of data:
Reply from 93.184.216.34: bytes=32 time=93ms TTL=56
Reply from 93.184.216.34: bytes=32 time=93ms TTL=56
Reply from 93.184.216.34: bytes=32 time=93ms TTL=56
Reply from 93.184.216.34: bytes=32 time=93ms TTL=56

Ping statistics for 93.184.216.34:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 93ms, Maximum = 93ms, Average = 93ms

With your example it is not displayed on screen but written to the file "list". I suggest using a proper file name extension for the target file name, e.g. "list.txt".
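
If you later want the captured output back in R, reading the file is a one-liner (assuming the suggested file name):

# one element per line of the redirected ping output
readLines("list.txt")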
