How to Download a Large Binary File with Rcurl *After* Server Authentication


This is now possible with the httr package. Thanks, Hadley!

https://github.com/hadley/httr/issues/44

How to use R to download a zipped file from an SSL page that requires cookies

This is a bit easier to do with httr because it sets up everything so that cookies and https work seamlessly.

The easiest way to generate the cookies is to have the site do it for you, by manually posting the information that the "I agree" form generates. You then do a second request to download the actual file.

library(httr)
terms <- "http://www.icpsr.umich.edu/cgi-bin/terms"
download <- "http://www.icpsr.umich.edu/cgi-bin/bob/zipcart2"

values <- list(agree = "yes", path = "SAMHDA", study = "32722", ds = "",
               bundle = "all", dups = "yes")

# Accept the terms on the form,
# generating the appropriate cookies
POST(terms, body = values)
GET(download, query = values)

# Actually download the file (this will take a while)
resp <- GET(download, query = values)

# write the content of the download to a binary file
writeBin(content(resp, "raw"), "c:/temp/thefile.zip")
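
Note that content(resp, "raw") buffers the whole download in memory before writeBin() runs. For a really large file, a hedged alternative (reusing the same terms, download, and values objects as above) is to stream straight to disk with httr::write_disk():

# Accept the terms first so the session has the right cookies
POST(terms, body = values)

# Stream the body straight to a file instead of holding it all in RAM
resp <- GET(download, query = values,
            write_disk("c:/temp/thefile.zip", overwrite = TRUE),
            progress())
stop_for_status(resp)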

Create a C-level file handle in RCurl for writing downloaded files

I think you want to use writedata, and remember to close the file:

library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://cran.fhcrc.org/Rlogo.jpg"
curlPerform(url = url, writedata = f@ref)
close(f)
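
A quick sanity check on that transfer (nothing RCurl-specific; it just reuses the filename and url from the snippet above):

# A JPEG should be non-empty and start with the magic bytes ff d8
file.info(filename)$size
readBin(filename, "raw", n = 2)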

For more elaborate writing, I'm not sure if this is the best way, but Linux tells me, from

man curl_easy_setopt

that there's a curl option CURLOPT_WRITEFUNCTION whose value is a pointer to a C function with the prototype

size_t function(void *ptr, size_t size, size_t nmemb, void *stream);

and in R at the end of ?curlPerform there's an example of calling a C function as the 'writefunction' option. So I created a file curl_writer.c

#include <stdio.h>

size_t
writer(void *buffer, size_t size, size_t nmemb, void *stream)
{
    fprintf(stderr, "<writer> size = %d, nmemb = %d\n",
            (int) size, (int) nmemb);
    return size * nmemb;
}

Compiled it

R CMD SHLIB curl_writer.c

which on Linux produces a file curl_writer.so, and then in R

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)

and get on stderr

<writer> size = 1, nmemb = 2653
<writer> size = 1, nmemb = 520
OK

These two ideas can be integrated, i.e., writing to an arbitrary file using an arbitrary function, by modifying the C function to use the FILE * we pass in, as

#include <stdio.h>

size_t
writer(void *buffer, size_t size, size_t nmemb, void *stream)
{
    FILE *fout = (FILE *) stream;
    fprintf(fout, "<writer> size = %d, nmemb = %d\n",
            (int) size, (int) nmemb);
    fflush(fout);
    return size * nmemb;
}

and then back in R after compiling

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL=url, writedata=f@ref, writefunction=writer)
close(f)

getURL can be used here, too, provided both writedata=f@ref and write=writer are supplied. I think the problem in the original question is that R_curl_write_binary_data is really an internal function, writing to a buffer managed by RCurl, rather than to a file handle like the one created by CFILE. Likewise, specifying writedata without write (which, from the source code of getURL, appears to be an alias for writefunction) sends a pointer to a file to a function that expects a pointer to something else; for getURL both writedata and write need to be provided.
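
For completeness, a minimal sketch of that getURL() variant, reusing the writer symbol and a CFILE handle from above and passing both writedata and write as just described (treat the write argument name as the alias noted above, not gospel):

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address

# Pass both the file handle and the write callback, per the note above
f <- CFILE(filename <- tempfile(), "wb")
getURL(url, writedata = f@ref, write = writer)
close(f)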

Using RCurl or any other R package

Well, tbh, you've done no research and want folks to give you special treatment, which is fine, but it's not going to get you very far on SO. There are tons of questions on SO about RCurl and loads of web sites that specifically talk about how to use it in the context of FTP downloads.

But the following might help someone who has done some research and is truly stuck, and it will also show how to use the more modern curl and httr packages.

On top of some RCurl tutoring, you also kinda expected folks to register for that site (since one might have assumed there were idiosyncrasies in that site's FTP server causing issues with RCurl… I mean, we have no context, so that's as valid an assumption as any).

Put these in ~/.Renviron and restart your R session:

ACRI_FTP_USERNAME=your-username
ACRI_FTP_PASSWORD=your-password

Do some basic research (it's in the manuals on the R-Project site) on getting environment variables into R if you've not done that before.

If you don't do at least that, you're putting bare credentials into scripts, which is horribad for security. There are other ways to manage "secrets" more formally, but I suspect these FTP credentials aren't exactly "super-secret" bits of info. Doing this also makes your scripts more generic (i.e. others can use them if they follow the same pattern and supply their own creds).
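
Once the values are in ~/.Renviron and the session is restarted, Sys.getenv() is all you need to pull them back out (these are the variable names defined above):

# Both should print the values from ~/.Renviron; an empty string means it wasn't picked up
Sys.getenv("ACRI_FTP_USERNAME")
Sys.getenv("ACRI_FTP_PASSWORD")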

We'll use curl and httr:

library(curl)
library(httr)

You may not want to use your browser to look at directory listings, and browsers may stop supporting FTP soon (Mozilla is abandoning support for reading RSS feeds, and neither Chrome nor Firefox can read Gopher sites, so you never know). Browsers also tend to be super slow with FTP for some reason.

We'll make a function to make it easier to do directory listings:

get_dir_listing <- function(path = "/") {
  curl_fetch_memory(
    paste0("ftp://ftp.hermes.acri.fr", path),
    new_handle(
      username = Sys.getenv("ACRI_FTP_USERNAME"),
      password = Sys.getenv("ACRI_FTP_PASSWORD"),
      dirlistonly = TRUE
    )
  ) -> res

  strsplit(readBin(res$content, "character"), "\n")[[1]]
}

Now we can explore the server (we'll go down one branch of the tree; the slashes matter):

get_dir_listing()
## [1] "GLOB" "animation" "OSS2015" "EURO"

get_dir_listing("/GLOB/")
## [1] "meris" "viirsn" "merged" "olcia" "modis" "seawifs"

get_dir_listing("/GLOB/meris/")
## [1] "month" "8-day" "day"

get_dir_listing("/GLOB/meris/month/")
## [1] "2011" "2002" "2006" "2012" "2005" "2009" "2004" "2008" "2007" "2010" "2003"

get_dir_listing("/GLOB/meris/month/2011/")
## [1] "09" "05" "01" "12" "06" "02" "11" "03" "10" "07" "08" "04"

get_dir_listing("/GLOB/meris/month/2011/09/")
## [1] "01"

Jackpot!

get_dir_listing("/GLOB/meris/month/2011/09/01/")
## [1] "L3b_20110901-20110930__GLOB_4_AV-MER_KD490-LEE_MO_00.nc"
## [2] "L3m_20110901-20110930__GLOB_25_AV-MER_ZHL_MO_00.nc"
## [3] "L3b_20110901-20110930__GLOB_4_AV-MER_ZSD_MO_00.nc"
## [4] "L3m_20110901-20110930__GLOB_100_AV-MER_ZSD_MO_00.nc"
## [5] "L3b_20110901-20110930__GLOB_4_AV-MER_A865_MO_00.nc"
## [6] "L3m_20110901-20110930__GLOB_100_AV-MER_A865_MO_00.nc"
## [7] "L3m_20110901-20110930__GLOB_25_AV-MER_CHL1_MO_00.png"
## [8] "L3m_20110901-20110930__GLOB_25_AV-MER_CF_MO_00.png"
## [9] "L3m_20110901-20110930__GLOB_25_AV-MER_NRRS443_MO_00.png"
## [10] "L3m_20110901-20110930__GLOB_4_AV-MER_CHL-OC5_MO_00.nc"
## [11] "L3m_20110901-20110930__GLOB_100_AV-MER_KDPAR_MO_00.nc"
## [12] "L3b_20110901-20110930__GLOB_4_AV-MER_NRRS670_MO_00.nc"
## [13] "L3m_20110901-20110930__GLOB_25_AV-MER_NRRS490_MO_00.png"
## [14] "L3b_20110901-20110930__GLOB_4_AV-MER_NRRS412_MO_00.nc"
## [15] "L3m_20110901-20110930__GLOB_4_AV-MER_A865_MO_00.nc"
## [16] "L3m_20110901-20110930__GLOB_4_AV-MER_NRRS490_MO_00.nc"
## [17] "L3m_20110901-20110930__GLOB_25_AV-MER_KD490_MO_00.png"
## [18] "L3m_20110901-20110930__GLOB_4_GSM-MER_CHL1_MO_00.nc"
## [19] "L3b_20110901-20110930__GLOB_4_AV-MER_T550_MO_00.nc"
## [20] "L3m_20110901-20110930__GLOB_25_AV-MER_CHL-OC5_MO_00.png"
## [21] "L3m_20110901-20110930__GLOB_25_AV-MER_ZSD-DORON_MO_00.nc"
## ... there are a lot of them

Now you probably want to download one of them. I know .nc files are generally huge even though I never have to use them, b/c I've read and answered a lot of SO questions about them.

We'll use httr for the download as it takes care of a bunch of things for us:

httr::GET(
  url = "ftp://ftp.hermes.acri.fr/GLOB/meris/month/2011/09/01/L3m_20110901-20110930__GLOB_4_GSM-MER_CHL1_MO_00.nc",
  httr::authenticate(Sys.getenv("ACRI_FTP_USERNAME"), Sys.getenv("ACRI_FTP_PASSWORD")),
  httr::write_disk("~/Data/L3m_20110901-20110930__GLOB_4_GSM-MER_CHL1_MO_00.nc"),
  httr::progress()
) -> res

httr::stop_for_status(res)

You can safely ignore the warnings and diagnostics:

## Warning messages:
## 1: In parse_http_status(lines[[1]]) :
## NAs introduced by coercion to integer range
## 2: Failed to parse headers:
## 229 Entering Extended Passive Mode (|||28926|)
## 200 Type set to I
## 213 92373747
## 150 Opening BINARY mode data connection for L3m_20110901-20110930__GLOB_4_GSM-MER_CHL1_MO_00.nc (92373747 bytes)
## 226 Transfer complete

You can tell the transfer worked because the file has the proper magic header for the file command:

$ file L3m_20110901-20110930__GLOB_4_GSM-MER_CHL1_MO_00.nc
L3m_20110901-20110930__GLOB_4_GSM-MER_CHL1_MO_00.nc: Hierarchical Data Format (version 5) data

Hopefully this did help out someone who is truly stuck, since there is (as stated) plenty of content on SO and elsewhere about how to authenticate to FTP servers, perform directory traversals, and download content. This is now one more entry added to that corpus.

Using R to access FTP Server and Download Files Results in Status 530 Not logged in

I think you're going to need a more defensive strategy when working with this FTP server:

library(curl)  # ++gd > RCurl
library(purrr) # consistent "data first" functional & piping idioms FTW
library(dplyr) # progress bar

# We'll use this to fill in the years
ftp_base <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%s/"

dir_list_handle <- new_handle(ftp_use_epsv=FALSE, dirlistonly=TRUE, crlf=TRUE,
                              ssl_verifypeer=FALSE, ftp_response_timeout=30)

# Since you, yourself, noted the server was perhaps behaving strangely or under load
# it's prbly a much better idea (and a practice of good netizenship) to cache the
# results somewhere predictable rather than a temporary, ephemeral directory
cache_dir <- "./gsod_cache"
dir.create(cache_dir, showWarnings=FALSE)

# Given the sporadic efficacy of server connection, we'll wrap our calls
# in safe & retry functions. Change this variable if you want to have it retry
# more times.
MAX_RETRIES <- 6

# Wrapping the memory fetcher (for dir listings)
s_curl_fetch_memory <- safely(curl_fetch_memory)
retry_cfm <- function(url, handle) {

  i <- 0
  repeat {
    i <- i + 1
    res <- s_curl_fetch_memory(url, handle=handle)
    if (!is.null(res$result)) return(res$result)
    if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
  }

}

# Wrapping the disk writer (for the actual files)
# Note the use of the cache dir. It won't waste your bandwidth or the
# server's bandwidth or CPU if the file has already been retrieved.
s_curl_fetch_disk <- safely(curl_fetch_disk)
retry_cfd <- function(url, path) {

  # you should prbly be a bit more thorough than `basename` since
  # i think there are issues with the 1971 and 1972 filenames.
  # Gotta leave some work up to the OP
  cache_file <- sprintf("%s/%s", cache_dir, basename(url))
  if (file.exists(cache_file)) return()

  i <- 0
  repeat {
    i <- i + 1
    if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
    res <- s_curl_fetch_disk(url, cache_file)
    if (!is.null(res$result)) return()
  }

}

# the stations and years
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
             "984260-41231", "984290-99999", "984300-99999", "984320-99999",
             "984330-99999")
years <- 1960:2016

# progress indicators are like bowties: cool
pb <- progress_estimated(length(years))
walk(years, function(yr) {

  # the year we're working on
  year_url <- sprintf(ftp_base, yr)

  # fetch the directory listing
  tmp <- retry_cfm(year_url, handle=dir_list_handle)
  con <- rawConnection(tmp$content)
  fils <- readLines(con)
  close(con)

  # sift out only the target stations
  map(station, ~grep(., fils, value=TRUE)) %>%
    keep(~length(.)>0) %>%
    flatten_chr() -> fils

  # grab the stations files
  walk(paste(year_url, fils, sep=""), retry_cfd)

  # tick off progress
  pb$tick()$print()

})
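
Once the walk() finishes (or even after a partial, interrupted run), everything retrieved so far is sitting in the cache directory; a quick way to see what you have:

# The per-year GSOD files for the target stations accumulate here
head(list.files(cache_dir, full.names = TRUE))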

You may also want to set curl_interrupt to TRUE in the curl handle if you want to be able to stop/esc/interrupt the downloads.

Use R to mimic clicking on a file to download it

Some libs to help

You will actually need only dplyr, purrr, stringr, rvest, and xml2 (plus readxlsb at the end if you want to read the downloaded .xlsb files).

library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
library(htmltab)
library(xml2)
library(readxl)

I like to build the URL this way because some sites use relative (partial) links.

base <- "https://rigcount.bakerhughes.com"
url <- paste0(base, "/na-rig-count")

# find links
url_html <- xml2::read_html(url)
url_html %>%
  html_nodes("a") %>%
  html_attrs() %>%
  bind_rows() -> url_tbl

Check the href content and find a pattern you are interested in. You can also use Inspect in your browser; it is truly helpful.

url_tbl %>% 
count(href)
#> # A tibble: 22 x 2
#> href n
#> <chr> <int>
#> 1 / 1
#> 2 /email-alerts 1
#> 3 /intl-rig-count 1
#> 4 /na-rig-count 1
#> 5 /rig-count-faqs 1
#> 6 /rig-count-overview 2
#> 7 #main-menu 1
#> 8 https://itunes.apple.com/app/baker-hughes-rig-counts/id393570114?mt=8 1
#> 9 https://rigcount.bakerhughes.com/static-files/4ab04723-b638-4310-afd9-… 1
#> 10 https://rigcount.bakerhughes.com/static-files/4b92b553-a48d-43a3-b4d9-… 1
#> # … with 12 more rows

At first I noticed that static-files might be a good pattern to match in href, but then I found a better one in type.

url_tbl %>%
  filter(str_detect(type, "ms-excel")) -> url_xlsx

Build our list of files (remember to clean out noise such as extra dots, spaces, and special characters). I hope someone proposes a better way to handle those.

myFiles <- pull(url_xlsx, "href")
names <- pull(url_xlsx, "title")
names(myFiles) <- paste0(
  str_replace_all(names, "[\\.\\-\\ ]", "_"),
  str_extract(names, ".\\w+$")
)

# download data
myFiles %>%
  imap(
    ~ download.file(
      url = .x,
      destfile = .y,
      method = "curl", # might not be necessary
      extra = "-k"
    )
  )
#> $`north_america_rotary_rig_count_jan_2000_-_current.xlsb`
#> [1] 0
#>
#> $`north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb`
#> [1] 0
#>
#> $`U.S. Monthly Averages by State 1992-2016.xls`
#> [1] 0
#>
#> $`North America Rotary Rig Counts through 2016.xls`
#> [1] 0
#>
#> $`U.S. Annual Averages by State 1987-2016.xls`
#> [1] 0
#>
#> $Workover_9.xls
#> [1] 0

Created on 2020-12-16 by the reprex package (v0.3.0)

Now you can read in your files.

# readxlsb handles the .xlsb files; the .xls ones would need readxl::read_excel instead
names(myFiles) %>%
  map(
    readxlsb::read_xlsb
  ) -> myData

I hope it helps.

Passing correct params to RCurl/postForm

Does this provide the correct PDF?

library(httr)
library(rvest)
library(xml2)
library(purrr)

# setup inane sharepoint viewstate parameters
res <- GET(url = "https://www.ffiec.gov/nicpubweb/nicweb/OrgHierarchySearchForm.aspx",
           query = list(parID_RSSD=2162966, parDT_END=99991231))

# extract them
pg <- content(res, as="parsed")
hidden <- html_nodes(pg, xpath=".//form/input[@type='hidden']")
params <- setNames(as.list(xml_attr(hidden, "value")), xml_attr(hidden, "name"))

# pile on more params
params <- c(
  params,
  grpInstitution = "rbCurInst",
  lbTopHolders = "2961897",
  grpHMDA = "rbNonHMDA",
  lbTypeOfInstitution = "-99",
  txtAsOfDate = "12/28/2016",
  txtAsOfDateErrMsg = "",
  lbHMDAYear = "2015",
  grpRptFormat = "rbRptFormatPDF",
  btnSubmit = "Submit"
)

# submit the req and save to disk
POST(url = "https://www.ffiec.gov/nicpubweb/nicweb/OrgHierarchySearchForm.aspx",
     query = list(parID_RSSD=2162966, parDT_END=99991231),
     add_headers(Origin = "https://www.ffiec.gov"),
     body = params,
     encode = "form",
     write_disk("/tmp/output.pdf")) -> res2

wget/curl large file from google drive

WARNING: This functionality is deprecated.


Have a look at this question: Direct download from Google Drive using Google Drive API

Basically, you have to create a public directory and access your files by relative reference, with something like

wget https://googledrive.com/host/LARGEPUBLICFOLDERID/index4phlat.tar.gz

Alternatively, you can use this script: https://github.com/circulosmeos/gdown.pl
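
Since the public-folder trick above is deprecated, one hedged alternative from R is the googledrive package (the file id below is a placeholder; you also need permission to read the file):

library(googledrive)

# drive_download() streams the file to disk and handles the OAuth dance for you
drive_download(as_id("YOUR-FILE-ID"), path = "bigfile.tar.gz", overwrite = TRUE)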


