How to Use Cookies with RCurl

Post request using cookies with cURL, RCurl and httr

Based on Juba's suggestion, here is a working RCurl template.

The code emulates a browser's behaviour, in that it:

  1. retrieves cookies on a login screen and
  2. reuses them on subsequent page requests that contain the actual data.


### RCurl login and browse private pages ###

library("RCurl")

loginurl = "http://www.*****"
mainurl = "http://www.*****"
agent = "Mozilla/5.0"

#User account data and other login pars
pars = list(
  RURL = "http://www.*****",
  Username = "*****",
  Password = "*****"
)

#RCurl pars
curl = getCurlHandle()
curlSetOpt(cookiejar="cookiesk.txt", useragent = agent, followlocation = TRUE, curl=curl)
#or simply
#curlSetOpt(cookiejar="", useragent = agent, followlocation = TRUE, curl=curl)

#post login form
web=postForm(loginurl, .params = pars, curl=curl)

#go to main url with real data
web=getURL(mainurl, curl=curl)

#parse/print content of web
#..... etc. etc.


#This has the side effect of saving cookie data to the cookiejar file
rm(curl)
gc()
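
If you need those cookies again in a later session, a minimal sketch (not part of the original template) is to point a fresh handle at the saved cookiesk.txt file via the cookiefile option:

library("RCurl")

#reuse the cookies previously written to cookiesk.txt
curl = getCurlHandle()
curlSetOpt(cookiefile = "cookiesk.txt", useragent = agent, followlocation = TRUE, curl = curl)

#the stored session cookies are sent with this request
web = getURL(mainurl, curl = curl)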

How to download a file using cookies with RCurl

You can use getBinaryURL:

# my_picture is the URL of the file to download; reusing the curl handle
# from the login step sends the stored session cookies with the request
myBin <- getBinaryURL(my_picture, curl = curl)
writeBin(myBin, "my_pic.png")

Using R to accept cookies to download a PDF file

Your plea on rOpenSci has been heard!

There's a lot of JavaScript between those pages, which makes it somewhat annoying to decipher via httr + rvest. Try RSelenium instead. The following worked on OS X 10.11.2 with R 3.2.3 and Firefox installed.

library(RSelenium)

# check if a Selenium server is present; if not, download one
checkForServer()

# get the server going
startServer()

dir.create("~/justcreateddir")
setwd("~/justcreateddir")

# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
  `browser.download.folderList` = as.integer(2),
  `browser.download.dir` = getwd(),
  `pdfjs.disabled` = TRUE,
  `plugin.scan.plid.all` = FALSE,
  `plugin.scan.Acrobat` = "99.0",
  `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()

# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))

# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))

Now wait for the download to complete. The R console will not be busy while the file downloads, so it is easy to close the session accidentally before the download has finished.
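
If you prefer the script to block until the file appears, a minimal polling sketch along the lines below can help; the ".pdf" pattern and the two-minute timeout are assumptions for illustration, not part of the original answer.

# poll the download directory (set with setwd() above) until a PDF shows up
timeout <- 120
while (timeout > 0 && length(list.files(getwd(), pattern = "\\.pdf$")) == 0) {
  Sys.sleep(5)
  timeout <- timeout - 5
}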

# close the session
dr$close()

How to use R to download a zipped file from an SSL page that requires cookies

This is a bit easier to do with httr because it sets up everything so that cookies and https work seamlessly.

The easiest way to generate the cookies is to have the site do it for you, by posting the same information that the "I agree" form would submit. You then make a second request to download the actual file.

library(httr)
terms <- "http://www.icpsr.umich.edu/cgi-bin/terms"
download <- "http://www.icpsr.umich.edu/cgi-bin/bob/zipcart2"

values <- list(agree = "yes", path = "SAMHDA", study = "32722", ds = "",
               bundle = "all", dups = "yes")

# Accept the terms on the form,
# generating the appropriate cookies
POST(terms, body = values)

# Actually download the file (this will take a while)
resp <- GET(download, query = values)

# write the content of the download to a binary file
writeBin(content(resp, "raw"), "c:/temp/thefile.zip")
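
Once the archive is on disk, base R's unzip() can be used to inspect and extract it, for example:

# list the contents of the downloaded archive, then extract it
unzip("c:/temp/thefile.zip", list = TRUE)
unzip("c:/temp/thefile.zip", exdir = "c:/temp/thefile")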

How to set the domain of cookies in R using rvest, httr, curl?

Adding the argument config = config(cookiejar = 'cookies.txt') to your rvest call, e.g. submit_form(session = s, form = f, config = config(cookiejar = 'cookies.txt')), does the trick. There's no need to have previously created a file named cookies.txt, by the way: it's all handled automatically.
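
For context, here is a minimal sketch of where that argument fits; the login URL and form field names are placeholders rather than anything from the original question.

library(rvest)
library(httr)

# placeholder login page and form field names
s <- html_session("http://www.example.com/login")
f <- html_form(s)[[1]]
f <- set_values(f, username = "*****", password = "*****")

# passing the cookiejar config writes the session cookies to cookies.txt
submit_form(session = s, form = f,
            config = config(cookiejar = "cookies.txt"))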

Set cookies with rvest

You can get the __VIEWSTATE and __VIEWSTATEGENERATOR values from the hidden form fields of the page returned when you GET https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx, and then reuse them in the subsequent POST request and in the GET request for the CSV.

options(stringsAsFactors=FALSE)
library(httr)
library(curl)
library(xml2)

url <- 'https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx'

#get the hidden view-state fields from the initial page
req <- GET(url)
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE", "__VIEWSTATEGENERATOR")
viewheaders <- lapply(fields, function(x) {
  xml_attr(xml_find_first(req_html, paste0(".//input[@id='", x, "']")), "value")
})
names(viewheaders) <- fields

#post request. you can get the list of form fields using tools like Fiddler
params <- c(viewheaders,
            list(
              "M$ctl19" = "M$UpdatePanel|M$C$csfFilter$btnExport",
              "M$C$csfFilter$ddlNameType" = "Any",
              "M$C$csfFilter$ddlField" = "Elections",
              "M$C$csfFilter$ddlReportYear" = "2017",
              "M$C$csfFilter$ddlStatus" = "Default",
              "M$C$csfFilter$ddlValue" = -1,
              "M$C$csfFilter$btnExport" = "Export"))
resp <- POST(url, body=params, encode="form")
print(resp$status_code)
resptext <- rawToChar(resp$content)
#writeLines(resptext, "apoc.html")

#get response i.e. download csv
url <- "https://aws.state.ak.us//ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
req <- GET(url, body=params)
read.csv(text=rawToChar(req$content))

You might need to play around with the inputs/code to get what you want precisely.

Here is another similar solution using RCurl:
how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r


