R - Error When Using getURL from RCurl After Site Was Changed

Object Moved error when using the RCurl getURL function to access an ASP webpage

Use the followlocation curl option:

getURL(u, .opts = curlOptions(followlocation = TRUE))

with added cookiefile goodness - it's supposed to be a file that doesn't exist, but I'm not sure how you can be sure of that:

w = getURL(u, .opts = curlOptions(followlocation = TRUE, cookiefile = "nosuchfile"))
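
If you want to confirm where the redirect actually ended up, here is a small sketch (the URL is a placeholder for whatever page you are hitting, and the getCurlInfo() field names are as I recall them from RCurl):

library(RCurl)

h <- getCurlHandle()
u <- "http://example.com/some-asp-page"   # placeholder URL

w <- getURL(u,
            curl  = h,
            .opts = curlOptions(followlocation = TRUE,
                                cookiefile     = "nosuchfile"))

getCurlInfo(h)$effective.url   # the URL that was actually served after redirects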

RCurl::getURL sporadically fails when run in loop

OK, I have a solution that has not failed for me yet:
I create a tryCatch with a maximum-attempt counter, defaulting to 5 attempts with a wait time of 1 second between retries, in addition to a general wait time of 0.05 seconds per accepted URL request.

Let me know if anyone has a safer idea:

safe.url <- function(url, attempt = 1, max.attempts = 5) {
  tryCatch(
    expr = {
      Sys.sleep(0.05)
      RCurl::getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
    },
    error = function(e) {
      if (attempt >= max.attempts) stop("Server is not responding to download data, wait 30 seconds and try again!")
      Sys.sleep(1)
      safe.url(url, attempt = attempt + 1, max.attempts = max.attempts)
    })
}

url <- "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR105/056/SRR10503056/SRR10503056.fastq.gz"
for (i in 1:100) {
  safe.url(url)
}
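
If the recursion bothers you, the same idea can be written iteratively so that long retry chains never grow the call stack. This is just my sketch of the same approach, not part of the original answer:

safe.url2 <- function(url, max.attempts = 5) {
  for (attempt in seq_len(max.attempts)) {
    result <- tryCatch(
      expr = {
        Sys.sleep(0.05)   # general wait time per request
        RCurl::getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
      },
      error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(1)          # back off before the next attempt
  }
  stop("Server is not responding to download data, wait 30 seconds and try again!")
}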

Unable to access directory of HTML site using R (RCurl package)

I think the issue here is that the server only supports requests using TLS version 1.2 and your RCurl does not support it.
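
One quick way to check that theory (a sketch; the field name is the one RCurl::curlVersion() reports) is to look at which SSL/TLS library your libcurl build is linked against; a very old OpenSSL version there would explain the failure:

RCurl::curlVersion()$ssl_version   # a very old OpenSSL build here suggests no TLS 1.2 support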

You might be able to achieve what you want using httr and rvest. For example, to get a tibble listing the files in the 1929 directory:

library(httr)
library(rvest)

url1 <- "https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/1929"
page_data <- GET(url1)

files <- content(page_data, as = "parsed") %>%
  html_table() %>%
  .[[1]]

files

# A tibble: 24 x 4
   Name               `Last modified`    Size  Description
   <chr>              <chr>              <chr> <lgl>
 1 ""                 ""                 ""    NA
 2 "Parent Directory" ""                 "-"   NA
 3 "03005099999.csv"  "2019-01-19 12:37" "20K" NA
 4 "03075099999.csv"  "2019-01-19 12:37" "20K" NA
 5 "03091099999.csv"  "2019-01-19 12:37" "17K" NA
 6 "03159099999.csv"  "2019-01-19 12:37" "20K" NA
 7 "03262099999.csv"  "2019-01-19 12:37" "20K" NA
 8 "03311099999.csv"  "2019-01-19 12:37" "19K" NA
 9 "03379099999.csv"  "2019-01-19 12:37" "33K" NA
10 "03396099999.csv"  "2019-01-19 12:37" "21K" NA
# ... with 14 more rows
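
From there you can fetch an individual file with the same httr approach. A short sketch using one of the file names from the listing above (any of the others works the same way); write_disk() saves the response body straight to a local file:

csv_url <- paste0(url1, "/03005099999.csv")
GET(csv_url, write_disk("03005099999.csv", overwrite = TRUE))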

Get final URL after curl is redirected

curl's -w option and its url_effective variable are what you are looking for.

Something like

curl -Ls -o /dev/null -w %{url_effective} http://google.com

More info:


-L Follow redirects
-s Silent mode. Don't output anything
-o FILE Write output to <file> instead of stdout
-w FORMAT What to output after completion


You might want to add -I (an uppercase i) as well, which makes the command skip downloading the body; but it then also uses the HEAD method, which is not what the question asked about and risks changing what the server does. Some servers don't respond well to HEAD even when they respond fine to GET.
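
If you would rather get the final URL from R than from the shell, here is a small sketch with httr (it follows redirects by default and records where it ended up on the response object):

library(httr)

res <- GET("http://google.com")   # redirects are followed automatically
res$url                           # the final URL after all redirects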

How to avoid null character when scraping a web page with getURL() in R?

RCurl::getURL() seems not to be detecting either the Content-Encoding: gzip header or the tell-tale two-byte "magic" code at the start of the body that also signals the content is gzip-encoded.

I would suggest, as Michael did, switching to httr for reasons I'll go into in a bit, but this would be a better httr idiom:

library(httr)

res <- GET("http://dogecoin.com/")
content(res)

The content() function extracts and parses the response body, returning an xml2 object, which is similar to the parsed object from the XML package that you would likely have been using alongside RCurl::getURL().

The alternative way is to add some crutches to RCurl::getURL():

html_text_res <- RCurl::getURL("http://dogecoin.com/", encoding="gzip")

Here, we're explicitly informing getURL() that the content is gzip'd, but that's fraught with peril: if the upstream server decides to use, say, brotli encoding instead, then you'll get an error.

If you still want to use RCurl rather than switching to httr, I'd suggest doing the following for this site:

RCurl::getURL("http://dogecoin.com/", 
encoding = "gzip",
httpheader = c(`Accept-Encoding` = "gzip"))

Here we're giving getURL() the decoding crutch, but also explicitly telling the upstream server that gzip is acceptable and that it should send data with that encoding.

However, httr would be a better choice since it and the curl package it uses deal with web server interaction and content in a more thorough way.
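
For instance, with httr you can inspect what encoding the server actually negotiated rather than guessing; a small sketch (the exact header value depends on what the server sends back):

library(httr)

res <- GET("http://dogecoin.com/")
headers(res)[["content-encoding"]]   # e.g. "gzip" if the body was compressed
content(res)                         # decompressed and parsed transparently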


