Object Moved error in using the RCurl getURL function in order to access an ASP Webpage
Use the followlocation curl option:
getURL(u,.opts=curlOptions(followlocation=TRUE))
with added cookiefile goodness. The cookiefile value is supposed to be a file that doesn't exist; libcurl accepts a non-existent file name here and simply uses it to switch on the cookie engine, so any name you are confident is unused will do:
w=getURL(u,.opts=curlOptions(followlocation=TRUE,cookiefile="nosuchfile"))
RCurl::getURL sporadically fails when run in loop
OK, here is a solution that has not failed for me: a tryCatch with a maximum-attempt counter, defaulting to 5 attempts with a 1 second wait between retries, plus a general wait of 0.05 seconds per accepted URL request.
Let me know if anyone has a safer idea:
safe.url <- function(url, attempt = 1, max.attempts = 5) {
  tryCatch(
    expr = {
      Sys.sleep(0.05)
      RCurl::getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
    },
    error = function(e) {
      if (attempt >= max.attempts)
        stop("Server is not responding to download data, wait 30 seconds and try again!")
      Sys.sleep(1)
      safe.url(url, attempt = attempt + 1, max.attempts = max.attempts)
    })
}
url <- "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR105/056/SRR10503056/SRR10503056.fastq.gz"
for (i in 1:100) {
  safe.url(url)
}
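The retry behaviour can be sanity-checked without the network by factoring the same tryCatch pattern into a generic wrapper and handing it a deliberately flaky function; the retry()/flaky() names below are my own illustration, not part of the answer above:

```r
# Generic retry wrapper with the same shape as safe.url(), but
# parameterised over the function it calls so it can run offline.
retry <- function(f, attempt = 1, max.attempts = 5, wait = 0) {
  tryCatch(
    expr = f(),
    error = function(e) {
      if (attempt >= max.attempts)
        stop("giving up after ", max.attempts, " attempts")
      Sys.sleep(wait)
      retry(f, attempt = attempt + 1, max.attempts = max.attempts, wait = wait)
    })
}

# Mock request that fails twice, then succeeds.
failures <- 0
flaky <- function() {
  if (failures < 2) {
    failures <<- failures + 1
    stop("temporary failure")
  }
  "ok"
}

retry(flaky)
# "ok", after two silent retries
```

Swapping f() for the RCurl::getURL() call recovers the original safe.url().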
Unable to access directory of HTML site using R (RCurl package)
I think the issue here is that the server only accepts requests made with TLS version 1.2, and your build of RCurl does not support it.
You might be able to achieve what you want using httr and rvest. For example, to get a tibble listing the files in the 1929 directory:
library(httr)
library(rvest)
url1 <- "https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/1929"
page_data <- GET(url1)
files <- content(page_data, as = "parsed") %>%
  html_table() %>%
  .[[1]]
files
# A tibble: 24 x 4
Name `Last modified` Size Description
<chr> <chr> <chr> <lgl>
1 "" "" "" NA
2 "Parent Directory" "" "-" NA
3 "03005099999.csv" "2019-01-19 12:37" "20K" NA
4 "03075099999.csv" "2019-01-19 12:37" "20K" NA
5 "03091099999.csv" "2019-01-19 12:37" "17K" NA
6 "03159099999.csv" "2019-01-19 12:37" "20K" NA
7 "03262099999.csv" "2019-01-19 12:37" "20K" NA
8 "03311099999.csv" "2019-01-19 12:37" "19K" NA
9 "03379099999.csv" "2019-01-19 12:37" "33K" NA
10 "03396099999.csv" "2019-01-19 12:37" "21K" NA
# ... with 14 more rows
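As a follow-up, the Name column from that tibble can be filtered down to the actual .csv entries and pasted onto the base URL to build download links; this step needs no network access (the entries vector below stands in for files$Name, with values taken from the listing above):

```r
base_url <- "https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/1929/"

# Stand-in for files$Name from the parsed listing above.
entries <- c("", "Parent Directory", "03005099999.csv", "03075099999.csv")

# Keep only rows that look like data files, then build the full URLs.
csv_names <- entries[grepl("\\.csv$", entries)]
file_urls <- paste0(base_url, csv_names)

file_urls[1]
# "https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/1929/03005099999.csv"
```

Each element of file_urls can then be handed to download.file() or httr::GET().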
Get final URL after curl is redirected
curl's -w option and its url_effective variable are what you are looking for.
Something like
curl -Ls -o /dev/null -w %{url_effective} http://google.com
More info
-L         Follow redirects
-s         Silent mode; don't output anything
-o FILE    Write output to FILE instead of stdout
-w FORMAT  What to output after completion
You might want to add -I (an uppercase i) as well, which makes the command not download any body. But note that it then uses the HEAD method, which is not what the question asked about, and it risks changing what the server does: some servers don't respond well to HEAD even when they respond fine to GET.
How to avoid null character when scraping a web page with getURL() in R?
RCurl::getURL() seems to detect neither the Content-Encoding: gzip header nor the tell-tale two-byte "magic" code at the start of the body that also signals gzip-encoded content.
I would suggest, as Michael did, switching to httr for reasons I'll go into in a bit, but this would be a better httr idiom:
library(httr)
res <- GET("http://dogecoin.com/")
content(res)
The content() function extracts the raw response and returns an xml2 object, which is similar to the XML library's parsed object you were likely using alongside RCurl::getURL().
The alternative is to add some crutches to RCurl::getURL():
html_text_res <- RCurl::getURL("http://dogecoin.com/", encoding="gzip")
Here, we're explicitly informing getURL() that the content is gzip'd, but that is fraught with peril: if the upstream server decides to use, say, brotli encoding instead, you'll get an error.
If you still want to use RCurl rather than switching to httr, I'd suggest the following for this site:
RCurl::getURL("http://dogecoin.com/",
              encoding = "gzip",
              httpheader = c(`Accept-Encoding` = "gzip"))
Here we're giving getURL() the decoding crutch while also explicitly telling the upstream server that gzip is acceptable and that it should send data with that encoding.
However, httr would be the better choice, since it and the curl package it builds on handle web server interaction and content negotiation far more thoroughly.
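For intuition about what the encoding = "gzip" crutch does, base R can round-trip a gzip payload offline with memCompress()/memDecompress(); this sketch is my own and does not touch the network:

```r
txt <- "<html>doge</html>"

# Compress roughly the way a server sending Content-Encoding: gzip would,
# then decompress the way the gzip "encoding" hint asks the client to.
payload <- memCompress(charToRaw(txt), type = "gzip")
decoded <- rawToChar(memDecompress(payload, type = "gzip"))

identical(decoded, txt)
# TRUE
```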