Using R to "Click" a Download File Button on a Webpage

Using R to click a download file button on a webpage

Just mimic the POST it does:

library(httr)
library(rvest)
library(purrr)
library(dplyr)

POST("http://volcano.si.edu/search_eruption_results.cfm",
body = list(bp = "", `eruption_category[]` = "", `country[]` = "", polygon = "", cp = "1"),
encode = "form") -> res

content(res, as="parsed") %>%
html_nodes("div.DivTableSearch") %>%
html_nodes("div.tr") %>%
map(html_children) %>%
map(html_text) %>%
map(as.list) %>%
map_df(setNames, c("volcano_name", "subregion", "eruption_type",
"start_date", "max_vei", "X1")) %>%
select(-X1)
## # A tibble: 750 × 5
## volcano_name subregion eruption_type start_date
## <chr> <chr> <chr> <chr>
## 1 Chirinkotan Kuril Islands Confirmed Eruption 2016 Nov 29
## 2 Zhupanovsky Kamchatka Peninsula Confirmed Eruption 2016 Nov 20
## 3 Kerinci Sumatra Confirmed Eruption 2016 Nov 15
## 4 Langila New Britain Confirmed Eruption 2016 Nov 3
## 5 Cleveland Aleutian Islands Confirmed Eruption 2016 Oct 24
## 6 Ebeko Kuril Islands Confirmed Eruption 2016 Oct 20
## 7 Ulawun New Britain Confirmed Eruption 2016 Oct 11
## 8 Karymsky Kamchatka Peninsula Confirmed Eruption 2016 Oct 5
## 9 Ubinas Peru Confirmed Eruption 2016 Oct 2
## 10 Rinjani Lesser Sunda Islands Confirmed Eruption 2016 Sep 27
## # ... with 740 more rows, and 1 more variables: max_vei <chr>

I assumed the "Excel" part could be inferred, but if not:

POST("http://volcano.si.edu/search_eruption_excel.cfm", 
body = list(`eruption_category[]` = "",
`country[]` = ""),
encode = "form",
write_disk("eruptions.xls")) -> res

Download data from “click download button” on a webpage with R

I found many ways to get the data, but the easiest method was the following:

I right-clicked the download icon on the webpage and copied the link from there.
Basically, I used the download.file() function to download the data to a specific directory. Very easy.

url_link <- paste0("https://apiapex.tesouro.gov.br/aria/", "v1/sistd/custom/historico?ano=2021&idSigla=1")

download.file(url = url_link, destfile = "Downloads/file_name.xls")
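
One caveat: on Windows, download.file() defaults to a text-mode transfer, which can corrupt binary files such as .xls, so it is safer to pass mode = "wb". A small sketch reusing the same URL and then reading the file with readxl (assuming the endpoint really serves an Excel file):

library(readxl)

download.file(url = url_link, destfile = "Downloads/file_name.xls", mode = "wb")
dat <- read_excel("Downloads/file_name.xls")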

Using R to mimic “clicking” a download file button on a webpage

The content is loaded dynamically from an API call that returns JSON. You can find it in the network tab of your browser's dev tools.

The following returns that content. I find the total number of result pages, then loop over them, combining the data frame returned by each call into one final data frame containing all results.

library(jsonlite)

url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'

if (num_pages > 1) {
  # pagestart is 0-indexed and page 0 is already in df, so fetch pages 1 .. num_pages - 1
  for (i in seq_len(num_pages - 1)) {
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
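
An equivalent sketch with purrr, building each page URL directly instead of substituting a placeholder (this refetches page 0 as well and assumes, as above, that pagestart is 0-indexed):

library(purrr)

base_url <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0",
                   "?order=asc&orderby=contractcode&category=futures",
                   "&businessdatestart=20190708&businessdateend=20190708",
                   "&pageSize=250&pagestart=")

# fetch every page and row-bind the $data element of each response
df <- map_dfr(seq(0, num_pages - 1), ~ jsonlite::fromJSON(paste0(base_url, .x))$data)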

Is there an R function to click a download button?

The buttons just link to .xls files. So you could take the URL that the button points to and use it as the input to a script or function that does the downloading. There are plenty of packages, such as RCurl, that you could use to manage the download.
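
For example, a minimal sketch along those lines (the URL below is a placeholder for whatever the button's href resolves to, not a real link):

xls_url <- "https://example.com/reports/latest.xls"   # placeholder URL
download.file(xls_url, destfile = "latest.xls", mode = "wb")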

Use R to mimic clicking on a file to download it

Some libraries to help

You will mainly need dplyr, purrr, stringr, rvest, and xml2.

library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
library(htmltab)
library(xml2)
library(readxl)

I like to do it this way because some sites use partial links.

base <- "https://rigcount.bakerhughes.com"
url <- paste0(base, "/na-rig-count")

# find links
url_html <- xml2::read_html(url)
url_html %>%
  html_nodes("a") %>%
  html_attrs() %>%
  bind_rows() -> url_tbl

Check the href content and find a pattern you are interested in.
You can also use Inspect in your browser; it is truly helpful.

url_tbl %>%
  count(href)
#> # A tibble: 22 x 2
#> href n
#> <chr> <int>
#> 1 / 1
#> 2 /email-alerts 1
#> 3 /intl-rig-count 1
#> 4 /na-rig-count 1
#> 5 /rig-count-faqs 1
#> 6 /rig-count-overview 2
#> 7 #main-menu 1
#> 8 https://itunes.apple.com/app/baker-hughes-rig-counts/id393570114?mt=8 1
#> 9 https://rigcount.bakerhughes.com/static-files/4ab04723-b638-4310-afd9-… 1
#> 10 https://rigcount.bakerhughes.com/static-files/4b92b553-a48d-43a3-b4d9-… 1
#> # … with 12 more rows

At first I noticed that static-files might be a good pattern to match in href, but then I found a better one in the type attribute.

url_tbl %>%
  filter(str_detect(type, "ms-excel")) -> url_xlsx

Build our list of files (remember to clean up noise such as extra dots, spaces, and special characters in the names).
I hope someone proposes a better way to handle that.

myFiles <- pull(url_xlsx, "href")
names <- pull(url_xlsx, "title")
names(myFiles) <- paste0(
  str_replace_all(names, "[\\.\\-\\ ]", "_"),
  str_extract(names, ".\\w+$")
)
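
One possible alternative for the name cleaning, assuming the title attribute already ends with the file extension, is fs::path_sanitize(), which keeps the human-readable title and strips only characters that are illegal in file names:

library(fs)

# keep the readable title, drop only characters that are invalid in file names
names(myFiles) <- path_sanitize(pull(url_xlsx, "title"))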

# download data
myFiles %>%
  imap(
    ~ download.file(
      url = .x,
      destfile = .y,
      method = "curl",  # might not be necessary
      extra = "-k"
    )
  )
#> $`north_america_rotary_rig_count_jan_2000_-_current.xlsb`
#> [1] 0
#>
#> $`north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb`
#> [1] 0
#>
#> $`U.S. Monthly Averages by State 1992-2016.xls`
#> [1] 0
#>
#> $`North America Rotary Rig Counts through 2016.xls`
#> [1] 0
#>
#> $`U.S. Annual Averages by State 1987-2016.xls`
#> [1] 0
#>
#> $Workover_9.xls
#> [1] 0

Created on 2020-12-16 by the reprex package (v0.3.0)

Now you can read in your files.

names(myFiles) %>%
  map(
    readxlsb::read_xlsb
  ) -> myData
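
Note that read_xlsb handles only the binary .xlsb workbooks; for the plain .xls downloads in the list you would switch to readxl, for example (a sketch):

xls_files <- names(myFiles)[grepl("\\.xls$", names(myFiles))]
xls_data  <- map(xls_files, readxl::read_excel)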

I hope it helps.

Imitate click on download button from R

This site gets the data from a GraphQL API call:

POST https://comptox.epa.gov/dashboard/graphql

When you click the download button, it sends the same data to another API that formats it and returns the TSV.

You can get the data from the API and write it to a TSV file directly:

library(httr)

query <- "{
lists {
label
abbreviation
short_description
chemical_count
updated_at
}
}"

r <- POST("https://comptox.epa.gov/dashboard/graphql",
content_type("application/json"),
body = list(
query = query
), encode = "json")

data <- content(r, "parsed")

#add link column
for(i in seq_along(data$data$lists)){
data$data$lists[[i]]$acronym <- paste("https://comptox.epa.gov/dashboard/chemical_lists", data$data$lists[[i]]$abbreviation, sep="/")
}

#convert to dataframe
df <- do.call(rbind.data.frame, data$data$lists)

write.table(df, file = "chemical.tsv", row.names=FALSE, sep="\t", quote = FALSE)

