R - How to Make a Click on Webpage Using Rvest or Rcurl

R - How to make a click on webpage using rvest or rcurl

Sometimes it's better to attack the problem at the AJAX web-request level. For this site, you can use Chrome's dev tools and watch the requests. To build the table (the whole table, not just the visible part), the page makes a POST to the site with various AJAX parameters. Just replicate that, do a bit of data munging on the response, and you're good to go:

library(httr)
library(rvest)
library(dplyr)

res <- POST("http://www.tradingeconomics.com/",
            encode = "form",
            user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.50 Safari/537.36"),
            add_headers(`Referer` = "http://www.tradingeconomics.com/",
                        `X-MicrosoftAjax` = "Delta=true"),
            body = list(
              `ctl00$AjaxScriptManager1$ScriptManager1` = "ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$UpdatePanel1|ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$LinkButton1",
              `__EVENTTARGET` = "ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$LinkButton1",
              `srch-term` = "",
              `ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$GridView1$ctl01$DropDownListCountry` = "top",
              `ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$ParameterContinent` = "",
              `__ASYNCPOST` = "false"))

res_t <- content(res, as="text")
res_h <- paste0(unlist(strsplit(res_t, "\r\n"))[-1], sep="", collapse="\n")

css <- "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1"

tab <- read_html(res_h) %>%   # read_html() replaces the deprecated html()
  html_nodes(css) %>%
  html_table()

tab[[1]]$COUNTRIESWORLDAMERICAEUROPEASIAAUSTRALIAAFRICA <- NULL   # drop the mashed-together navigation column

glimpse(tab[[1]])

Another alternative would have been to use RSelenium to go to the page, click the "+" and then scrape the resultant table.
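A rough sketch of that route, assuming a Selenium server is already listening on port 4445 and that the "+" expander can be located with a CSS selector (the selector below is a placeholder; inspect the live page for the real one):

library(RSelenium)

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate("http://www.tradingeconomics.com/")

# Placeholder selector for the "+" expander -- look up the real one in the page source
plus_btn <- remDr$findElement(using = "css selector", "a.expand-table")
plus_btn$clickElement()

From there, remDr$getPageSource()[[1]] can be handed to rvest in the same way as the POST response above.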

Scrape website that requires button click

If you are not scraping a large set of data, I would suggest using Selenium. With Selenium you can actually click the button. You can get started with web scraping in R using the RSelenium package.

You can also use PhantomJS. It works like Selenium but is headless, so no browser window is required.
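A minimal sketch of the PhantomJS route via RSelenium's rsDriver() (this assumes PhantomJS is installed and on your PATH; the project is no longer actively maintained, so treat it as a legacy option, and the URL below is a placeholder):

library(RSelenium)

rD    <- rsDriver(browser = "phantomjs")   # headless: no visible browser window
remDr <- rD$client
remDr$navigate("http://example.com/page-with-button")   # placeholder URL

# ...findElement()/clickElement() as with any other Selenium session...

remDr$close()
rD$server$stop()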

I hope one of them will help.

Using rvest, is it possible to click a tab that activates a div and reveals new content for scraping

RSelenium seems to offer all the functionality needed to harvest the data of interest. The best results might be achieved by combining the strengths of RSelenium with those of rvest.
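In practice that combination usually means letting RSelenium do the clicking and then handing the rendered page over to rvest. A minimal sketch, assuming remDr is an already-connected RSelenium client, the URL is a placeholder, and the revealed content lives in an HTML table:

library(RSelenium)
library(rvest)

# remDr: an already-connected RSelenium client (see the sketches above)
remDr$navigate("http://example.com/tabbed-page")   # placeholder URL
# ...click the tab with findElement()/clickElement()...

page_src <- remDr$getPageSource()[[1]]   # rendered HTML after the click

read_html(page_src) %>%
  html_nodes("table") %>%
  html_table()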

Thanks to everyone for their comments.

Webscrape text files using R, rvest or rcurl

Here's a function that works recursively to get all the links starting with the home directory. Note that it takes a bit to run:

library(xml2)
library(magrittr)

.get_link <- function(u){
  node  <- xml2::read_html(u)
  hrefs <- xml2::xml_find_all(node, ".//a[not(contains(@href,'../'))]") %>% xml_attr("href")
  urls  <- xml2::url_absolute(hrefs, xml_url(node))
  if(!all(tools::file_ext(urls) == "txt")){
    lapply(urls, .get_link)
  } else {
    return(urls)
  }
}

What this does is basically start with a URL, read its contents, and find any links (<a> elements) using an XPath selector that says "all links whose href does not contain ../", i.e. not the back-link to the parent directory. If a link leads to more links, it loops through and collects those as well. Once we reach the final links, i.e. the .txt files, we're done.

An example, cheating a little by starting only at 2018:

a <- .get_link("https://ais.sbarc.org/logs_delimited/2018/")
> a[[1]][1:2]
[1] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-00.txt"
[2] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-01.txt"
> length(a)
[1] 365
> a[[365]][1:2]
[1] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-00.txt"
[2] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-01.txt"

What you would do is simply start with https://ais.sbarc.org/logs_delimited/ as the URL input, and then add something like data.table::fread to digest the data, which I would suggest doing in a separate iteration. Something like this works:

lapply(1:length(a), function(i){
  lapply(a[[i]], data.table::fread)
})

For reading in data...

The first thing to take notice of here is that there are 11,636 files. That's a lot of links to hit on someone's server at once, so I'm going to sample a few and show how to do it. I would suggest adding a Sys.sleep call to yours; a polite variant is sketched after the sample code below.

# This gets all the urls
a <- .get_link("https://ais.sbarc.org/logs_delimited/")
# This unlists and gives us a unique array of the urls
b <- unique(unlist(a))
# I'm sampling b, but you would just use `b` instead of `b[...]`
a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
  df <- data.table::fread(i, sep = ";") %>% as.data.frame()
  # Keeping the file path for debugging later if needed seems helpful
  df$file_path <- i
  df
}))
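As suggested above, a polite variant of the same sampling loop simply pauses between requests (the one-second pause is an arbitrary choice):

a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
  Sys.sleep(1)   # pause between requests so we don't hammer the server
  df <- data.table::fread(i, sep = ";") %>% as.data.frame()
  df$file_path <- i
  df
}))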

> a_dfs %>% head()
17:00:00:165 24 0 338179477 LAUREN SEA V8 V9 V15 V16 V17 V18 V19 V20 V21 V22 V23 file_path V1 V2 V3 V4
1 17:00:00:166 EUPHONY ACE 79 71.08 1 371618000 0 254.0 253 52 0 0 0 0 5 NA https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
2 17:00:01:607 SIMONE T BRUSCO 31 32.93 3 367593050 15 255.7 97 55 0 0 1 0 503 0 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
3 17:00:01:626 POLARIS VOYAGER 89 148.80 1 311000112 0 150.0 151 53 0 0 0 0 0 22 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
4 17:00:01:631 SPECTRE 60 25.31 1 367315630 5 265.1 511 55 0 0 1 0 2 20 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
5 17:00:01:650 KEN EI 70 73.97 1 354162000 0 269.0 269 38 0 0 0 0 1 84 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
6 17:00:02:866 HANNOVER BRIDGE 70 62.17 1 372104000 0 301.1 300 56 0 0 0 0 3 1 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
V5 V6 V7 V10 V11 V12 V13 V14 02:00:00:489 338115994 1 37 SRTG0$ 10 7 4 17:00:00:798 BROADBILL 16.84 269 18 367077090 16.3 -119.981493 34.402530 264.3 511 40
1 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA

Obviously there's some cleaning to do, but this is how you'd get to it, I think.

Edit 2

I actually like this better: read the data in, then split the strings and forcibly build the data frame:

a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
  raw        <- readLines(i)
  str_matrix <- stringi::stri_split_regex(raw, "\\;", simplify = TRUE)
  as.data.frame(apply(str_matrix, 2, function(j){
    ifelse(!nchar(j), NA, j)
  })) %>% dplyr::mutate(file_name = i)
}))

> a_dfs %>% head
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
1 09:59:57:746 STAR CARE 77 75.93 135 1 0 566341000 0 0 16.7 1 -118.839933 33.562167 321 322 50 0 0 0 0 6 19 <NA> <NA>
2 10:00:00:894 THALATTA 70 27.93 133.8 1 0 229710000 0 251 17.7 1 -119.366765 34.101742 283.9 282 55 0 0 0 0 7 <NA> <NA> <NA>
3 10:00:03:778 GULF GLORY 82 582.3 256 1 0 538007706 0 0 12.4 0 -129.345783 32.005983 87 86 54 0 0 0 0 2 20 <NA> <NA>
4 10:00:03:799 MAGPIE SW 70 68.59 123.4 1 0 352597000 0 0 10.9 0 -118.747970 33.789747 119.6 117 56 0 0 0 0 0 22 <NA> <NA>
5 10:00:09:152 CSL TECUMSEH 70 66.16 269.7 1 0 311056900 0 11 12 1 -120.846763 34.401482 105.8 106 56 0 0 0 0 6 21 <NA> <NA>
6 10:00:12:870 RANGER 85 60 31.39 117.9 1 0 367044250 0 128 0 1 -119.223133 34.162953 360 511 56 0 0 1 0 2 21 <NA> <NA>
file_name V26 V27
1 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
2 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
3 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
4 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
5 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
6 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>

Extract Links from Webpage using R

The documentation for htmlTreeParse shows one method. Here's another:

> library(XML)
> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)

(You can drop the "href" names from the returned links by passing links through as.vector.)
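For example:

> links <- as.vector(links)   # drops the "href" names, keeping just the URLs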

My previous reply:

One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> html <- paste(readLines(url), collapse="\n")
> library(stringr)
> matched <- str_match_all(html, "<a href=\"(.*?)\"")

(I guess some people might not approve of using regexp's here.)

matched is a list of matrices, one per input string in the vector html -- since that has length one here, matched has just one element. The matches for the first capture group are in column 2 of this matrix (and in general, the i-th group appears in column i + 1).

> links <- matched[[1]][, 2]
> head(links)
[1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
[2] "http://careers.stackoverflow.com"
[3] "http://meta.stackoverflow.com"
[4] "/about"
[5] "/faq"
[6] "/"

Use RCurl to bypass disclaimer page then do the web scraping

As I mention in my comment, the solution to your problem will depend entirely on the implementation of the "disclaimer page." It looks like the previous solution used cURL options defined in more detail here. Basically, it instructs cURL to provide a fake cookie file (named "nosuchfile") and then follow the header redirect given by the site you were trying to access. Apparently that site was set up in such a way that if a visitor claimed not to have the proper cookies, it would immediately redirect the visitor past the disclaimer page.
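A minimal sketch of that combination of cURL options in RCurl (the URL is a placeholder, and whether this still works depends entirely on how the site's disclaimer page behaves today):

library(RCurl)

page <- getURL("http://example.com/page-behind-disclaimer",   # placeholder URL
               cookiefile     = "nosuchfile",   # point cURL at a (non-existent) cookie jar
               followlocation = TRUE)           # follow the redirect past the disclaimer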

You didn't happen to create a file named "nosuchfile" in your working directory, did you? If not, it sounds like the target site has changed the way its disclaimer page operates. If that's the case, there's really not much help we can offer unless we can see the actual page you're trying to access.

In the example you reference in your question, they're using JavaScript to move past the disclaimer, which could be tricky to get around.

For the example you mention, however...

  1. Open it in Chrome (or Firefox with Firebug)
  2. Right click on some blank space in the page and select "Inspect Element"
  3. Click the Network tab
  4. If there's content there, click the "Clear" button at the bottom to clear out the list of requests.
  5. Accept the license agreement
  6. Watch all of the traffic that comes across the network. In my case, the top result was the interesting one. If you click it, you can preview it to verify that it is, indeed, an HTML document. If you click on the "Headers" tab under that item, it will show you the "Request URL". In my case, that was: http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?app=eINVCFundPriceDividend&pri_fund_code=U42360&data_selection=0&keyword=U42360&start_day=30&start_month=03&start_year=2012&end_day=18&end_month=04&end_year=2012&data_selection2=0

You can access that URL directly without having to accept any license agreement, either by hand or from cURL.
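For example, with RCurl and XML you could hit that Request URL directly and parse whatever tables come back (assuming the endpoint still responds the way it did then):

library(RCurl)
library(XML)

u <- paste0("http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif",
            "?app=eINVCFundPriceDividend&pri_fund_code=U42360&data_selection=0",
            "&keyword=U42360&start_day=30&start_month=03&start_year=2012",
            "&end_day=18&end_month=04&end_year=2012&data_selection2=0")

page   <- getURL(u)
tables <- readHTMLTable(htmlParse(page, asText = TRUE))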

Note that if you've already accepted the agreement, this site stores a cookie stating such which will need to be deleted in order to get back to the license agreement page. You can do this by clicking the "Resources" tab, then going to "Cookies" and deleting each one, then refreshing the URL you posted above.

How do you click an anchor tag link using rselenium and rvest in R?

UPDATE

In case you want to click on the Sign in button instead of navigating straight to its web address, you can do the whole process as follows:

page <- "https://www.glassdoor.co.in"
remDr$navigate(page)
accept_cookies_btn <- remDr$findElement(using = "xpath", '//*[@id="onetrust-accept-btn-handler"]')
accept_cookies_btn$clickElement()
signin <- remDr$findElement(using = "xpath", '//*[@id="TopNav"]/nav/div/div/div[4]/div[1]/a')
signin$clickElement()

And after this, the same as before for submitting your username and password:

username_btn <- remDr$findElement(using ="name" , "username")
username_btn$sendKeysToElement(list("add username here"))
pass_btn <- remDr$findElement(using = "name", "password")
pass_btn$sendKeysToElement(list("add password here", "\uE007"))

Note that I have added a click on one more button: the cookie-accept button.

Using R to click a download file button on a webpage

Just mimic the POST it does:

library(httr)
library(rvest)
library(purrr)
library(dplyr)

POST("http://volcano.si.edu/search_eruption_results.cfm",
body = list(bp = "", `eruption_category[]` = "", `country[]` = "", polygon = "", cp = "1"),
encode = "form") -> res

content(res, as="parsed") %>%
html_nodes("div.DivTableSearch") %>%
html_nodes("div.tr") %>%
map(html_children) %>%
map(html_text) %>%
map(as.list) %>%
map_df(setNames, c("volcano_name", "subregion", "eruption_type",
"start_date", "max_vei", "X1")) %>%
select(-X1)
## # A tibble: 750 × 5
## volcano_name subregion eruption_type start_date
## <chr> <chr> <chr> <chr>
## 1 Chirinkotan Kuril Islands Confirmed Eruption 2016 Nov 29
## 2 Zhupanovsky Kamchatka Peninsula Confirmed Eruption 2016 Nov 20
## 3 Kerinci Sumatra Confirmed Eruption 2016 Nov 15
## 4 Langila New Britain Confirmed Eruption 2016 Nov 3
## 5 Cleveland Aleutian Islands Confirmed Eruption 2016 Oct 24
## 6 Ebeko Kuril Islands Confirmed Eruption 2016 Oct 20
## 7 Ulawun New Britain Confirmed Eruption 2016 Oct 11
## 8 Karymsky Kamchatka Peninsula Confirmed Eruption 2016 Oct 5
## 9 Ubinas Peru Confirmed Eruption 2016 Oct 2
## 10 Rinjani Lesser Sunda Islands Confirmed Eruption 2016 Sep 27
## # ... with 740 more rows, and 1 more variables: max_vei <chr>

I assumed the "Excel" part could be inferred, but if not:

POST("http://volcano.si.edu/search_eruption_excel.cfm", 
body = list(`eruption_category[]` = "",
`country[]` = ""),
encode = "form",
write_disk("eruptions.xls")) -> res
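From there you could read the download back in, e.g. with readxl (assuming the endpoint really returns an Excel workbook rather than an HTML table saved with an .xls extension):

library(readxl)

eruptions <- read_excel("eruptions.xls")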

