What if I want to web scrape with R for a page with parameters?
You can use RHTMLForms
You may need to install it first:
# install.packages("RHTMLForms", repos = "http://www.omegahat.org/R")
or, under Windows, you may need
# install.packages("RHTMLForms", repos = "http://www.omegahat.org/R", type = "source")
require(RHTMLForms)
require(RCurl)
require(XML)
forms = getHTMLFormDescription("http://stoptb.org/countries/tbteam/experts.asp")
fun = createFunction(forms$sExperts)
# find experts with expertise in "Infection control: Engineering Consultant"
results <- fun(Expertise = "Infection control: Engineering Consultant")
tableData <- getNodeSet(htmlParse(results), "//*/table[@class = 'data']")
readHTMLTable(tableData[[1]])
# V1 V2 V3
#1 <NA> <NA>
#2 Name of Expert Country of Residence Email
#3 Girmay, Desalegn Ethiopia deskebede@yahoo.com
#4 IVANCHENKO, VARVARA Estonia v.ivanchenko81@mail.ru
#5 JAUCOT, Alex Belgium alex.jaucot@gmail.com
#6 Mulder, Hans Johannes Henricus Namibia hmulder@iway.na
#7 Walls, Neil Australia neil@nwalls.com
#8 Zuccotti, Thea Italy thea_zuc@yahoo.com
# V4
#1 <NA>
#2 Number of Missions
#3 0
#4 3
#5 0
#6 0
#7 0
#8 1
or create a reader to return a table
returnTable <- function(results){
tableData <- getNodeSet(htmlParse(results), "//*/table[@class = 'data']")
readHTMLTable(tableData[[1]])
}
fun = createFunction(forms$sExperts, reader = returnTable)
fun(CBased = "Bhutan") # find experts based in Bhutan
# V1 V2 V3
#1 <NA> <NA>
#2 Name of Expert Country of Residence Email
#3 Wangchuk, Lungten Bhutan drlungten@health.gov.bt
# V4
#1 <NA>
#2 Number of Missions
#3 2
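The output above still has the real header sitting in a data row. A minimal offline sketch of promoting that row to column names, assuming a table shaped like the one on the site (the HTML and the expert below are made up for illustration):

```r
library(XML)

# Hypothetical stand-in for the page returned by the form
html <- paste0(
  '<table class="data">',
  '<tr><td>Name of Expert</td><td>Country of Residence</td></tr>',
  '<tr><td>Doe, Jane</td><td>Bhutan</td></tr>',
  '</table>'
)

doc  <- htmlParse(html, asText = TRUE)
node <- getNodeSet(doc, "//table[@class = 'data']")[[1]]

# Read every row as data, then move the first row into the column names
tab <- readHTMLTable(node, header = FALSE, stringsAsFactors = FALSE)
names(tab) <- unlist(tab[1, ])
tab <- tab[-1, , drop = FALSE]
rownames(tab) <- NULL
tab
```

On the real page you may also need to drop the empty first row before promoting the header; check the parsed result first.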
Web Scraping in R to extract data from multiple pages
Observations:
If you scroll down the page you will see there is an option to request more results in a batch. In this particular case, setting to the maximum batch size returns all results in one go.
Monitoring the web traffic shows no additional traffic when requesting more results, meaning the data is present in the original response.
Searching the page source for the last symbol reveals that all the items are pre-loaded in a script tag. Examining the JavaScript source files shows the instructions for pushing new batches onto the page based on various input params.
Solution:
You can simply extract the JavaScript object from the script tag and parse it as JSON. Convert the list of lists to a dataframe, then add a constructed url based on a common base string + symbol.
TODO:
- Up to you if you wish to update header names
- You may wish to format the numeric column to match webpage format
R:
library(rvest)
library(jsonlite)
library(tidyverse)
r <- read_html('https://stockanalysis.com/stocks/') %>%
html_element('#__NEXT_DATA__') %>%
html_text() %>%
jsonlite::parse_json()
df <- map_df(r$props$pageProps$stocks, ~ .x)%>%
mutate(url = paste0('https://stockanalysis.com/stocks/', s))
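If you want to try the script-tag approach without hitting the site, here is a small offline sketch. The embedded JSON, the two stocks and the example.com base url are all made up; only the `#__NEXT_DATA__` selector and the `props$pageProps$stocks` path come from the code above.

```r
library(rvest)
library(jsonlite)

# Hypothetical page: two made-up stocks embedded in a __NEXT_DATA__ script tag
html <- paste0(
  '<html><body><script id="__NEXT_DATA__" type="application/json">',
  '{"props":{"pageProps":{"stocks":[',
  '{"s":"AAA","n":"Alpha Corp"},{"s":"BBB","n":"Beta Inc"}]}}}',
  '</script></body></html>'
)

# Pull the script text and parse it as JSON
payload <- read_html(html) %>%
  html_element("#__NEXT_DATA__") %>%
  html_text() %>%
  fromJSON()

# fromJSON() simplifies an array of objects into a data frame
stocks <- payload$props$pageProps$stocks
stocks$url <- paste0("https://example.com/stocks/", stocks$s)
stocks
```

Here `fromJSON()` replaces `parse_json()` plus `map_df()`, because it already returns a data frame for an array of flat objects.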
R - web scraping dynamic form with inputs
Here is a solution using RSelenium for downloading data for
- State = Andhra Pradesh
- District = Adilabad
- Tahsil = Mancherial
- Tables = Average Size of Operational Holding by Size Group
The remaining fields use the default input parameters.
library(RSelenium)
library(XML)
library(magrittr)
# Start Selenium Server --------------------------------------------------------
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
# Simulate browser session and fill out form -----------------------------------
remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '1a']")$clickElement()
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = '19']")$clickElement()
remDrv$findElement(using = "xpath",
"//option[@value = '33']")$clickElement()
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
# Click submit
remDrv$findElement(using = "xpath",
"//input[@value = 'Submit']")$clickElement()
# Retrieve and download results ------------------------------------------------
table <- remDrv$getPageSource()[[1]] %>%
htmlParse %>%
readHTMLTable %>%
extract2(4)
remDrv$quit()
remDrv$closeServer()
head(table)
# V1 V2 V3
# 1 SI No. Size of Holding(in ha.) Institutional Holdings
# 2 (1) (2) (3)
# 3 1 MARGINAL 0
# 4 2 SMALL 0
# 5 3 SEMIMEDIUM 0
# 6 4 MEDIUM 0
However, the static solution above only answers part of your question, namely how to fill out the web form using R.
The tricky thing about your web page is that the values in the different drop-down menus depend on each other.
Below, you will find a solution which takes those dependencies into account, so you do not need to know the respective district and tehsil IDs upfront.
The code below downloads data for
- State = GOA
- Tables = Average Size of Operational Holding by Size Group
including all districts and all tehsils. I used GOA as the primary anchor but you can easily select another state of your choice.
library(RSelenium)
library(XML)
library(dplyr)
library(magrittr)
# Start Selenium Server --------------------------------------------------------
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
# Simulate browser session and fill out form -----------------------------------
remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
# Select 27a == GOA as the anchor
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '27a']")$clickElement()
# Select 4 == Average Size of Operational Holding by Size Group
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
# Get all district IDs and the respective names belonging to GOA
district_IDs <- remDrv$findElements(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
lapply(function(x){x$getElementAttribute('value')}) %>%
unlist
district_names <- remDrv$findElements(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
lapply(function(x){x$getElementText()}) %>%
unlist
# Retrieve and download results ------------------------------------------------
result <- data.frame(district = character(), tehsil = character(),
V1 = character(), V2 = character(), V3 = character())
for (i in seq_along(district_IDs)) {
remDrv$findElement(using = "xpath",
paste0("//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = ",
"'", district_IDs[i], "']"))$clickElement()
Sys.sleep(2)
# Get all tehsil IDs and names from the currently selected district
tehsil_IDs <- remDrv$findElements(using = "xpath",
"//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
lapply(function(x){x$getElementAttribute('value')}) %>%
unlist
tehsil_names <- remDrv$findElements(using = "xpath",
"//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
lapply(function(x){x$getElementText()}) %>%
unlist
for (j in seq_along(tehsil_IDs)) {
remDrv$findElement(using = "xpath",
paste0("//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option[@value = ",
"'", tehsil_IDs[j], "']"))$clickElement()
Sys.sleep(2)
# Click submit and download data of the selected tehsil
remDrv$findElement(using = "xpath",
"//input[@value = 'Submit']")$clickElement()
Sys.sleep(2)
# Download data for current tehsil
tehsil_data <- remDrv$getPageSource()[[1]] %>%
htmlParse %>%
readHTMLTable %>%
extract2(4) %>%
extract(c(-1, -2), )
result <- data.frame(district = district_names[i], tehsil = tehsil_names[j],
tehsil_data) %>% rbind(result, .)
remDrv$goBack()
Sys.sleep(2)
}
}
remDrv$quit()
remDrv$closeServer()
result %<>% as_data_frame %>%
rename(
si_no = V1,
holding_size = V2,
inst_holdings = V3
) %>%
mutate(
si_no = as.numeric(as.character(si_no)),
inst_holdings = as.numeric(as.character(inst_holdings))
)
dim(result)
# [1] 66 5
head(result)
# district tehsil si_no holding_size inst_holdings
# 1 NORTH GOA ponda 1 MARGINAL 0.34
# 2 NORTH GOA ponda 2 SMALL 0.00
# 3 NORTH GOA ponda 3 SEMIMEDIUM 2.50
# 4 NORTH GOA ponda 4 MEDIUM 0.00
# 5 NORTH GOA ponda 5 LARGE 182.64
# 6 NORTH GOA ponda 6 ALL SIZE CLASS 41.09
tail(result)
# district tehsil si_no holding_size inst_holdings
# 1 SOUTH GOA quepem 1 MARGINAL 0.30
# 2 SOUTH GOA quepem 2 SMALL 0.00
# 3 SOUTH GOA quepem 3 SEMIMEDIUM 0.00
# 4 SOUTH GOA quepem 4 MEDIUM 0.00
# 5 SOUTH GOA quepem 5 LARGE 23.50
# 6 SOUTH GOA quepem 6 ALL SIZE CLASS 15.77
RSelenium even supports headless browsing leveraging PhantomJS as described in this vignette.
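As a side note, the option IDs and names can also be read from the page source in one pass with XPath instead of one WebDriver call per element. A minimal offline sketch, assuming a drop-down shaped like the district list (the option values "1" and "2" are made up; in a live session the HTML would come from remDrv$getPageSource()[[1]]):

```r
library(XML)

# Hypothetical stand-in for the district drop-down in the page source
html <- paste0(
  '<select name="_ctl0:ContentPlaceHolder1:DropDownList9">',
  '<option value="1">NORTH GOA</option>',
  '<option value="2">SOUTH GOA</option>',
  '</select>'
)

doc <- htmlParse(html, asText = TRUE)
xp  <- "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option"

# One data frame with the value attribute and the visible text of each option
districts <- data.frame(
  id   = xpathSApply(doc, xp, xmlGetAttr, "value"),
  name = xpathSApply(doc, xp, xmlValue),
  stringsAsFactors = FALSE
)
districts
```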
Web-Scraping with R
You picked a tough problem to learn on.
This site uses JavaScript to load the article information. In other words, the link loads a set of scripts which run when the page loads to grab the information (from a database, probably) and insert it into the DOM. htmlParse(...) just grabs the base HTML and parses that. So the links you want are simply not present.
AFAIK the only way around this is to use the RSelenium package. This package essentially allows you to pass the base HTML through what looks like a browser simulator, which does run the scripts. The problem with RSelenium is that you need not only to download the package, but also a "Selenium Server". This link has a nice introduction to RSelenium.
Once you've done that, inspection of the source in a browser shows that the article links are all in the href attribute of anchor tags which have class=doclink. This is straightforward to extract using XPath. NEVER NEVER NEVER use regex to parse XML.
library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer() # download Selenium Server, if not already present
startServer() # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open() # open connection
remDr$navigate(url) # grab and process the page (including scripts)
doc <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
# [7] "http://www.calcharge.org/2014/07/"
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
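The XPath step can be tried without a Selenium server. A small sketch on a made-up fragment (the example.com links are hypothetical), using xpathSApply() from the XML package as an alternative to the doc[...] indexing above:

```r
library(XML)

# Hypothetical fragment: two doclink anchors and one anchor of another class
html <- paste0(
  '<div>',
  '<a class="doclink" href="http://example.com/a">A</a>',
  '<a class="other" href="http://example.com/skip">skip</a>',
  '<a class="doclink" href="http://example.com/b">B</a>',
  '</div>'
)

doc <- htmlParse(html, asText = TRUE)

# Select only anchors with class="doclink" and read their href attribute
links <- xpathSApply(doc, '//a[@class="doclink"]', xmlGetAttr, "href")
links
```

Note that `@class="doclink"` matches the attribute exactly, so an element with several classes would need `contains(@class, "doclink")` instead.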
Web-scraping data from pages with forms
RSelenium, decapitated and splashr all introduce third-party dependencies which can be difficult to set up and maintain.
No browser instrumentation is required here so no need for RSelenium. decapitated won't really help much either and splashr is a bit overkill for this use-case.
The form you see on the site is a proxy to a Solr database. Open up Developer Tools in your browser on that URL, hit refresh, and look at the XHR section of the Network tab. You'll see it makes asynchronous requests on load and with each form interaction.
All we have to do is mimic those interactions. The source below is heavily annotated and you might want to step through it manually to see what's going on under the hood.
We'll need some helpers:
library(xml2)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(tidyverse)
Most of ^^ get loaded anyway when you load rvest but I like being explicit. Also, stringr is an unnecessary crutch for the far more explicit-in-operation named stringi functions, so we'll use them.
First, we get the list of sites. This function mimics the POST request you hopefully saw when you took the advice to use Developer Tools to see what's going on:
get_list_of_sites <- function() {
# This is the POST request the site makes to get the metadata for the popups.
# I used http://gitlab.com/hrbrmstr/curlconverter to untangle the monstrosity
httr::POST(
url = "http://www.neotroptree.info/data/sys/scripts/solrform/solrproxy.php",
body = list(
q = "*%3A*",
host = "padme.rbge.org.uk",
c = "neotroptree",
template = "countries.tpl",
datasetid = "",
f = "facet.field%3Dcountry_s%26facet.field%3Dstate_s%26facet.field%3Ddomain_s%26facet.field%3Dsitename_s"
),
encode = "form"
) -> res
httr::stop_for_status(res)
# extract the returned JSON from the HTML document it returns
xdat <- jsonlite::fromJSON(html_text(content(res, encoding="UTF-8")))
# only return the site list (the xdat structure had a lot more in it tho)
discard(xdat$facets$sitename_s, stri_detect_regex, "^[[:digit:]]+$")
}
We'll call that below but it just returns a character vector of site names.
Now we need a function to get the site data returned in the lower portion of the form output. This is doing the same thing as above except it adds in the ability to take a site to download and where it should store the file. overwrite is handy since you may be doing a lot of downloads and try to download the same file again. Since we're using httr::write_disk() to save the file, setting this parameter to FALSE will cause an exception and stop any loop/iteration you've got. You likely don't want that.
get_site <- function(site, dl_path, overwrite=TRUE) {
# this is the POST request the site makes as an XHR request so we just
# mimic it with httr::POST. We pass in the site code in `q`
httr::POST(
url = "http://www.neotroptree.info/data/sys/scripts/solrform/solrproxy.php",
body = list(
q = sprintf('sitename_s:"%s"', curl::curl_escape(site)),
host = "padme.rbge.org.uk",
c = "neotroptree",
template = "countries.tpl",
datasetid = "",
f = "facet.field%3Dcountry_s%26facet.field%3Dstate_s%26facet.field%3Ddomain_s%26facet.field%3Dsitename_s"
),
encode = "form"
) -> res
httr::stop_for_status(res)
# it returns a JSON structure
xdat <- httr::content(res, as="text", encoding="UTF-8")
xdat <- jsonlite::fromJSON(xdat)
# unfortunately the bit with the site-id is in HTML O_o
# so we have to parse that bit out of the returned JSON
site_meta <- xml2::read_html(xdat$docs)
# now, extract the link code
link <- html_attr(html_node(site_meta, "div.solrlink"), "data-linkparams")
link <- stri_replace_first_regex(link, "code_s:", "")
# Download the file and get the filename metadata back
xret <- get_link(link, dl_path, overwrite) # the code for this is below
# add the site name
xret$site <- site
# return the list
xret[c("code", "site", "path")]
}
I put the code for retrieving the file into a separate function since it seemed to make sense to encapsulate this functionality. YMMV. I took the liberty of removing the nonsensical , in filenames as well.
get_link <- function(code, dl_path, overwrite=TRUE) {
# The Download link looks like this:
#
# <a href="http://www.neotroptree.info/projectfiles/downloadsitedetails.php?siteid=AtlMG104">
# Download site details.
# </a>
#
# So we can mimic that with httr
site_tmpl <- "http://www.neotroptree.info/projectfiles/downloadsitedetails.php?siteid=%s"
dl_url <- sprintf(site_tmpl, code)
# The filename comes in a "Content-Disposition" header so we first
# do a lightweight HEAD request to get the filename
res <- httr::HEAD(dl_url)
httr::stop_for_status(res)
stri_replace_all_regex(
res$headers["content-disposition"],
'^attachment; filename="|"$', ""
) -> fil_name
# commas in filenames are a bad idea rly
fil_name <- stri_replace_all_fixed(fil_name, ",", "-")
message("Saving ", code, " to ", file.path(dl_path, fil_name))
# Then we use httr::write_disk() to do the saving in a full GET request
res <- httr::GET(
url = dl_url,
httr::write_disk(
path = file.path(dl_path, fil_name),
overwrite = overwrite
)
)
httr::stop_for_status(res)
# return a list so we can make a data frame
list(
code = code,
path = file.path(dl_path, fil_name)
)
}
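The filename clean-up inside get_link() is easy to check in isolation. A minimal sketch with a made-up header value (the real one comes from res$headers), using base gsub() in place of the stringi calls so it runs with no packages loaded:

```r
# Hypothetical Content-Disposition value; not taken from the real server
hdr <- 'attachment; filename="site,details,AB-12.zip"'

# Strip the leading 'attachment; filename="' and the trailing quote
fil_name <- gsub('^attachment; filename="|"$', "", hdr)

# Replace commas, as get_link() does
fil_name <- gsub(",", "-", fil_name, fixed = TRUE)
fil_name
```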
Now, we get the list of sites (as promised):
# get the site list
sites <- get_list_of_sites()
length(sites)
## [1] 7484
head(sites)
## [1] "Abadia, cerrado"
## [2] "Abadia, floresta semidecídua"
## [3] "Abadiânia, cerrado"
## [4] "Abaetetuba, Rio Urubueua, floresta inundável de maré"
## [5] "Abaeté, cerrado"
## [6] "Abaeté, floresta ripícola"
We'll grab one site ZIP file:
# get one site link dl
get_site(sites[1], "/tmp")
## $code
## [1] "CerMG044"
##
## $site
## [1] "Abadia, cerrado"
##
## $path
## [1] "/tmp/neotroptree-CerMG04426-09-2018.zip"
Now, get a few more and return a data frame with code, site and save path:
# get a few (remove [1:2] to do them all but PLEASE ADD A Sys.sleep(5) into get_link() if you do!)
map_df(sites[1:2], get_site, dl_path = "/tmp")
## # A tibble: 2 x 3
## code site path
## <chr> <chr> <chr>
## 1 CerMG044 Abadia, cerrado /tmp/neotroptree-CerMG04426-09-20…
## 2 AtlMG104 Abadia, floresta semidecídua /tmp/neotroptree-AtlMG10426-09-20…
Please heed the guidance to add a Sys.sleep(5) into get_link() if you're going to do a mass download. CPU, memory and bandwidth aren't free and it's likely that site didn't really scale the server to meet a barrage of ~8,000 back-to-back multi-HTTP request call sequences with file downloads at the end of them.
How to automate multiple requests to a web search form using R (Java function calls / trigger)
The information to recreate the calculator is given on the webpage. For example, to calculate the CVD 10 year risk for a male:
cvdRiskmale <- function(age, SBP, treated, smoke, dia, HDL, TC){
eSum <- (log(age)*3.06117 +treated*1.99881*log(SBP) +(1-treated)*1.93303*log(SBP))
eSum <- eSum + (smoke*0.65451 +dia*0.57367 -0.93263*log(HDL) + 1.12370*log(TC) )
1-0.88936^exp(eSum - 23.9802)
}
> cvdRiskmale(35, 125, 0,0, 0, 45, 180)
[1] 0.02638287
> cvdRiskmale(50, 115, 0,1, 1, 45, 180)
[1] 0.2067156
Compare these with the calculator's results for the same options.
A similar function can be defined for females given the regression coefficients listed on the website.
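Since every operation in the function is vectorised, the "multiple requests" part of the question reduces to passing columns instead of single values. A short sketch that repeats the function from above and applies it to the two example patients (the data frame layout is just an assumption for illustration):

```r
# Same function as above, repeated so the block runs on its own
cvdRiskmale <- function(age, SBP, treated, smoke, dia, HDL, TC) {
  eSum <- log(age) * 3.06117 + treated * 1.99881 * log(SBP) +
    (1 - treated) * 1.93303 * log(SBP)
  eSum <- eSum + smoke * 0.65451 + dia * 0.57367 -
    0.93263 * log(HDL) + 1.12370 * log(TC)
  1 - 0.88936^exp(eSum - 23.9802)
}

# One row per patient; these are the two cases shown above
patients <- data.frame(
  age = c(35, 50), SBP = c(125, 115), treated = c(0, 0),
  smoke = c(0, 1), dia = c(0, 1), HDL = c(45, 45), TC = c(180, 180)
)

# A single call computes the risk for every row
patients$risk <- with(patients, cvdRiskmale(age, SBP, treated, smoke, dia, HDL, TC))
patients$risk
```

The two values should match the console output above (about 0.0264 and 0.2067).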
Web scraping in R with Selenium to click new pages
The page does a POST request that you can mimic/simplify. To keep dynamic you need to first grab an api key and application id from a source js file, then pass those in the subsequent POST request.
In the following I simply extract the urls from each request. I set the querystring for the POST to have the max of 20 results per page. After an initial request, in which I retrieve the number of pages, I then map a function across the page numbers, extracting urls from the POST response for each, altering the page param each time.
You end up with a list of urls for all the projects, which you can then visit to extract info from or, potentially, make xmlhttp requests to.
N.B. The code could be refactored a little to tidy it up.
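The key-extraction step can be tested on a string before touching the real js file. A minimal sketch using base regexec() instead of stringr, on a made-up snippet (the id and key below are placeholders, and the real file may be minified differently):

```r
# Hypothetical fragment of the bundled js file; both values are placeholders
js <- 'var t={};t.algoliaClient=r.default("APPID123","KEY456abc");'

# Two lazy capture groups: application id first, api key second
pattern <- 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"'
m <- regmatches(js, regexec(pattern, js, perl = TRUE))[[1]]

application_id <- m[2]  # m[1] holds the full match
api_key <- m[3]
c(application_id, api_key)
```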
library(httr)
library(stringr)
library(purrr)
library(tidyverse)
get_df <- function(x){
df <- map_dfr(x, .f = as_tibble) %>% select(c('url')) %>% unique() %>%
mutate( url = paste0('https://es.gofundme.com/f/', url))
return(df)
}
r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
application_id <- matches[[1]][,2]
api_key <-matches[[1]][,3]
headers = c(
'User-Agent' = 'Mozilla/5.0',
'content-type' = 'application/x-www-form-urlencoded',
'Referer' = 'https://es.gofundme.com/'
)
params = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
'x-algolia-api-key' = api_key,
'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
# pages are zero-indexed and page 0 was fetched above, so request 1..(num_pages - 1)
pages <- seq_len(num_pages - 1)
df2 <- map_dfr(pages, function(page_num){
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
temp_df <-get_df(res$results[[1]]$hits)
}
)
df <- rbind(df, df2)
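The fetch-first-page-then-map pattern can be dry-run offline. In this sketch fetch_page() is a hypothetical stand-in for the POST (it fabricates 45 project slugs at 20 per page); only the shape of the loop is the point. Keep in mind that in R `1:n - 1` parses as `(1:n) - 1`, so seq_len() is a safer way to build the remaining page numbers.

```r
# Hypothetical stand-in for the POST request: returns the page count and hits
fetch_page <- function(page_num, per_page = 20, total = 45) {
  first <- page_num * per_page + 1
  last  <- min((page_num + 1) * per_page, total)
  list(
    nbPages = ceiling(total / per_page),
    hits = data.frame(url = paste0("project-", first:last),
                      stringsAsFactors = FALSE)
  )
}

# Initial request: page 0 gives both the first hits and the page count
first_res <- fetch_page(0)
num_pages <- first_res$nbPages

# Remaining zero-indexed pages are 1..(num_pages - 1)
rest <- lapply(seq_len(num_pages - 1), function(p) fetch_page(p)$hits)

# Bind everything once at the end
all_hits <- do.call(rbind, c(list(first_res$hits), rest))
nrow(all_hits)
```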
R Web Scraping Multiple Levels of a Website
The code below will get you all the links to each professor's page. From there, you can map each link to another set of rvest calls using purrr's map_df or map functions.
Most importantly, giving credit where it's due @hrbrmstr:
R web scraping across multiple pages
The linked answer is subtly different in that it's mapping across a set of numbers, as opposed to mapping across a vector of URL's like in the code below.
library(rvest)
library(purrr)
library(stringr)
library(dplyr)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
names <- url %>%
html_nodes(".sppb-addon-content") %>%
html_nodes("strong") %>%
html_text()
#extract the names
names <- names[-c(3,4)]
#drop the head of department and blank space
names <- names %>%
tolower() %>%
str_extract_all("[:alnum:]+") %>%
sapply(paste, collapse = "-")
#create a list of names separated by dashes, should be identical to link names
content <- url %>%
html_nodes(".sppb-addon-content") %>%
html_text()
content <- content[! content %in% "+"]
#drop the "+" from the content
content_names <- data.frame(prof_name = names, content = content)
#make a df with the content and the names, note the prof_name column is the same as below
#this allows for joining later on
links <- url %>%
html_nodes(".sppb-addon-content") %>%
html_nodes("strong") %>%
html_nodes("a") %>%
html_attr("href")
#create a vector of href links
url_base <- "https://science.kln.ac.lk%s"
urls <- sprintf(url_base, links)
#create a vector of urls for the professor's pages
prof_info <- map_df(urls, function(x) {
#create an anonymous function to pull the data
prof_name <- gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
#extract the prof's name from the url
page <- read_html(x)
#read each page in the urls vector
sections <- page %>%
html_nodes(".sppb-panel-title") %>%
html_text()
#extract the section title
info <- page %>%
html_nodes(".sppb-panel-body") %>%
html_nodes(".sppb-addon-content") %>%
html_text()
#extract the info from each section
data.frame(sections = sections, info = info, prof_name = prof_name)
#create a dataframe with the section titles as the column headers and the
#info as the data in the columns
})
#note this returns a dataframe. Change map_df to map if you want a list
#of tibbles instead
prof_info <- inner_join(content_names, prof_info, by = "prof_name")
#joining the content from the first page to all the individual pages
Not sure this is the cleanest or most efficient way to do this, but I think this is what you're after.
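The join only works if the slugs built from the names match the slugs in the urls, so it is worth checking that step on its own. A base R sketch with made-up names and sections (no stringr or dplyr needed):

```r
# Made-up names as they might appear on the staff page
prof_names <- c("Prof. Jane Doe", "Dr. A. B. Silva")

# Lower-case, keep alphanumeric runs, join them with dashes
lowered <- tolower(prof_names)
tokens  <- regmatches(lowered, gregexpr("[[:alnum:]]+", lowered))
slugs   <- vapply(tokens, paste, character(1), collapse = "-")

# Hypothetical front-page and detail-page data sharing the slug as key
front  <- data.frame(prof_name = slugs, content = c("Bio 1", "Bio 2"))
detail <- data.frame(
  prof_name = c("dr-a-b-silva", "prof-jane-doe", "dr-a-b-silva"),
  sections  = c("Research", "Teaching", "Teaching")
)

# merge() is the base R equivalent of the inner_join() above
joined <- merge(front, detail, by = "prof_name")
joined
```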
Parsing Web page with R
Learn to use the web developer tools in your web browser (hint: Use Chrome or Firefox).
Learn about HTTP GET and HTTP POST requests.
Notice the search box sends a POST request.
See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}:YES and {actionForm.keyWord}:my search string).
Construct a POST request using one of the R http packages with that form data in.
Hope the server doesn't care about the cookies, if it does, get the cookies and feed it cookies.
Hence you end up using postForm from the RCurl package:
p = postForm(url, .params=list(checkValidRequest="YES", keyword="finance"))
And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
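To make "form data" less abstract: a form POST body is just url-encoded key=value pairs joined with &. A base R sketch that builds such a body by hand from the parameter names seen in the developer tools (whether the server expects exactly these names is an assumption; postForm normally does the encoding for you):

```r
# Parameter names as shown in the browser's Form Data panel
form <- c(
  "{actionForm.checkValidRequest}" = "YES",
  "{actionForm.keyWord}" = "my search string"
)

# Percent-encode one string, including reserved characters
encode <- function(x) {
  vapply(x, utils::URLencode, character(1), reserved = TRUE, USE.NAMES = FALSE)
}

# key=value pairs joined with "&"
body <- paste(encode(names(form)), encode(form), sep = "=", collapse = "&")
body
```

Browsers usually write spaces as + in form bodies; %20 is the percent-encoded equivalent.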
Basically, a web request is more than just a URL, there's all this other conversation going on between the browser and the server involving form parameters, cookies, sometimes there's AJAX requests going on internally to the web page updating parts.
There's a lot of "I can't scrape this site" questions on SO, and although we could spoonfeed you the precise answer to this exact problem, I do feel the world would be better served if we just told you to go learn about the HTTP protocol, and Forms, and Cookies, and then you'll understand how to use the tools better.
Note that I've never seen a job site or a financial site that likes you scraping its content. Although I can't see a warning about it on this site, that doesn't mean it's not there, and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.