What If I Want to Web Scrape with R for a Page with Parameters


You can use the RHTMLForms package.

You may need to install it first:

# install.packages("RHTMLForms", repos = "http://www.omegahat.org/R")

or, under Windows, you may need:

# install.packages("RHTMLForms", repos = "http://www.omegahat.org/R", type = "source")

require(RHTMLForms)
require(RCurl)
require(XML)
forms = getHTMLFormDescription("http://stoptb.org/countries/tbteam/experts.asp")
fun = createFunction(forms$sExperts)
# find experts with expertise in "Infection control: Engineering Consultant"
results <- fun(Expertise = "Infection control: Engineering Consultant")

tableData <- getNodeSet(htmlParse(results), "//*/table[@class = 'data']")
readHTMLTable(tableData[[1]])

# V1 V2 V3
#1 <NA> <NA>
#2 Name of Expert Country of Residence Email
#3 Girmay, Desalegn Ethiopia deskebede@yahoo.com
#4 IVANCHENKO, VARVARA Estonia v.ivanchenko81@mail.ru
#5 JAUCOT, Alex Belgium alex.jaucot@gmail.com
#6 Mulder, Hans Johannes Henricus Namibia hmulder@iway.na
#7 Walls, Neil Australia neil@nwalls.com
#8 Zuccotti, Thea Italy thea_zuc@yahoo.com
# V4
#1 <NA>
#2 Number of Missions
#3 0
#4 3
#5 0
#6 0
#7 0
#8 1

Or create a reader to return a table:

returnTable <- function(results) {
  tableData <- getNodeSet(htmlParse(results), "//*/table[@class = 'data']")
  readHTMLTable(tableData[[1]])
}
fun = createFunction(forms$sExperts, reader = returnTable)
fun(CBased = "Bhutan") # find experts based in Bhutan
# V1 V2 V3
#1 <NA> <NA>
#2 Name of Expert Country of Residence Email
#3 Wangchuk, Lungten Bhutan drlungten@health.gov.bt
# V4
#1 <NA>
#2 Number of Missions
#3 2

Web Scraping in R to extract data from multiple pages

Observations:

If you scroll down the page you will see there is an option to request more results in a batch. In this particular case, setting the batch size to its maximum returns all results in one go.

Monitoring the web traffic shows no additional traffic when requesting more results, meaning the data is present in the original response.

Doing a search of the page source for the last symbol reveals that all the items are pre-loaded in a script tag. Examining the JavaScript source files shows the instructions for pushing new batches onto the page based on various input params.


Solution:

You can simply extract the JavaScript object from the script tag and parse it as JSON. Convert the list of lists to a dataframe, then add a constructed url based on the common base string plus the symbol.


TODO:

  1. Up to you if you wish to update header names
  2. You may wish to format the numeric column to match webpage format

R:

library(rvest)
library(jsonlite)
library(tidyverse)

r <- read_html('https://stockanalysis.com/stocks/') %>%
  html_element('#__NEXT_DATA__') %>%
  html_text() %>%
  jsonlite::parse_json()

df <- map_df(r$props$pageProps$stocks, ~ .x) %>%
  mutate(url = paste0('https://stockanalysis.com/stocks/', s))

Sample output: a data frame of ticker symbols (s) and the constructed urls (shown as an image in the original answer).

R - web scraping dynamic form with inputs

Here is a solution using RSelenium for downloading data for

  1. State = Andhra Pradesh
  2. District = Adilabad
  3. Tehsil = Mancherial
  4. Tables = Average Size of Operational Holding by Size Group

The remaining fields use the default input parameters.

library(RSelenium)
library(XML)
library(magrittr)

# Start Selenium Server --------------------------------------------------------

checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()

# Simulate browser session and fill out form -----------------------------------

remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '1a']")$clickElement()
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = '19']")$clickElement()
remDrv$findElement(using = "xpath",
"//option[@value = '33']")$clickElement()
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
# Click submit
remDrv$findElement(using = "xpath",
"//input[@value = 'Submit']")$clickElement()

# Retrieve and download results ------------------------------------------------

table <- remDrv$getPageSource()[[1]] %>%
  htmlParse %>%
  readHTMLTable %>%
  extract2(4)

remDrv$quit()
remDrv$closeServer()

head(table)

# V1 V2 V3
# 1 SI No. Size of Holding(in ha.) Institutional Holdings
# 2 (1) (2) (3)
# 3 1 MARGINAL 0
# 4 2 SMALL 0
# 5 3 SEMIMEDIUM 0
# 6 4 MEDIUM 0

However, the static solution above only answers part of your question, namely how to fill out the web form using R.

The tricky thing on your web page is that the values in the different drop-down menus depend on each other.

Below, you will find a solution which takes those dependencies into account without requiring you to know the respective district and tehsil IDs upfront.

The code below downloads data for

  1. State = GOA
  2. Tables = Average Size of Operational Holding by Size Group

including all districts and all tehsils. I used GOA as the primary anchor but you can easily select another state of your choice.

library(RSelenium)
library(XML)
library(dplyr)
library(magrittr)

# Start Selenium Server --------------------------------------------------------

checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()

# Simulate browser session and fill out form -----------------------------------

remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')

# Select 27a == GOA as the anchor
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '27a']")$clickElement()
# Select 4 == Average Size of Operational Holding by Size Group
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()

# Get all district IDs and the respective names belonging to GOA
district_IDs <- remDrv$findElements(using = "xpath",
  "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
  lapply(function(x){x$getElementAttribute('value')}) %>%
  unlist

district_names <- remDrv$findElements(using = "xpath",
  "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
  lapply(function(x){x$getElementText()}) %>%
  unlist

# Retrieve and download results ------------------------------------------------

result <- data.frame(district = character(), tehsil = character(),
                     V1 = character(), V2 = character(), V3 = character())

for (i in seq_along(district_IDs)) {

  remDrv$findElement(using = "xpath",
    paste0("//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = ",
           "'", district_IDs[i], "']"))$clickElement()
  Sys.sleep(2)

  # Get all tehsil IDs and names from the currently selected district
  tehsil_IDs <- remDrv$findElements(using = "xpath",
    "//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
    lapply(function(x){x$getElementAttribute('value')}) %>%
    unlist

  tehsil_names <- remDrv$findElements(using = "xpath",
    "//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
    lapply(function(x){x$getElementText()}) %>%
    unlist

  for (j in seq_along(tehsil_IDs)) {

    remDrv$findElement(using = "xpath",
      paste0("//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option[@value = ",
             "'", tehsil_IDs[j], "']"))$clickElement()
    Sys.sleep(2)

    # Click submit and download data of the selected tehsil
    remDrv$findElement(using = "xpath",
      "//input[@value = 'Submit']")$clickElement()
    Sys.sleep(2)

    # Download data for current tehsil
    tehsil_data <- remDrv$getPageSource()[[1]] %>%
      htmlParse %>%
      readHTMLTable %>%
      extract2(4) %>%
      extract(c(-1, -2), )

    result <- data.frame(district = district_names[i], tehsil = tehsil_names[j],
                         tehsil_data) %>%
      rbind(result, .)

    remDrv$goBack()
    Sys.sleep(2)
  }
}

remDrv$quit()
remDrv$closeServer()

result %<>% as_data_frame %>%
  rename(
    si_no = V1,
    holding_size = V2,
    inst_holdings = V3
  ) %>%
  mutate(
    si_no = as.numeric(as.character(si_no)),
    inst_holdings = as.numeric(as.character(inst_holdings))
  )

dim(result)
# [1] 66 5

head(result)
# district tehsil si_no holding_size inst_holdings
# 1 NORTH GOA ponda 1 MARGINAL 0.34
# 2 NORTH GOA ponda 2 SMALL 0.00
# 3 NORTH GOA ponda 3 SEMIMEDIUM 2.50
# 4 NORTH GOA ponda 4 MEDIUM 0.00
# 5 NORTH GOA ponda 5 LARGE 182.64
# 6 NORTH GOA ponda 6 ALL SIZE CLASS 41.09

tail(result)
# district tehsil si_no holding_size inst_holdings
# 1 SOUTH GOA quepem 1 MARGINAL 0.30
# 2 SOUTH GOA quepem 2 SMALL 0.00
# 3 SOUTH GOA quepem 3 SEMIMEDIUM 0.00
# 4 SOUTH GOA quepem 4 MEDIUM 0.00
# 5 SOUTH GOA quepem 5 LARGE 23.50
# 6 SOUTH GOA quepem 6 ALL SIZE CLASS 15.77

RSelenium even supports headless browsing leveraging PhantomJS, as described in its vignettes.
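A minimal sketch of a headless session, assuming a PhantomJS binary is available locally (rsDriver() will try to fetch one via wdman); the URL is just the census form used above:

library(RSelenium)

# start PhantomJS headlessly; rsDriver() returns both the server and an
# already-opened client
driver <- rsDriver(browser = "phantomjs", verbose = FALSE)
remDrv <- driver$client

remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
remDrv$getTitle()

remDrv$close()
driver$server$stop()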

Web-Scraping with R

You picked a tough problem to learn on.

This site uses JavaScript to load the article information. In other words, the link loads a set of scripts which run when the page loads to grab the information (from a database, probably) and insert it into the DOM. htmlParse(...) just grabs the base HTML and parses that, so the links you want are simply not present.

AFAIK the only way around this is to use the RSelenium package. This package essentially allows you to pass the base HTML through what looks like a browser simulator, which does run the scripts. The problem with RSelenium is that you need not only to download the package, but also a "Selenium Server". The RSelenium vignettes give a nice introduction to the package.

Once you've done that, inspection of the source in a browser shows that the article links are all in the href attribute of anchor tags which have class=doclink. This is straightforward to extract using XPath. NEVER NEVER NEVER use regex to parse XML.

library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer() # download Selenium Server, if not already present
startServer() # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open() # open connection
remDr$navigate(url) # grab and process the page (including scripts)
doc <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
# [7] "http://www.calcharge.org/2014/07/"
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"

Web-scraping data from pages with forms

RSelenium, decapitated and splashr all introduce third-party dependencies which can be difficult to set up and maintain.

No browser instrumentation is required here, so there is no need for RSelenium. decapitated won't really help much either, and splashr is a bit of overkill for this use case.

The form you see on the site is a proxy to a Solr database. Open up Developer Tools in your browser on that URL, hit refresh, and look at the XHR entries in the Network tab. You'll see the page makes asynchronous requests on load and with each form interaction.

All we have to do is mimic those interactions. The source below is heavily annotated and you might want to step through it manually to see what's going on under the hood.

We'll need some helpers:

library(xml2)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(tidyverse)

Most of ^^ get loaded anyway when you load rvest, but I like being explicit. Also, stringr is an unnecessary crutch for the far more explicitly named stringi functions, so we'll use those.

First, we get the list of sites. This function mimics the POST request you hopefully saw when you took the advice to use Developer Tools to see what's going on:

get_list_of_sites <- function() {

  # This is the POST request the site makes to get the metadata for the popups.
  # I used http://gitlab.com/hrbrmstr/curlconverter to untangle the monstrosity
  httr::POST(
    url = "http://www.neotroptree.info/data/sys/scripts/solrform/solrproxy.php",
    body = list(
      q = "*%3A*",
      host = "padme.rbge.org.uk",
      c = "neotroptree",
      template = "countries.tpl",
      datasetid = "",
      f = "facet.field%3Dcountry_s%26facet.field%3Dstate_s%26facet.field%3Ddomain_s%26facet.field%3Dsitename_s"
    ),
    encode = "form"
  ) -> res

  httr::stop_for_status(res)

  # extract the returned JSON from the HTML document it returns
  xdat <- jsonlite::fromJSON(html_text(content(res, encoding="UTF-8")))

  # only return the site list (the xdat structure has a lot more in it, though)
  discard(xdat$facets$sitename_s, stri_detect_regex, "^[[:digit:]]+$")

}

We'll call that below but it just returns a character vector of site names.

Now we need a function to get the site data returned in the lower portion of the form output. This is doing the same thing as above except that it takes the site to download and the directory where the file should be stored. overwrite is handy since you may be doing a lot of downloads and try to download the same file again. Since we're using httr::write_disk() to save the file, setting this parameter to FALSE will cause an exception and stop any loop/iteration you've got. You likely don't want that.

get_site <- function(site, dl_path, overwrite=TRUE) {

  # this is the POST request the site makes as an XHR request so we just
  # mimic it with httr::POST. We pass in the site code in `q`

  httr::POST(
    url = "http://www.neotroptree.info/data/sys/scripts/solrform/solrproxy.php",
    body = list(
      q = sprintf('sitename_s:"%s"', curl::curl_escape(site)),
      host = "padme.rbge.org.uk",
      c = "neotroptree",
      template = "countries.tpl",
      datasetid = "",
      f = "facet.field%3Dcountry_s%26facet.field%3Dstate_s%26facet.field%3Ddomain_s%26facet.field%3Dsitename_s"
    ),
    encode = "form"
  ) -> res

  httr::stop_for_status(res)

  # it returns a JSON structure
  xdat <- httr::content(res, as="text", encoding="UTF-8")
  xdat <- jsonlite::fromJSON(xdat)

  # unfortunately the bit with the site-id is in HTML O_o
  # so we have to parse that bit out of the returned JSON
  site_meta <- xml2::read_html(xdat$docs)

  # now, extract the link code
  link <- html_attr(html_node(site_meta, "div.solrlink"), "data-linkparams")
  link <- stri_replace_first_regex(link, "code_s:", "")

  # Download the file and get the filename metadata back
  xret <- get_link(link, dl_path) # the code for this is below

  # add the site name
  xret$site <- site

  # return the list
  xret[c("code", "site", "path")]

}

I put the code for retrieving the file into a separate function since it seemed to make sense to encapsulate that functionality. YMMV. I took the liberty of removing the nonsensical commas in filenames as well.

get_link <- function(code, dl_path, overwrite=TRUE) {

  # The Download link looks like this:
  #
  # <a href="http://www.neotroptree.info/projectfiles/downloadsitedetails.php?siteid=AtlMG104">
  # Download site details.
  # </a>
  #
  # So we can mimic that with httr

  site_tmpl <- "http://www.neotroptree.info/projectfiles/downloadsitedetails.php?siteid=%s"
  dl_url <- sprintf(site_tmpl, code)

  # The filename comes in a "Content-Disposition" header so we first
  # do a lightweight HEAD request to get the filename

  res <- httr::HEAD(dl_url)
  httr::stop_for_status(res)

  stri_replace_all_regex(
    res$headers["content-disposition"],
    '^attachment; filename="|"$', ""
  ) -> fil_name

  # commas in filenames are a bad idea rly
  fil_name <- stri_replace_all_fixed(fil_name, ",", "-")

  message("Saving ", code, " to ", file.path(dl_path, fil_name))

  # Then we use httr::write_disk() to do the saving in a full GET request
  res <- httr::GET(
    url = dl_url,
    httr::write_disk(
      path = file.path(dl_path, fil_name),
      overwrite = overwrite
    )
  )

  httr::stop_for_status(res)

  # return a list so we can make a data frame
  list(
    code = code,
    path = file.path(dl_path, fil_name)
  )

}

Now, we get the list of sites (as promised):

# get the site list
sites <- get_list_of_sites()

length(sites)
## [1] 7484

head(sites)
## [1] "Abadia, cerrado"
## [2] "Abadia, floresta semidecídua"
## [3] "Abadiânia, cerrado"
## [4] "Abaetetuba, Rio Urubueua, floresta inundável de maré"
## [5] "Abaeté, cerrado"
## [6] "Abaeté, floresta ripícola"

We'll grab one site ZIP file:

# get one site link dl
get_site(sites[1], "/tmp")
## $code
## [1] "CerMG044"
##
## $site
## [1] "Abadia, cerrado"
##
## $path
## [1] "/tmp/neotroptree-CerMG04426-09-2018.zip"

Now, get a few more and return a data frame with code, site and save path:

# get a few (remove [1:2] to do them all but PLEASE ADD A Sys.sleep(5) into get_link() if you do!)
map_df(sites[1:2], get_site, dl_path = "/tmp")
## # A tibble: 2 x 3
## code site path
## <chr> <chr> <chr>
## 1 CerMG044 Abadia, cerrado /tmp/neotroptree-CerMG04426-09-20…
## 2 AtlMG104 Abadia, floresta semidecídua /tmp/neotroptree-AtlMG10426-09-20…

Please heed the guidance to add a Sys.sleep(5) into get_link() if you're going to do a mass download. CPU, memory and bandwidth aren't free, and it's likely that site didn't really scale the server to meet a barrage of ~8,000 back-to-back multi-request HTTP call sequences with file downloads at the end of them.
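One light-touch way to honour that without editing get_link() itself is a small wrapper that sleeps after each download; a sketch:

# polite wrapper: pause after every download (5 seconds, as suggested above)
get_site_politely <- function(site, dl_path) {
  res <- get_site(site, dl_path)
  Sys.sleep(5)
  res
}

# e.g. to fetch everything (slowly, and at your own risk):
# all_files <- map_df(sites, get_site_politely, dl_path = "/tmp")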

How to automate multiple requests to a web search form using R (Java function calls / trigger)

The information to recreate the calculator is given on the webpage. For example, to calculate the 10-year CVD risk for a male:

cvdRiskmale <- function(age, SBP, treated, smoke, dia, HDL, TC){
  eSum <- log(age)*3.06117 + treated*1.99881*log(SBP) + (1-treated)*1.93303*log(SBP)
  eSum <- eSum + smoke*0.65451 + dia*0.57367 - 0.93263*log(HDL) + 1.12370*log(TC)
  1 - 0.88936^exp(eSum - 23.9802)
}

> cvdRiskmale(35, 125, 0,0, 0, 45, 180)
[1] 0.02638287

> cvdRiskmale(50, 115, 0,1, 1, 45, 180)
[1] 0.2067156

Compare with the online calculator using the same options.

A similar function can be defined for females given the regression coefficients listed on the website.
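Because cvdRiskmale() is built from vectorised arithmetic, you can also score a whole table of inputs at once instead of submitting the web form repeatedly. A minimal sketch, using made-up illustrative values:

# hypothetical patient data, purely for illustration
patients <- data.frame(
  age     = c(35, 50, 62),
  SBP     = c(125, 115, 140),
  treated = c(0, 0, 1),
  smoke   = c(0, 1, 0),
  dia     = c(0, 1, 0),
  HDL     = c(45, 45, 38),
  TC      = c(180, 180, 210)
)

patients$risk <- with(patients, cvdRiskmale(age, SBP, treated, smoke, dia, HDL, TC))
patients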

Web scraping in R with Selenium to click new pages

The page does a POST request that you can mimic/simplify. To keep it dynamic, you first need to grab an API key and application ID from a source JS file, then pass those in the subsequent POST requests.

In the following I simply extract the urls from each request. I set the querystring for the POST to request the maximum of 20 results per page. After an initial request, in which I retrieve the number of pages, I map a function across the remaining page numbers, altering the page param and extracting the urls from each POST response.

You end up with a list of urls for all the projects, which you can then visit to extract info from (see the short rvest sketch after the code) or, potentially, make XHR requests to.

N.B. The code could be refactored a little as a tidy-up.

library(httr)
library(stringr)
library(purrr)
library(tidyverse)

get_df <- function(x){
  df <- map_dfr(x, .f = as_tibble) %>%
    select(c('url')) %>%
    unique() %>%
    mutate(url = paste0('https://es.gofundme.com/f/', url))
  return(df)
}

r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')

matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')

application_id <- matches[[1]][,2]
api_key <-matches[[1]][,3]

headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'content-type' = 'application/x-www-form-urlencoded',
  'Referer' = 'https://es.gofundme.com/'
)

params = list(
  'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
  'x-algolia-api-key' = api_key,
  'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
pages <- seq_len(num_pages - 1) # remaining pages (page 0 was already fetched above)

df2 <- map_dfr(pages, function(page_num){
  data <- paste0(post_body, page_num, '"}]}')
  res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
  get_df(res$results[[1]]$hits)
})

df <- rbind(df, df2)
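As mentioned above, you can then visit the collected project pages to extract further info. A hedged follow-up sketch (the selector just grabs the page title as a sanity check; swap in whichever elements you actually need):

library(rvest)

# visit the first collected project page and pull its <title>
page <- read_html(df$url[1])
page %>% html_element("title") %>% html_text()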

R Web Scraping Multiple Levels of a Website

The code below will get you all the links to each professor's page. From there, you can map each link to another set of rvest calls using purrr's map_df or map functions.

Most importantly, giving credit where it's due @hrbrmstr:
R web scraping across multiple pages

The linked answer is subtly different in that it's mapping across a set of numbers, as opposed to mapping across a vector of URLs as in the code below.

library(rvest)
library(purrr)
library(stringr)
library(dplyr)

url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")

names <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_text()
#extract the names

names <- names[-c(3,4)]
#drop the head of department and blank space

names <- names %>%
  tolower() %>%
  str_extract_all("[:alnum:]+") %>%
  sapply(paste, collapse = "-")
#create a list of names separated by dashes, should be identical to link names

content <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_text()

content <- content[! content %in% "+"]
#drop the "+" from the content

content_names <- data.frame(prof_name = names, content = content)
#make a df with the content and the names, note the prof_name column is the same as below
#this allows for joining later on

links <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_nodes("a") %>%
  html_attr("href")
#create a vector of href links

url_base <- "https://science.kln.ac.lk%s"
urls <- sprintf(url_base, links)
#create a vector of urls for the professor's pages

prof_info <- map_df(urls, function(x) {
  #create an anonymous function to pull the data

  prof_name <- gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
  #extract the prof's name from the url

  page <- read_html(x)
  #read each page in the urls vector

  sections <- page %>%
    html_nodes(".sppb-panel-title") %>%
    html_text()
  #extract the section title

  info <- page %>%
    html_nodes(".sppb-panel-body") %>%
    html_nodes(".sppb-addon-content") %>%
    html_text()
  #extract the info from each section

  data.frame(sections = sections, info = info, prof_name = prof_name)
  #create a dataframe with the section titles as the column headers and the
  #info as the data in the columns

})
#note this returns a dataframe. Change map_df to map if you want a list
#of tibbles instead

prof_info <- inner_join(content_names, prof_info, by = "prof_name")
#joining the content from the first page to all the individual pages

Not sure this is the cleanest or most efficient way to do this, but I think this is what you're after.

Parsing Web page with R

Learn to use the web developer tools in your web browser (hint: Use Chrome or Firefox).

Learn about HTTP GET and HTTP POST requests.

Notice the search box sends a POST request.

See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}: YES and {actionForm.keyWord}: your search string).

Construct a POST request using one of the R http packages with that form data in.

Hope the server doesn't care about the cookies, if it does, get the cookies and feed it cookies.

Hence you end up using postForm from the RCurl package:

p <- postForm(url, .params = list(checkValidRequest = "YES", keyword = "finance"))

And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
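For reference, a hedged sketch of the same request with httr, using the raw field names observed in the form data (url here stands for the search endpoint seen in the developer tools, and the table extraction assumes the results come back as plain HTML):

library(httr)
library(XML)

res <- POST(
  url,  # the search endpoint observed in the browser's developer tools
  body = list(
    `{actionForm.checkValidRequest}` = "YES",
    `{actionForm.keyWord}` = "finance"
  ),
  encode = "form"
)

# parse whatever tables come back in the response
doc <- htmlParse(content(res, as = "text"), asText = TRUE)
tables <- readHTMLTable(doc)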

Basically, a web request is more than just a URL: there's a whole conversation going on between the browser and the server involving form parameters and cookies, and sometimes there are AJAX requests running inside the web page updating parts of it.

There are a lot of "I can't scrape this site" questions on SO, and although we could spoonfeed you the precise answer to this exact problem, I do feel the world would be better served if we just told you to go learn about the HTTP protocol, and forms, and cookies, and then you'll understand how to use the tools better.

Note I've never seen a job site or a financial site that likes you scraping its content - although I can't see a warning about it on this site, that doesn't mean it's not there, and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.


