Submit form with no submit button in rvest
Here's a dirty hack that works for me. After studying the submit_form source code, I figured that I could work around the problem by injecting a fake submit button into my copy of the form object, which submit_form would then use. It works, except that it gives a warning that often names an irrelevant input object (not in the example below, though). Despite the warning, the code works for me:
session <- html_session("www.chase.com")
form <- html_form(session)[[3]]
# Form on home page has no submit button,
# so inject a fake submit button or else rvest cannot submit it.
# When I do this, rvest gives a warning "Submitting with '___'", where "___" is
# often an irrelevant field item.
# This warning might be an rvest (version 0.3.2) bug, but the code works.
fake_submit_button <- list(name = NULL,
type = "submit",
value = NULL,
checked = NULL,
disabled = NULL,
readonly = NULL,
required = FALSE)
attr(fake_submit_button, "class") <- "input"
form[["fields"]][["submit"]] <- fake_submit_button
user_name <- "user"
usr_password <- "password"
filledform <- set_values(form, `user_name` = user_name, `usr_password` = usr_password)
session <- submit_form(session, filledform)
The successful result displays the following warning, which I simply ignore:
> Submitting with 'submit'
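If you need this in more than one place, the injection can be wrapped in a small helper. This is only a sketch: add_fake_submit and has_submit are names I made up, and the toy form below is a plain list that imitates the field layout shown above (rvest 0.3.x), not a real html_form() result.

```r
# Hypothetical helper: append a fake submit button to a form-like list.
# Assumes the rvest 0.3.x layout, where form$fields is a list of "input" objects.
add_fake_submit <- function(form, name = "submit") {
  button <- structure(
    list(name = NULL, type = "submit", value = NULL, checked = NULL,
         disabled = NULL, readonly = NULL, required = FALSE),
    class = "input"
  )
  form[["fields"]][[name]] <- button
  form
}

# Hypothetical check: does any field have type "submit"?
has_submit <- function(form) {
  any(vapply(form$fields,
             function(x) identical(tolower(x$type), "submit"),
             logical(1)))
}

# Toy form with a single text field and no submit button.
form <- list(fields = list(
  user_name = structure(list(name = "user_name", type = "text", value = ""),
                        class = "input")
))
patched <- add_fake_submit(form)

print(has_submit(form))     # FALSE
print(has_submit(patched))  # TRUE
```

With a real form you would call add_fake_submit(form) before set_values() and submit_form().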
Submit POST form when rvest doesn't recognize submit button
The problem is that some of the inputs lack the type attribute, and rvest does not check for this properly.
To illustrate the problem:
library(httr)
library(rvest)
#> Loading required package: xml2
sess <- html_session("http://www1.biznet.hr/HgkWeb/do/extlogon")
search_page <- sess %>%
follow_link(1)
#> Navigating to /HgkWeb/do/extlogon;jsessionid=88295900F3F932C85A25BB18F326BE28
form <- html_form(search_page)[[6]]
fill_form <- set_values(form, 'clanica.cla_oib' = '94989605030')
Some of the fields do not have the type attribute:
sapply(fill_form$fields, function(x) '['(x, 'type'))
#> $clanica.limitSearchToActiveCompany.type
#> [1] "radio"
#>
#> $clanica.limitSearchToActiveCompany.type
#> [1] "radio"
#>
#> $joinBy.useInnerJoin.type
#> [1] "checkbox"
#>
#> $nazivTvrtke.type
#> [1] "text"
#>
#> $nazivZapocinjeSaPredanomVrijednoscu.type
#> [1] "checkbox"
#>
#> $clanica.cla_jmbp.type
#> [1] "text"
#>
#> $clanica.cla_mbs.type
#> [1] "text"
#>
#> $clanica.cla_oib.type
#> [1] "text"
#>
#> $asTextKomoraId.NA
#> NULL
#>
#> $clanica.asTextOpc_id.NA
#> NULL
#>
#> $clanica.cla_opcina.type
#> [1] "hidden"
#>
#> $clanica.asTextNas_id.NA
#> NULL
#>
#> $clanica.cla_naselje.type
#> [1] "hidden"
#>
#> $clanica.pos_id.NA
#> NULL
#>
#> $clanica.postaNaziv.type
#> [1] "hidden"
#>
#> $clanica.cla_ulica.type
#> [1] "text"
#>
#> $clanica.asTextDatumUpisaFrom.type
#> [1] "text"
#>
#> $clanica.asTextDatumUpisaTo.type
#> [1] "text"
#>
#> $clanica.asTextDatumGasenjaFrom.type
#> [1] "text"
#>
#> $clanica.asTextDatumGasenjaTo.type
#> [1] "text"
#>
#> $clanica.asTextUdr_id.NA
#> NULL
#>
#> $clanica.asTextVel_id.NA
#> NULL
#>
#> $nkd2007.type
#> [1] "text"
#>
#> $nkd2007PretrazivanjePoGlavnojDjelatnosti.type
#> [1] "radio"
#>
#> $nkd2007PretrazivanjePoGlavnojDjelatnosti.type
#> [1] "radio"
#>
#> $submit.type
#> [1] "submit"
#>
#> $org.apache.struts.taglib.html.CANCEL.type
#> [1] "submit"
#>
#> $orderBy.order1.NA
#> NULL
#>
#> $orderBy.order2.NA
#> NULL
#>
#> $limit.type
#> [1] "text"
#>
#> $searchForRowCount.type
#> [1] "checkbox"
#>
#> $joinBy.gfiGodina.NA
#> NULL
#>
#> $joinBy.gfiBrojZaposlenihFrom.type
#> [1] "text"
#>
#> $joinBy.gfiBrojZaposlenihTo.type
#> [1] "text"
#>
#> $joinBy.gfiUkupniPrihodFrom.type
#> [1] "text"
#>
#> $joinBy.gfiUkupniPrihodTo.type
#> [1] "text"
This messes up the internal function submit_request, specifically the Filter() call in it.
It's referenced here, and a fix is proposed in this PR, but the PR has been open since July 2016, so don't hold your breath.
The fix in the PR basically checks whether a type attribute is present:
# form.R, row 280
is_submit <- function(x) 'type' %in% names(x) &&
tolower(x$type) %in% c("submit", "image", "button")
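To see why the missing type matters, here is a minimal base-R sketch with made-up fields. The field names a, b, c and both helper names are illustrative, not part of rvest: is_submit_unchecked imitates the unguarded check and is_submit_checked imitates the guarded one above. For a field without a type, the unguarded check returns a zero-length logical, unlist() inside Filter() drops it, and every later position shifts by one.

```r
# Toy fields; "b" has no type element, like the NULL entries in the output above.
fields <- list(
  a = list(name = "a", type = "text"),
  b = list(name = "b"),
  c = list(name = "c", type = "submit")
)

submit_types <- c("submit", "image", "button")

# Unguarded check: returns logical(0) when x$type is NULL.
is_submit_unchecked <- function(x) tolower(x$type) %in% submit_types

# Guarded check: always returns a single TRUE/FALSE.
is_submit_checked <- function(x) {
  "type" %in% names(x) && tolower(x$type) %in% submit_types
}

# Three fields go in, but only two flags come out.
flags <- unlist(lapply(fields, is_submit_unchecked))
print(length(flags))

# Filter() indexes the fields with the shortened vector, so it picks "b".
picked_unchecked <- names(Filter(is_submit_unchecked, fields))
picked_checked   <- names(Filter(is_submit_checked, fields))
print(picked_unchecked)  # "b"
print(picked_checked)    # "c"
```

This shift would also explain why rvest sometimes reports submitting with an unrelated field.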
For a quick fix, you can change the data you have, replacing each NULL type with a placeholder type such as 'text':
fill_form$fields <- lapply(fill_form$fields, function(x) {
null_type = is.null(x$type)
if (null_type) x$type = 'text'
x
})
firma_i <- submit_form(search_page, fill_form, submit = 'submit')
firma_i
#> <session> http://www1.biznet.hr/HgkWeb/do/fullSearchPost
#> Status: 200
#> Type: text/html;charset=UTF-8
#> Size: 4366
Created on 2018-08-27 by the reprex package (v0.2.0).
rvest: how to submit form when input doesn't have a name?
Modifying empty fields
You can access and modify a field with an empty name directly by using the field's index, for example like this:
pgform$fields[[2]]$value <- 'Paris'
If you want to find the index of the field dynamically by its type, you could do it like this:
# identical() returns FALSE (instead of failing) when a field has no type
for (i in seq_along(pgform$fields))
  if (is.null(pgform$fields[[i]]$name) && identical(pgform$fields[[i]]$type, 'text'))
    pgform$fields[[i]]$value <- 'Paris'
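The same lookup can also be written without a loop. This is a sketch on a toy field list, not a real html_form() result, and unnamed_text_fields is a hypothetical helper name:

```r
# Hypothetical helper: positions of fields that have no name and type "text".
unnamed_text_fields <- function(fields) {
  which(vapply(
    fields,
    function(f) is.null(f$name) && identical(f$type, "text"),
    logical(1)
  ))
}

# Toy fields imitating pgform$fields: one named, one unnamed text, one without type.
toy_fields <- list(
  list(name = "q", type = "text", value = ""),
  list(name = NULL, type = "text", value = ""),
  list(name = NULL, value = "")
)

idx <- unnamed_text_fields(toy_fields)
for (i in idx) toy_fields[[i]]$value <- "Paris"

print(idx)                    # 2
print(toy_fields[[2]]$value)  # "Paris"
```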
Your specific problem
For your specific website, the above will not give you the expected results. The field you need to modify to submit a query is q, so you would want to do something like this:
session <- html_session('https://www.tripadvisor.com/')
pgform <- html_form(session)[[1]]
pgform <- set_values(pgform, q = 'Paris')
result <- submit_form(session, pgform)
This will load the desired page for you but will not provide the content you are probably looking for, as that content is only loaded dynamically by the browser using an XMLHttpRequest. To also get the content, you would instead need to do something like this:
session <- html_session('https://www.tripadvisor.com/')
pgform <- html_form(session)[[1]]
pgform <- set_values(pgform, q = 'Paris')
result <- submit_form(session, pgform, submit = NULL, httr::add_headers('x-requested-with' = 'XMLHttpRequest'))
That will give you the content without the surrounding page structure.
rvest: button with no name (submit_form)
It is a bug as of the rvest 0.3.2 version. Specifically submit_form calls submit_request:
submit_request <- function(form, submit = NULL) {
is_submit <- function(x) tolower(x$type) %in% c("submit", "image","button")
submits <- Filter(is_submit, form$fields)
...
When x$type has length 0 (for example, it is NULL because the input has no type attribute), %in% does not behave as expected: it returns a zero-length logical vector. Long story short, the resulting boolean vector is shorter than the original form$fields list.
I wrote my own submit_request like this:
submit_request <- function(form, submit = NULL) {
# check the length first so && short-circuits before %in% sees an empty type
is_submit <- function(x) length(x$type) > 0 && tolower(x$type) %in% c("submit", "image", "button")
submits <- Filter(is_submit, form$fields)
...
I suggest copying the necessary functions from the source code and fixing them as shown above until a new stable version is released.
Best Regards
CA
How to submit a form that seems to be handled by JavaScript using httr or rvest?
We first need to get the original search page, since this is a SharePoint site (or acts like one) and we need some hidden form fields to use later on:
library(httr)
library(rvest)
library(tidyverse)
pre_pg <- read_html("https://mdocweb.state.mi.us/otis2/otis2.aspx")
setNames(
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
str(hidden)
## Named chr [1:3] "x62pLbphYWUDXsdoNdBBNrxqyHHI+K06BzjFwdP3Uooafgey2uG1gLWxzh07djRxiQR724uplZFAI8klbq6HCSkmrp8jP15EMwvkDM/biUEuQrf"| __truncated__ ...
## - attr(*, "names")= chr [1:3] "__VIEWSTATE" "__VIEWSTATEGENERATOR" "__EVENTVALIDATION"
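Since the same hidden-field grab is needed again for every results page, it can be pulled into a function. The sketch below runs against a made-up inline HTML snippet rather than the live site; hidden_fields is a hypothetical helper name and the field values are placeholders.

```r
library(rvest)

# Hypothetical helper: named character vector of all hidden inputs on a page.
hidden_fields <- function(page) {
  nodes <- html_nodes(page, "input[type='hidden']")
  setNames(html_attr(nodes, "value"), html_attr(nodes, "name"))
}

# Made-up form with two hidden inputs and one visible text input.
toy_page <- read_html(paste0(
  '<form>',
  '<input type="hidden" name="__VIEWSTATE" value="abc"/>',
  '<input type="hidden" name="__EVENTVALIDATION" value="xyz"/>',
  '<input type="text" name="txtboxLName"/>',
  '</form>'
))

hidden <- hidden_fields(toy_page)
print(names(hidden))
print(hidden[["__VIEWSTATE"]])  # "abc"
```

Note that hidden[["__VIEWSTATE"]] returns the bare string, while hidden["__VIEWSTATE"] keeps the name attached.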
Now, we need to act like the form and use HTTP POST to submit it:
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
txtboxLName = "Smith",
txtboxFName = "",
txtboxMDOCNum = "",
drpdwnGender = "Either",
drpdwnRace = "All",
txtboxAge = "",
drpdwnStatus = "All",
txtboxMarks = "",
btnSearch = "Search"
),
encode = "form"
) -> res
We're going to need this helper function in a minute:
mcga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}
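For a feel of what the helper does, here is the same function with each step commented, applied to a few made-up header strings (the inputs are illustrative, not taken from the site):

```r
# Same helper as above, with each step annotated.
mcga <- function(x) {
  x <- tolower(x)                               # lower-case everything
  x <- gsub("[[:punct:][:space:]]+", "_", x)    # punctuation/space runs -> "_"
  x <- gsub("_+", "_", x)                       # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                   # strip leading/trailing "_"
  make.unique(x, sep = "_")                     # de-duplicate repeated names
}

cleaned <- mcga(c("Offender Number", " MCL Number: ", "Date", "Date"))
print(cleaned)
# "offender_number" "mcl_number" "date" "date_1"
```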
Now, we need the HTML from the results page:
pg <- content(res, as="parsed")
Unfortunately, the "table" is really a set of <div>s. But it's programmatically generated and pretty uniform. We don't want to type much, so let's first get the column names we'll be using later on:
col_names <- html_nodes(pg, "a.headings") %>% html_text(trim=TRUE) %>% mcga()
## [1] "offender_number" "last_name" "first_name"
## [4] "date_of_birth" "sex" "race"
## [7] "mcl_number" "location" "status"
## [10] "parole_board_jurisdiction_date" "maximum_date" "date_paroled"
The site is pretty nice in that it accommodates folks with disabilities by providing screen-reader hints. Unfortunately, this puts a kink in scraping, since we would either have to be verbose in targeting the tags with values or clean up text later on. Thankfully, the xml2 package now has the ability to remove nodes:
xml_find_all(pg, ".//div[@class='screenReaderOnly']") %>% xml_remove()
xml_find_all(pg, ".//span[@class='visible-phone']") %>% xml_remove()
We can now collect all the offender record <div> "rows":
records <- html_nodes(pg, "div.offenderRow")
And, succinctly get them into a data frame:
map(sprintf(".//div[@class='span1 searchCol%s']", 1:12), ~{
html_nodes(records, xpath=.x) %>% html_text(trim=TRUE)
}) %>%
set_names(col_names) %>%
bind_cols() %>%
readr::type_convert() -> xdf
xdf
## # A tibble: 25 x 12
## offender_number last_name first_name date_of_birth sex race mcl_number location status
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 544429 SMITH AARICK 12/03/1967 M White 333.74012D3 Gladwin Parole
## 2 210262 SMITH AARON 05/27/1972 M Black <NA> <NA> Dischrg
## 3 372965 SMITH AARON 09/16/1973 M White <NA> <NA> Dischrg
## 4 413411 SMITH AARON 07/13/1973 M Black <NA> <NA> Dischrg
## 5 618210 SMITH AARON 10/12/1984 M Black <NA> <NA> Dischrg
## 6 675823 SMITH AARON 05/19/1989 M Black 333.74032A5 Det Lahser Prob Prob
## 7 759548 SMITH AARON 06/19/1990 M Black <NA> <NA> Dischrg
## 8 763189 SMITH AARON 07/15/1976 M White 333.74032A5 Mt. Pleasant Prob
## 9 854557 SMITH AARON 12/27/1973 M White <NA> <NA> Dischrg
## 10 856804 SMITH AARON 02/24/1989 M White 750.110A2 Harrison CF Prison
## # ... with 15 more rows, and 3 more variables: parole_board_jurisdiction_date <chr>, maximum_date <chr>,
## # date_paroled <chr>
glimpse(xdf)
## Observations: 25
## Variables: 12
## $ offender_number <int> 544429, 210262, 372965, 413411, 618210, 675823, 759548, 763189, 854557, 85...
## $ last_name <chr> "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "S...
## $ first_name <chr> "AARICK", "AARON", "AARON", "AARON", "AARON", "AARON", "AARON", "AARON", "...
## $ date_of_birth <chr> "12/03/1967", "05/27/1972", "09/16/1973", "07/13/1973", "10/12/1984", "05/...
## $ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",...
## $ race <chr> "White", "Black", "White", "Black", "Black", "Black", "Black", "White", "W...
## $ mcl_number <chr> "333.74012D3", NA, NA, NA, NA, "333.74032A5", NA, "333.74032A5", NA, "750....
## $ location <chr> "Gladwin", NA, NA, NA, NA, "Det Lahser Prob", NA, "Mt. Pleasant", NA, "Har...
## $ status <chr> "Parole", "Dischrg", "Dischrg", "Dischrg", "Dischrg", "Prob", "Dischrg", "...
## $ parole_board_jurisdiction_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "11/28/2024", "03/25/2016", NA, NA, NA...
## $ maximum_date <chr> NA, "09/03/2015", "06/29/2016", "10/02/2017", "05/19/2017", "07/18/2019", ...
## $ date_paroled <chr> "11/15/2016", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
I had hoped type_convert would provide better transforms, especially for the date column(s), but it didn't, so it can likely be eliminated.
Now, you'll need to do some more work with the results page, since the results are paginated. Thankfully, you know the page info:
xml_integer(html_nodes(pg, "span#lblPgCurrent"))
## [1] 1
xml_integer(html_nodes(pg, "span#lblTotalPgs"))
## [1] 101
You'll have to do the "hidden" dance again:
html_nodes(pg, "input[type='hidden']")
(follow the approach above for what to do with that) and rejigger a new POST call that has only those hidden fields and one more form element: btnNext = 'Next'. You'll need to repeat this over all the individual pages in the paginated result set, then finally bind_rows() everything.
I should add that as you figure out the pagination workflow, start with a fresh blank search page grab. The SharePoint server seems to be configured with a pretty small viewstate session cache timeout, and code will break if you wait too long between iterations.
UPDATE
I kinda wanted to make sure that last bit of advice worked so there's this:
library(httr)
library(rvest)
library(tidyverse)
mcga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}
start_search <- function(last_name) {
pre_pg <- read_html("https://mdocweb.state.mi.us/otis2/otis2.aspx")
setNames(
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
txtboxLName = last_name,
txtboxFName = "",
txtboxMDOCNum = "",
drpdwnGender = "Either",
drpdwnRace = "All",
txtboxAge = "",
drpdwnStatus = "All",
txtboxMarks = "",
btnSearch = "Search"
),
encode = "form"
) -> res
content(res, as="parsed")
}
extract_results <- function(results_pg) {
col_names <- html_nodes(results_pg, "a.headings") %>% html_text(trim=TRUE) %>% mcga()
xml_find_all(results_pg, ".//div[@class='screenReaderOnly']") %>% xml_remove()
xml_find_all(results_pg, ".//span[@class='visible-phone']") %>% xml_remove()
records <- html_nodes(results_pg, "div.offenderRow")
map(sprintf(".//div[@class='span1 searchCol%s']", 1:12), ~{
html_nodes(records, xpath=.x) %>% html_text(trim=TRUE)
}) %>%
set_names(col_names) %>%
bind_cols()
}
current_page_number <- function(results_pg) {
xml_integer(html_nodes(results_pg, "span#lblPgCurrent"))
}
last_page_number <- function(results_pg) {
xml_integer(html_nodes(results_pg, "span#lblTotalPgs"))
}
scrape_status <- function(results_pg) {
cur <- current_page_number(results_pg)
tot <- last_page_number(results_pg)
message(sprintf("%s of %s", cur, tot))
}
next_page <- function(results_pg) {
cur <- current_page_number(results_pg)
tot <- last_page_number(results_pg)
if (cur == tot) return(NULL)
setNames(
html_nodes(results_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(results_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = hidden["__EVENTTARGET"],
`__EVENTARGUMENT` = hidden["__EVENTARGUMENT"],
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
btnNext = 'Next'
),
encode = "form"
) -> res
content(res, as="parsed")
}
curr_pg <- start_search("smith")
results_df <- extract_results(curr_pg)
pb <- progress_estimated(last_page_number(curr_pg)-1)
repeat{
scrape_status(curr_pg) # optional esp since we have a progress bar
pb$tick()$print()
curr_pg <- next_page(curr_pg)
if (is.null(curr_pg)) break
results_df <- bind_rows(results_df, extract_results(curr_pg))
Sys.sleep(5) # be kind
}
Hopefully you can follow along, but that should get all the pages for you for a given search term.
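One possible refinement, sketched with stand-in data rather than the real site: collect each page's data frame in a list and bind once at the end, instead of growing results_df on every iteration. fake_pages, fetch_page and collect_pages are hypothetical names for illustration. With the real code, the fetch step would wrap next_page() and extract_results(), and bind_rows() could replace do.call(rbind, ...).

```r
# Stand-in for the paginated site: three "pages" of results.
fake_pages <- list(
  data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE),
  data.frame(id = 3:4, name = c("c", "d"), stringsAsFactors = FALSE),
  data.frame(id = 5L,  name = "e",         stringsAsFactors = FALSE)
)

# Hypothetical fetcher: returns page i, or NULL once the pages run out.
fetch_page <- function(i) {
  if (i <= length(fake_pages)) fake_pages[[i]] else NULL
}

# Keep fetching until NULL, store each page in a list, bind once at the end.
collect_pages <- function(fetch) {
  out <- list()
  i <- 1L
  repeat {
    pg <- fetch(i)
    if (is.null(pg)) break
    out[[length(out) + 1L]] <- pg
    i <- i + 1L
  }
  do.call(rbind, out)
}

all_rows <- collect_pages(fetch_page)
print(nrow(all_rows))  # 5
```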
RVEST select an item from 'drop down' list and submit form
You don't need to use RSelenium. You can scrape this particular site using rvest and httr, but it's a little tricky. You need to learn how to send forms in HTTP requests. This requires a bit of exploration of the underlying HTML and the HTTP requests sent by your web browser.
In your case, the form is actually pretty simple. It only has two fields: a command field, which is always "doSelect", and a displayObject.id, which is a unique number for each selection item, obtained from the "value" attributes of the "option" tags in the HTML.
Here's how we can look at the drop-downs and their associated ids:
library(tidyverse)
library(rvest)
library(httr)
url <- "http://www.ahw.gov.ab.ca/IHDA_Retrieval/"
paste0(url, "ihdaData.do") %>%
GET() %>%
read_html() %>%
html_node('#content > div > p:nth-child(8) > a') %>%
html_attr("href") %>%
{paste0(url, .)} %>%
GET() %>%
read_html() %>%
html_node('#content > div > table:nth-child(3) > tbody > tr:nth-child(10) > td > a') %>%
html_attr("href") %>%
{paste0(url, .)} %>%
GET() %>%
read_html() -> page
pages <- tibble(id = page %>% html_nodes("option") %>% html_attr("value"),
item = page %>% html_nodes("option") %>% html_text())
pages <- pages[which(pages$item != ""), ]
This gives us a listing of the available items on the page:
pages
#> # A tibble: 8 x 2
#> id item
#> <chr> <chr>
#> 1 724 Human Immunodeficiency Virus (HIV) Incidence Rate (Age Specific)
#> 2 723 Human Immunodeficiency Virus (HIV) Incidence Rate (by Geography)
#> 3 886 Human Immunodeficiency Virus (HIV) Proportion (Ethnicity)
#> 4 887 Human Immunodeficiency Virus (HIV) Proportion (Exposure Cateogory)
#> 5 719 Notifiable Diseases - Age-Sex Specific Incidence Rate
#> 6 1006 Sexually Transmitted Infections (STI) - Age-Sex Specific Case Counts (P~
#> 7 466 Sexually Transmitted Infections (STI) - Age-Sex Specific Rates of Repor~
#> 8 1110 Sexually Transmitted Infections (STI) - Quarterly Congenital Syphilis C~
Now, if we want to select the first one, we just post a list with the required parameters to the correct URL, which you can find by checking the developer console in your browser (F12 in Chrome, Firefox or IE). In this case, it is the relative URL "selectSubCategory.do":
params <- list(command = "doSelect", displayObject.id = pages$id[1])
next_page <- POST(paste0(url, "selectSubCategory.do"), body = params)
So now next_page holds the response for the page you were looking for (its HTML is available via content(next_page)). Unfortunately, in this case it is another drop-down selection page.
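Because each step lands on another drop-down, it may help to wrap the option extraction in a function and reuse it on every page. The sketch below runs on a made-up inline HTML snippet, not the live site; dropdown_items is a hypothetical helper name and the option labels are shortened.

```r
library(rvest)

# Hypothetical helper: id/label pairs for all non-empty <option> tags on a page.
dropdown_items <- function(page) {
  opts <- html_nodes(page, "option")
  items <- data.frame(
    id   = html_attr(opts, "value"),
    item = html_text(opts),
    stringsAsFactors = FALSE
  )
  items[items$item != "", ]
}

# Made-up drop-down with one blank placeholder option and two real ones.
toy_page <- read_html(paste0(
  '<select name="displayObject.id">',
  '<option value=""></option>',
  '<option value="724">HIV Incidence Rate (Age Specific)</option>',
  '<option value="723">HIV Incidence Rate (by Geography)</option>',
  '</select>'
))

pages <- dropdown_items(toy_page)

# Form parameters for the first item, in the same shape as the POST body above.
params <- list(command = "doSelect", displayObject.id = pages$id[1])
print(params$displayObject.id)  # "724"
```

On the real site you would pass the parsed response from each POST to dropdown_items() and build the next params list from the result.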
Hopefully by following the methods above, you will be able to navigate the pages well enough to get the data you need.