Using rvest to Scrape a Website with a Login Page

Using rvest to scrape a website with a login page

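Never mind, I got it to work by using jump_to() on the session (in rvest 1.0.0 and later, jump_to() is named session_jump_to()):

url <- jump_to(session, "https://premium.usnews.com/best-graduate-schools/top-medical-schools/research-rankings")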

Web-Scraping with Login and Redirect using R and rvest/httr

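library(rvest)

# Start a session so cookies set at login are reused by later requests
url <- "https://kickbase.sky.de/"
page <- html_session(url)

# POST the credentials as JSON to the login endpoint.
# rvest:::request_POST is an internal (unexported) rvest function.
page <- rvest:::request_POST(
  page,
  url = "https://kickbase.sky.de/api/v1/user/login",
  body = list(
    "email" = "testscrape@gmail.com",
    "password" = "tester",
    "redirect_url" = "http://kickbase.sky.de/spielerprofil/nadiem-amiri/1639#"
  ),
  encode = "json"
)

# Request the player's news from the API within the logged-in session
player_page <- jump_to(page, "https://kickbase.sky.de/api/v1/news?skip=0&player=1639&limit=3")

# The response body is a raw vector; convert it to text, then parse the JSON
data <- jsonlite::fromJSON(rawToChar(player_page$response$content))

print(data)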

Please note that the website provides an API, and that is where the data comes from:
https://kickbase.sky.de/api/v1/news?skip=0&player=1639&limit=3

The variable data then holds all the information needed.
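
To illustrate the last step in isolation, here is a minimal offline sketch of turning a raw response body into an R object. The JSON payload and its field names (items, id, title) are made up for the example and are not the actual shape of the kickbase API response; only rawToChar() and jsonlite::fromJSON() are the real calls.

```r
library(jsonlite)

# Hypothetical payload standing in for response$content, which httr stores
# as a raw vector of bytes.
raw_body <- charToRaw('{"items":[{"id":1,"title":"A"},{"id":2,"title":"B"}]}')

# Decode the bytes to a character string, then parse the JSON.
json_text <- rawToChar(raw_body)
parsed <- fromJSON(json_text)

# fromJSON() simplifies an array of records into a data frame.
print(parsed$items)
```

If the response object itself is at hand, httr::content(response, as = "text") is another way to get the body as text before parsing.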

Using rvest to scrape specific values from a web page

Here is a solution that retrieves the table of prices and then performs some data cleaning. It still requires some additional clean-up, but the majority is done.

library(rvest)
library(dplyr)
library(stringr)

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")

output <- url1 %>%
  html_nodes(xpath = './/table[@id="hprt-table"]') %>%
  html_table() %>%
  .[[1]]

# Fix column name
colnames(output)[5] <- "Quantity"

# Clean up columns
# Remove repeating information in 2 columns
output2 <- output %>%
  mutate_at(c("Accommodation Type", "Today's price"), ~ str_extract(., ".*\n"))
# Remove repeating newlines
answer <- output2 %>% mutate_all(str_squish)

answer
# A tibble: 8 x 5
  `Accommodation Ty…  Sleeps            `Today's price`  `Your choices`                                                                    Quantity
  <chr>               <chr>             <chr>            <chr>                                                                             <chr>
1 Triple Room         Max persons: 3    US$398           All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on …  Select rooms 0 1 (US$398) 2 (US$795) 3 (US$1,193) 4 (US$…
2 Triple Room         Max persons: 1 …  US$313           All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on …  Select rooms 0 1 (US$313) 2 (US$626) 3 (US$939) 4 (US$1,…
3 Standard Queen Ro…  Max persons: 2    US$325           All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on …  Select rooms 0 1 (US$325) 2 (US$650) 3 (US$976) 4 (US$1,…
4 Standard Queen Ro…  Max persons: 1 …  US$241           All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on …  Select rooms 0 1 (US$241) 2 (US$481) 3 (US$722) 4 (US$96…
5 Superior Queen Ro…  Max persons: 2    US$354           All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on …  Select rooms 0 1 (US$354) 2 (US$708) 3 (US$1,063) 4 (US$…
6 Superior Queen Ro…  Max persons: 1 …  US$270           All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on …  Select rooms 0 1 (US$270) 2 (US$539) 3 (US$809) 4 (US$1,…
7 Deluxe Family Room  Max persons: 2    US$532           All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on …  Select rooms 0 1 (US$532) 2 (US$1,064) 3 (US$1,596) 4 (U…
8 Deluxe Family Room  Max persons: 1 …  US$447           All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on …  Select rooms 0 1 (US$447) 2 (US$895) 3 (US$1,342) 4 (US$…
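
The two cleaning steps can be tried offline on a small hand-made tibble. This is only a sketch: the sample cells below are invented to mimic the repeated text and stray newlines in the scraped table, across() is used as the current dplyr spelling of mutate_at()/mutate_all(), and the final numeric price column (price_usd, a hypothetical name) is one guess at what the remaining clean-up might include.

```r
library(dplyr)
library(stringr)

# Invented sample data: each cell repeats its content after a newline,
# similar to what html_table() returned for the real page.
raw <- tibble(
  `Accommodation Type` = c("Triple Room\n\n\nTriple Room",
                           "Deluxe Family Room\n\nDeluxe Family Room"),
  Sleeps = c("Max  persons:\n 3", "Max persons: 2"),
  `Today's price` = c("US$398\nUS$398", "US$1,193\nUS$1,193")
)

cleaned <- raw %>%
  # Keep only the text up to and including the first newline
  mutate(across(c(`Accommodation Type`, `Today's price`),
                ~ str_extract(.x, ".*\n"))) %>%
  # Trim the ends and collapse runs of whitespace into single spaces
  mutate(across(everything(), str_squish)) %>%
  # Possible further clean-up: strip everything but digits and the decimal point
  mutate(price_usd = as.numeric(str_remove_all(`Today's price`, "[^0-9.]")))

print(cleaned)
```

The pattern ".*\n" works because "." does not match a newline by default, so the match stops at the end of the first line of each cell.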

Using rvest or httr to log in to non-standard forms on a webpage

Your rvest code isn't storing the modified form, so in your example you're just submitting the original pgform without the values filled in. Try:

library(rvest)

url <- "http://www.perfectgame.org/"   ## page to spider
pgsession <- html_session(url)         ## create session
pgform <- html_form(pgsession)[[1]]    ## pull form from session

# Note the new variable assignment

filled_form <- set_values(pgform,
  `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com",
  `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword"
)

submit_form(pgsession, filled_form)

I now get a 200 status code in the response instead of an error. Because the desired submit button appears to be the first submit button in the form, we don't need to pass it as an argument; otherwise we would just pass its name as a string (in straight quotes, not backticks).
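
The point about storing the modified form can be checked without touching the network. The sketch below assumes rvest 1.0.0 or later, where set_values() is named html_form_set(). The HTML is a hypothetical stand-in for the site's login form: only the two field names come from the answer above, and the action URL and submit button are invented.

```r
library(rvest)

# Hypothetical login form; action and submit button are made up.
page <- read_html('<html><body>
  <form action="/login" method="post">
    <input type="text" name="ctl00$Header2$HeaderTop1$tbUsername" value="">
    <input type="password" name="ctl00$Header2$HeaderTop1$tbPassword" value="">
    <input type="submit" name="hypothetical_submit" value="Sign in">
  </form>
</body></html>')

pgform <- html_form(page)[[1]]
user_field <- "ctl00$Header2$HeaderTop1$tbUsername"

# Calling the setter without assigning the result leaves pgform unchanged,
# because R functions return a modified copy.
html_form_set(pgform, `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com")
print(pgform$fields[[user_field]]$value)

# Assigning the result keeps the filled-in values.
filled_form <- html_form_set(
  pgform,
  `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com",
  `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword"
)
print(filled_form$fields[[user_field]]$value)
```

Backticks are needed around the field names only because they contain "$", which is not valid in a bare R argument name.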


