R Web Scraping Across Multiple Pages

R web scraping across multiple pages

You can do this with purrr::map_df() if you want all the info collected as a single data.frame:

library(rvest)
library(purrr)

url_base <- "http://www.winemag.com/?s=washington merlot&drink_type=wine&page=%d"

map_df(1:39, function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(wine        = html_text(html_nodes(pg, ".review-listing .title")),
             excerpt     = html_text(html_nodes(pg, "div.excerpt")),
             rating      = gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation = html_text(html_nodes(pg, "span.appellation")),
             price       = gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors = FALSE)

}) -> wines

dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $ wine (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $ excerpt (chr) "Green olive, green stem and fresh herb aromas are at the ...
## $ rating (chr) "96", "95", "94", "93", "93", "93", "93", "93", "93", "93"...
## $ appellation (chr) "Columbia Valley", "Columbia Valley", "Columbia Valley", "...
## $ price (chr) "140", "70", "70", "20", "70", "40", "135", "50", "60", "3...
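
Since rating and price come back as character after the gsub() clean-up, a small follow-up step (my addition, not part of the original answer) converts them to numeric:

# convert the cleaned text columns to numbers
wines <- dplyr::mutate(wines,
                       rating = as.numeric(rating),
                       price  = as.numeric(price))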

Web Scraping in R to extract data from multiple pages

Observations:

If you scroll down the page you will see there is an option to request more results in a batch. In this particular case, setting the batch size to its maximum returns all results in one go.

Monitoring the web traffic shows no additional requests when more results are asked for, meaning the data is already present in the original response.

Searching the page source for the last symbol reveals that all the items are pre-loaded in a script tag. Examining the JavaScript source files shows the instructions for pushing new batches onto the page based on various input params.


Solution:

You can simply extract the JavaScript object from the script tag and parse it as JSON. Convert the list of lists to a dataframe, then add a constructed url built from the common base string plus each symbol.


TODO:

  1. Up to you if you wish to update the header names (see the sketch after the sample output).
  2. You may wish to format the numeric columns to match the webpage format.

R:

library(rvest)
library(jsonlite)
library(tidyverse)

# pull the pre-loaded JSON out of the Next.js data script tag
r <- read_html('https://stockanalysis.com/stocks/') %>%
  html_element('#__NEXT_DATA__') %>%
  html_text() %>%
  jsonlite::parse_json()

# flatten the list of lists into a data frame and build each stock's url
df <- map_df(r$props$pageProps$stocks, ~ .x) %>%
  mutate(url = paste0('https://stockanalysis.com/stocks/', s))

Sample output: a data frame with one row per stock and the constructed url column added.
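
A minimal sketch of the two TODO items (my addition; s is the ticker field used above, while which other columns are numeric depends on the JSON, hence the generic across()):

# rename the ticker column and add thousands separators to any numeric columns
df <- df %>%
  rename(symbol = s) %>%
  mutate(across(where(is.numeric), ~ format(.x, big.mark = ",", scientific = FALSE)))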

Scraping a table over multiple pages with Rvest

One can use the last argument in the URL, &start=, to iterate through the results page by page. The search results render 25 items per page, so the sequence of start values is 0, 25, 50, 75, 100...

We will obtain the first 5 pages of results, for a total of 125 transactions. Since the first page starts with &start=0, we assign a vector, startRows, to represent the starting row for each page.

We then use that vector to drive lapply() with an anonymous function that reads each page and removes the header row from the data it returns.

library(rvest)
library(dplyr)
start_date <- "1947-01-01"
end_date <- "2020-12-28"
css_selector <- ".datatable"
startRows <- c(0,25,50,75,100)
pages <- lapply(startRows, function(x) {
  url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=", start_date, "&EndDate=", end_date,
                "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=", x)
  webpage <- xml2::read_html(url)
  data <- webpage %>%
    rvest::html_node(css = css_selector) %>%
    rvest::html_table() %>%
    as_tibble()
  # promote the first row to column names, then drop it
  colnames(data) <- data[1, ]
  data[-1, ]
})
data <- do.call(rbind,pages)
head(data,n=10)

...and the output:

> head(data,n=10)
# A tibble: 10 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1947-08… Bombers … "" "• Jack Underman" fractured legs (in auto accide…
2 1948-02… Bullets … "• Harry Jeannette… "" broken rib (DTD) (date approxi…
3 1949-03… Capitols "" "• Horace McKinney /… personal reasons (DTD)
4 1949-11… Capitols "" "• Fred Scolari" fractured right cheekbone (out…
5 1949-12… Knicks "" "• Vince Boryla" mumps (out ~2 weeks)
6 1950-01… Knicks "• Vince Boryla" "" returned to lineup (date appro…
7 1950-10… Knicks "" "• Goebel Ritter / T… bruised ligaments in left ankl…
8 1950-11… Warriors "" "• Andy Phillip" lacerated foot (DTD)
9 1950-12… Celtics "" "• Andy Duncan (a)" fractured kneecap (out indefin…
10 1951-12… Bullets "" "• Don Barksdale" placed on IL
>

Verifying the results

We can verify the results by printing the first and last rows from each page, starting with the last observation on page 1.

data[c(25,26,50,51,75,76,100,101,125),]

...and the output, which matches the content rendered on pages 1-5 of the search results when navigated manually on the website.

> data[c(25,26,50,51,75,76,100,101,125),]
# A tibble: 9 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1960-01-… Celtics "" "• Bill Sharma… sprained Achilles tendon (date approxima…
2 1960-01-… Celtics "" "• Jim Loscuto… sore back and legs (out indefinitely) (d…
3 1964-10-… Knicks "• Art Heyma… "" returned to lineup
4 1964-12-… Hawks "• Bob Petti… "" returned to lineup (date approximate)
5 1968-11-… Nets (ABA) "" "• Levern Tart" fractured right cheekbone (out indefinit…
6 1968-12-… Pipers (AB… "" "• Jim Harding" took leave of absence as head coach for …
7 1970-08-… Lakers "" "• Earnie Kill… dislocated left foot (out indefinitely)
8 1970-10-… Lakers "" "• Elgin Baylo… torn Achilles tendon (out for season) (d…
9 1972-01-… Cavaliers "• Austin Ca… "" returned to lineup

If we look at the last page in the table, we find that the maximum value for the series is &start=61475. The R code to generate the entire sequence of start values (2,460 of them, which matches the number of pages listed in the search results on the website) is:

# generate entire sequence of pages
pages <- c(0,seq(from=25,to=61475,by=25))

...and the output:

> head(pages)
[1] 0 25 50 75 100 125
> tail(pages)
[1] 61350 61375 61400 61425 61450 61475
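
A sketch (my addition) that drives the same per-page logic with this longer vector, wrapping it in a helper I am calling scrape_page and adding a courtesy pause that was not part of the original answer:

# reuse the page logic from above for every start value in the full sequence
scrape_page <- function(x) {
  Sys.sleep(1)  # assumed courtesy delay; the full run makes ~2,460 requests
  url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=", start_date, "&EndDate=", end_date,
                "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=", x)
  data <- xml2::read_html(url) %>%
    rvest::html_node(css = css_selector) %>%
    rvest::html_table() %>%
    as_tibble()
  colnames(data) <- data[1, ]
  data[-1, ]
}
allData <- do.call(rbind, lapply(pages, scrape_page))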

R: How to web scrape a table across multiple pages with the same URL

You have 1838 pages to get. Example with the first 10 pages:

library(xml2)
library(RCurl)
library(dplyr)
library(rvest)

table <- list()
for (i in 1:10) {
  data <- getURL(paste0("https://www.eurofound.europa.eu/observatories/emcc/erm/factsheets", "?page=", i))
  page <- read_html(data)
  table1 <- page %>%
    html_nodes(xpath = "(//table)[2]") %>%
    html_table(header = TRUE)
  # coerce column 7 to integer so every page binds with the same column types
  table1[[1]][[7]] <- as.integer(gsub(",", "", table1[[1]][[7]]))
  table <- bind_rows(table, table1)
  print(i)  # progress indicator
}

table$`Announcement date` <- as.Date(table$`Announcement date`, format = "%d/%m/%Y")

Notes:

table <- list() : creates an empty list (required for the first bind_rows() step).
1:10 : from the first page to the 10th (should be 1:1838 for the full run).
paste0 : builds a new URL on each iteration.
(//table)[2] : the table of interest.
as.integer(gsub(...)) : required for the bind_rows() step. The columns being bound must have the same type in every element, and column 7 would otherwise be read as character because of the thousands separator ",".
print(i) : reports progress.
as.Date : final step to convert the first column to the right type.

Other options: you could download all the pages outside the loop into an object and then process them, as sketched below. Downloading all the pages first with DTA and then parsing them in R might be faster.

Output: the combined table of fact sheets for the first 10 pages.
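
A minimal sketch of that download-first alternative (my addition), reading the URLs directly with read_html() and reusing the same (//table)[2] extraction; the 1:10 range and object names are mine:

# download the first 10 pages once, then parse them in a second pass
urls  <- paste0("https://www.eurofound.europa.eu/observatories/emcc/erm/factsheets?page=", 1:10)
pages <- lapply(urls, read_html)

table <- bind_rows(lapply(pages, function(page) {
  tab <- html_table(html_nodes(page, xpath = "(//table)[2]"), header = TRUE)[[1]]
  tab[[7]] <- as.integer(gsub(",", "", tab[[7]]))  # keep column types consistent for bind_rows
  tab
}))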

Web scraping multiple pages issue

The page uses JavaScript to call an API; you can access the data, in JSON format, like this for the first page:

jsonlite::fromJSON("https://api.boliga.dk/api/v2/sold/search/results?searchTab=1&page=1&sort=date-d&zipcodeFrom=1000&zipcodeTo=2499&street=")

To put together all 4 pages, try this:

library(dplyr)
library(jsonlite)

url_base <- "https://api.boliga.dk/api/v2/sold/search/results?searchTab=1&page=%d&sort=date-d&zipcodeFrom=1000&zipcodeTo=2499&street="

Address_map <- lapply(1:4, function(i) {
  print(i)
  # pause to prevent attacking the server
  Sys.sleep(2)
  # retrieve the results data frame from the returned data structure
  fromJSON(sprintf(url_base, i))$result
})

# bind everything together
answer <- bind_rows(Address_map)

Scraping reviews from Multiple pages in R

You could avoid the expensive overhead of a browser and use httr2. The page uses a query-string GET request to grab the reviews in batches. For each batch, the offset parameters startCursor and endCursor can be picked up from the previous request, and there is also a hasNextPage flag which can be used to stop requesting additional reviews. For the initial request, the title id needs to be picked up and the offset parameters can be set to ''.

After collecting all the reviews (in a list, in my case), I apply a custom function to extract some items of possible interest from each review and generate a final dataframe.


Acknowledgments: I took the idea of using repeat from an answer by @flodal



library(tidyverse)
library(httr2)

get_reviews <- function(results, n) {
  # initial request: the page HTML contains the titleId needed for the reviews API
  r <- request("https://www.rottentomatoes.com/m/dune_2021/reviews") %>%
    req_headers("user-agent" = "mozilla/5.0") %>%
    req_perform() %>%
    resp_body_html() %>%
    toString()

  title_id <- str_match(r, '"titleId":"(.*?)"')[, 2]
  start_cursor <- ""
  end_cursor <- ""

  repeat {
    # request the next batch of reviews, using the cursors returned by the previous batch
    r <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
      req_url_query(f = "", direction = "next", endCursor = end_cursor, startCursor = start_cursor) %>%
      req_perform() %>%
      resp_body_json()
    results[[n]] <- r$reviews
    nextPage <- r$pageInfo$hasNextPage

    if (!nextPage) break

    start_cursor <- r$pageInfo$startCursor
    end_cursor <- r$pageInfo$endCursor
    n <- n + 1
  }
  return(results)
}

n <- 1
results <- list()
data <- get_reviews(results, n)

df <- purrr::map_dfr(data %>% unlist(recursive = F), ~
  data.frame(
    date      = .x$creationDate,
    reviewer  = .x$publication$name,
    url       = .x$reviewUrl,
    quote     = .x$quote,
    score     = if (is.null(.x$scoreOri)) NA_character_ else .x$scoreOri,
    sentiment = .x$scoreSentiment
  ))

