Scraping Tables on Multiple Web Pages with Rvest in R

Scraping a table over multiple pages with Rvest

One can use the last argument in the URL, &start=, to iterate through the results page by page. The search results render 25 items per page, so the start offsets advance in steps of 25: 0, 25, 50, 75, 100, and so on.

We will obtain the first 5 pages of results, for a total of 125 transactions. Since the first page starts with &start=0, we assign a vector, startRows, to hold the starting row for each page.

We then use the vector to drive lapply() with an anonymous function that reads each page's table and removes the duplicated header row.

library(rvest)
library(dplyr)

start_date <- "1947-01-01"
end_date <- "2020-12-28"
css_selector <- ".datatable"
startRows <- c(0, 25, 50, 75, 100)

pages <- lapply(startRows, function(x) {
  # build the search URL for this page of results
  url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=",
                start_date, "&EndDate=", end_date,
                "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=", x)
  webpage <- xml2::read_html(url)
  data <- webpage %>%
    rvest::html_node(css = css_selector) %>%
    rvest::html_table() %>%
    as_tibble()
  # the first row of the scraped table contains the column names
  colnames(data) <- data[1, ]
  data[-1, ]
})
data <- do.call(rbind, pages)
head(data, n = 10)

...and the output:

> head(data,n=10)
# A tibble: 10 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1947-08… Bombers … "" "• Jack Underman" fractured legs (in auto accide…
2 1948-02… Bullets … "• Harry Jeannette… "" broken rib (DTD) (date approxi…
3 1949-03… Capitols "" "• Horace McKinney /… personal reasons (DTD)
4 1949-11… Capitols "" "• Fred Scolari" fractured right cheekbone (out…
5 1949-12… Knicks "" "• Vince Boryla" mumps (out ~2 weeks)
6 1950-01… Knicks "• Vince Boryla" "" returned to lineup (date appro…
7 1950-10… Knicks "" "• Goebel Ritter / T… bruised ligaments in left ankl…
8 1950-11… Warriors "" "• Andy Phillip" lacerated foot (DTD)
9 1950-12… Celtics "" "• Andy Duncan (a)" fractured kneecap (out indefin…
10 1951-12… Bullets "" "• Don Barksdale" placed on IL
>

Verifying the results

We can verify the results by printing the first and last rows from each page, starting with the last observation on page 1.

data[c(25,26,50,51,75,76,100,101,125),]

...and the output, which matches the content rendered on pages 1 - 5 of the search results when navigated manually on the website.

> data[c(25,26,50,51,75,76,100,101,125),]
# A tibble: 9 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1960-01-… Celtics "" "• Bill Sharma… sprained Achilles tendon (date approxima…
2 1960-01-… Celtics "" "• Jim Loscuto… sore back and legs (out indefinitely) (d…
3 1964-10-… Knicks "• Art Heyma… "" returned to lineup
4 1964-12-… Hawks "• Bob Petti… "" returned to lineup (date approximate)
5 1968-11-… Nets (ABA) "" "• Levern Tart" fractured right cheekbone (out indefinit…
6 1968-12-… Pipers (AB… "" "• Jim Harding" took leave of absence as head coach for …
7 1970-08-… Lakers "" "• Earnie Kill… dislocated left foot (out indefinitely)
8 1970-10-… Lakers "" "• Elgin Baylo… torn Achilles tendon (out for season) (d…
9 1972-01-… Cavaliers "• Austin Ca… "" returned to lineup

If we look at the last page in the table, we find that the maximum value for the page series is &start=61475. The R code to generate the entire sequence of page offsets (2,460 values, which matches the number of pages listed in the search results on the website) is:

# generate entire sequence of pages
pages <- c(0,seq(from=25,to=61475,by=25))

...and the output:

> head(pages)
[1] 0 25 50 75 100 125
> tail(pages)
[1] 61350 61375 61400 61425 61450 61475
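To actually collect every page, the anonymous function above could be pulled out into a named helper and applied to the full pages vector, ideally with a short pause between requests to avoid hammering the server. A minimal sketch of that idea (not run here; fetching all 2,460 pages will take a while, and the helper name read_page is just illustrative):

# helper repeating the logic of the anonymous function used earlier,
# with a polite pause between requests
read_page <- function(x) {
  url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=",
                start_date, "&EndDate=", end_date,
                "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=", x)
  page <- xml2::read_html(url)
  Sys.sleep(1)  # wait a second between requests
  tbl <- page %>%
    rvest::html_node(css = css_selector) %>%
    rvest::html_table() %>%
    as_tibble()
  colnames(tbl) <- tbl[1, ]
  tbl[-1, ]
}

# all_data <- do.call(rbind, lapply(pages, read_page))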

Using Rvest to scrape text, table, and combine the two from multiple pages

Consider this approach. We only need to use html_node because your code suggests that there is only one table per page to scrape.

library(tidyverse)
library(rvest)

get_title <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/a[2]') %>% html_text()
get_table <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/table') %>% html_table()

urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", 225:227)

tibble(urls) %>%
  mutate(
    page = map(urls, read_html),
    newcol = map_chr(page, get_title),
    data = map(page, get_table),
    page = NULL, urls = NULL
  ) %>%
  unnest(data)

Output

# A tibble: 52 x 7
newcol `Ward No.` `Ward Name` `Elected Members` Role Party Reservation
<chr> <int> <chr> <chr> <chr> <chr> <chr>
1 Thiruvananthapuram - Chemmaruthy Grama Panchayat 1 VANDIPPURA BABY P Member CPI(M) Woman
2 Thiruvananthapuram - Chemmaruthy Grama Panchayat 2 PALAYAMKUNNU SREELATHA D Member INC Woman
3 Thiruvananthapuram - Chemmaruthy Grama Panchayat 3 KOVOOR KAVITHA V Member INC Woman
4 Thiruvananthapuram - Chemmaruthy Grama Panchayat 4 SIVAPURAM ANIL. V Member INC General
5 Thiruvananthapuram - Chemmaruthy Grama Panchayat 5 MUTHANA JAYALEKSHMI S Member INC Woman
6 Thiruvananthapuram - Chemmaruthy Grama Panchayat 6 MAVINMOODU S SASIKALA NATH Member CPI(M) Woman
7 Thiruvananthapuram - Chemmaruthy Grama Panchayat 7 NJEKKADU P.MANILAL Member INC General
8 Thiruvananthapuram - Chemmaruthy Grama Panchayat 8 CHEMMARUTHY SASEENDRA President INC Woman
9 Thiruvananthapuram - Chemmaruthy Grama Panchayat 9 PANCHAYAT OFFICE PRASANTH PANAYARA Member INC General
10 Thiruvananthapuram - Chemmaruthy Grama Panchayat 10 VALIYAVILA SANJAYAN S Member INC General
# ... with 42 more rows
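As a small design note: the mutate() call above drops the urls column once each page has been parsed. If you want to trace every row back to the page it came from, a minimal variation (everything else unchanged) is to keep that column:

tibble(urls) %>%
  mutate(
    page = map(urls, read_html),
    newcol = map_chr(page, get_title),
    data = map(page, get_table),
    page = NULL
  ) %>%
  unnest(data)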

R: How to web scrape a table across multiple pages with the same URL

You have 1,838 pages to get. Here is an example with the first 10 pages:

library(xml2)
library(RCurl)
library(dplyr)
library(rvest)

table <- list()   # empty list, needed for the first bind_rows() step
for (i in 1:10) {
  data <- getURL(paste0("https://www.eurofound.europa.eu/observatories/emcc/erm/factsheets", "?page=", i))
  page <- read_html(data)
  table1 <- page %>%
    html_nodes(xpath = "(//table)[2]") %>%
    html_table(header = TRUE)
  # column 7 is read as character because of the thousands separator; convert it to integer
  table1[[1]][[7]] <- as.integer(gsub(",", "", table1[[1]][[7]]))
  table <- bind_rows(table, table1)
  print(i)   # progress indicator
}

table$`Announcement date` <- as.Date(table$`Announcement date`, format = "%d/%m/%Y")

Notes:

for (i in 1:10) : i is the loop variable; it runs from the first page to the 10th (use 1:1838 to get them all).
table = list() : creates an empty list (needed for the first bind_rows() step).
paste0 : builds a new URL on each iteration.
(//table)[2] : the table of interest.
as.integer(gsub(...)) : needed for the bind_rows() step, since columns being bound must have the same type; column 7 would otherwise be read as character because of the thousands separator (,).
print(i) : reports progress.
as.Date : final step to convert the Announcement date column to the right type.

Other options: you could download all the pages outside the loop into an object, then process them (see the sketch after the output below). Downloading all the pages with DTA and then parsing them in R might be faster.

Output: a screenshot of the resulting table (not reproduced here).
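A minimal sketch of the "download everything first, then process" alternative mentioned above (the 10-page limit and the object names urls, pages_html and tables are illustrative):

# download step: fetch every page once and keep the parsed documents in a list
urls <- paste0("https://www.eurofound.europa.eu/observatories/emcc/erm/factsheets?page=", 1:10)
pages_html <- lapply(urls, function(u) read_html(getURL(u)))

# processing step: extract and clean the table from each stored document
tables <- lapply(pages_html, function(p) {
  tb <- p %>%
    html_nodes(xpath = "(//table)[2]") %>%
    html_table(header = TRUE) %>%
    .[[1]]
  tb[[7]] <- as.integer(gsub(",", "", tb[[7]]))  # same type fix as in the loop above
  tb
})

table <- bind_rows(tables)
table$`Announcement date` <- as.Date(table$`Announcement date`, format = "%d/%m/%Y")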

R - Loop to scrape and combine pages using rvest

Here is a loop to scrape and combine pages 1-4. I am not sure how to scrape the website to see how many pages are in the set, so for now the number of pages should be changed manually.

pages <- 1:4 # where 4 == whatever the number of pages is

WA_list <- list()
for (i in seq_along(pages)) {
  WA_link <- paste0("https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=", pages[i])
  WA_page <- read_html(WA_link)

  WA_list[[i]] <- WA_page %>%
    html_nodes("table.records-table") %>%
    html_table() %>%
    .[[1]]
}
WA_table <- dplyr::bind_rows(WA_list)

Alternatively, you can scan many more pages than you expect to need.

pages <- 1:100

WA_list <- vector("list", length(pages))   # "pre-allocate" an empty list of length(pages)
for (i in seq_along(pages)) {
  print(i)
  WA_link <- paste0("https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=", pages[i])
  WA_page <- read_html(WA_link)

  WA_list[[i]] <- WA_page %>%
    html_nodes("table.records-table") %>%
    html_table() %>%
    .[[1]]
  # crude: rebuild the data frame on every iteration so the loop can simply be stopped once the last
  # page has been reached; ideally there would be a check here for when no data is retrieved on pages[i]
  WA_table <- dplyr::bind_rows(WA_list)
}

Note: ideally the loop would exit at the first i where WA_page %>% html_nodes("table.records-table") %>% html_table() returns no data, with bind_rows() moved to after the loop to prevent redundant processing. One way to do that is sketched below.
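A sketch of that idea (not tested against the live site): switch to a repeat loop, check whether the current page actually returned a records table, and break as soon as it does not; bind_rows() then runs once, after the loop.

WA_list <- list()
i <- 1
repeat {
  WA_link <- paste0("https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=", i)
  WA_page <- read_html(WA_link)
  tbls <- WA_page %>% html_nodes("table.records-table") %>% html_table()

  # stop as soon as a page comes back without a records table (or with an empty one)
  if (length(tbls) == 0 || nrow(tbls[[1]]) == 0) break

  WA_list[[i]] <- tbls[[1]]
  i <- i + 1
}
WA_table <- dplyr::bind_rows(WA_list)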

Scraping tables on multiple web pages with rvest in R

One way would be to make a vector of all the urls you are interested in and then use sapply:

library(rvest)

years <- 1970:2016
urls <- paste0("http://www.baseball-reference.com/teams/MIL/", years, ".shtml")
# head(urls)

get_table <- function(url) {
  url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
    html_table()
}

results <- sapply(urls, get_table)

results should be a list of 47 data.frame objects; each element is named with the url (i.e., year) it represents. That is, results[[1]] corresponds to 1970, and results[[47]] corresponds to 2016.
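If a single combined table is wanted, the named list can be collapsed with bind_rows(), using the element names (the urls) as an id column. A minimal sketch, assuming the batting tables share the same column structure in every year (the year-extraction pattern is illustrative):

library(dplyr)

# bind all years into one data frame, keeping the source url, then derive the year from it
batting <- bind_rows(results, .id = "url") %>%
  mutate(year = as.integer(sub(".*/MIL/(\\d{4})\\.shtml$", "\\1", url)))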

Web Scraping in R to extract data from multiple pages

Observations:

If you scroll down the page you will see there is an option to request more results in a batch. In this particular case, setting it to the maximum batch size returns all results in one go.

Monitoring the web traffic shows no additional traffic when requesting more results, meaning the data is present in the original response.

Doing a search of the page source for the last symbol reveals that all the items are pre-loaded in a script tag. Examining the JavaScript source files shows the instructions for pushing new batches onto the page based on various input params.


Solution:

You can simply extract the JavaScript object from the script tag and parse it as JSON. Convert the list of lists to a dataframe, then add a constructed url based on the common base string plus the symbol.


TODO:

  1. You may wish to update the header names (a small sketch follows the sample output below)
  2. You may wish to format the numeric columns to match the webpage formatting

R:

library(rvest)
library(jsonlite)
library(tidyverse)

r <- read_html('https://stockanalysis.com/stocks/') %>%
  html_element('#__NEXT_DATA__') %>%
  html_text() %>%
  jsonlite::parse_json()

df <- map_df(r$props$pageProps$stocks, ~ .x) %>%
  mutate(url = paste0('https://stockanalysis.com/stocks/', s))

Sample output: a screenshot of the resulting data frame (not reproduced here).
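Following the first TODO item, the headers can be tidied once df exists. A minimal sketch, where s is the only field name taken from the code above and symbol is just an illustrative new name:

df <- df %>%
  rename(symbol = s)  # rename the known ticker field; other columns are left as the API returned them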

Scraping reviews from Multiple pages in R

You could avoid the expensive overhead of a browser and use httr2. The page uses a queryString GET request to grab the reviews in batches. For each batch, the offset parameters startCursor and endCursor can be picked up from the previous request, and there is a hasNextPage flag field which can be used to terminate requests for additional reviews. For the initial request, the title id needs to be picked up and the offset parameters can be set to ''.

After collecting all reviews, in a list in my case, I apply a custom function to extract some items of possible interest from each review to generate a final dataframe.


Acknowledgments: I took the idea of using repeat() from @flodal here



library(tidyverse)
library(httr2)

get_reviews <- function(results, n) {
  # initial request: fetch the reviews page and pick up the title id embedded in its source
  r <- request("https://www.rottentomatoes.com/m/dune_2021/reviews") %>%
    req_headers("user-agent" = "mozilla/5.0") %>%
    req_perform() %>%
    resp_body_html() %>%
    toString()

  title_id <- str_match(r, '"titleId":"(.*?)"')[, 2]
  start_cursor <- ""
  end_cursor <- ""

  # page through the reviews endpoint, carrying the cursors forward,
  # until hasNextPage signals there are no further batches
  repeat {
    r <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
      req_url_query(f = "", direction = "next", endCursor = end_cursor, startCursor = start_cursor) %>%
      req_perform() %>%
      resp_body_json()
    results[[n]] <- r$reviews
    nextPage <- r$pageInfo$hasNextPage

    if (!nextPage) break

    start_cursor <- r$pageInfo$startCursor
    end_cursor <- r$pageInfo$endCursor
    n <- n + 1
  }
  return(results)
}

n <- 1
results <- list()
data <- get_reviews(results, n)

# flatten the list of review batches and keep the fields of interest
df <- purrr::map_dfr(data %>% unlist(recursive = F), ~
  data.frame(
    date = .x$creationDate,
    reviewer = .x$publication$name,
    url = .x$reviewUrl,
    quote = .x$quote,
    score = if (is.null(.x$scoreOri)) NA_character_ else .x$scoreOri,
    sentiment = .x$scoreSentiment
  ))

