Harvest (rvest) Multiple HTML Pages from a List of URLs

Harvest (rvest) multiple HTML pages from a list of URLs

This will scrape all of the pages into a single data frame (one row per TOC entry); the tedious-but-straightforward "print/output" code is left to the OP:

library(rvest)
library(dplyr)

country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States",
         "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan",
         "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

# scrape each url once; one row per TOC entry
bind_rows(lapply(url, function(x) {

  data.frame(url = x,
             toc_entry = read_html(x) %>%
               html_nodes(".toctext") %>%
               html_text())

})) -> toc_entries

# attach the country labels back on by url
df <- toc_entries %>% left_join(df, by = "url")

df[sample(nrow(df), 10), ]

## Source: local data frame [10 x 3]
##
##                                            url                            toc_entry country
## 1           http://en.wikipedia.org/wiki/Japan                   Government finance   Japan
## 2          http://en.wikipedia.org/wiki/Canada        Cold War and civil rights era      US
## 3   http://en.wikipedia.org/wiki/United_States                                 Food  Canada
## 4           http://en.wikipedia.org/wiki/Japan                               Sports   Japan
## 5          http://en.wikipedia.org/wiki/Canada                             Religion      US
## 6           http://en.wikipedia.org/wiki/China        Cold War and civil rights era   China
## 7           http://en.wikipedia.org/wiki/Japan Literature, philosophy, and the arts   Japan
## 8   http://en.wikipedia.org/wiki/United_States                           Population  Canada
## 9           http://en.wikipedia.org/wiki/Japan                          Settlements   Japan
## 10         http://en.wikipedia.org/wiki/Canada                             Military      US
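
If you would rather attach the country label during the scrape instead of joining afterwards, the same approach can be written with Map() over the two columns of df. A minimal sketch, assuming each country in df is paired with its correct url:

library(rvest)
library(dplyr)

# walk the country/url pairs of df together, tagging each TOC entry as it is scraped
toc_entries <- bind_rows(Map(function(ctry, u) {
  data.frame(country = ctry,
             url = u,
             toc_entry = read_html(u) %>%
               html_nodes(".toctext") %>%
               html_text())
}, df$country, df$url))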

Loop across multiple URLs in R with rvest

You are attempting to vectorize a function that cannot take multiple items in one call: read_html() reads one page at a time and expects a single URL (a length-one character vector). Instead, loop through the site list with lapply() and then bind all of the data frames together:

library(rvest)

# offsets for the successive results pages (0, 100, ..., 800)
jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep = "")

# read each page and keep its first table
dfList <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, 'table')
  html_table(draft_table)[[1]]
})

finaldf <- do.call(rbind, dfList) # ASSUMING ALL DFs MAINTAIN SAME COLS
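
If the scraped tables ever differ slightly in their columns, do.call(rbind, ...) will fail; dplyr::bind_rows() is a more forgiving way to combine the list because it matches columns by name and fills anything missing with NA. A short sketch (the "page" id column is just illustrative):

library(dplyr)

# bind_rows() matches columns by name and fills gaps with NA;
# .id records which list element (i.e. which page) each row came from
finaldf <- bind_rows(dfList, .id = "page")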

How to scrape data from multiple pages by dynamically updating the url with rvest

I would just write a function that scrapes a given year, then bind the rows across years:

  1. Use paste() to build a dynamic URL from the base string and a variable for the year
  2. Write the scrape function for that URL (note: you don't have to use html_text() -- the data is stored as a table, so it can be extracted directly with html_table())
  3. Loop the function over the years using lapply()
  4. Combine the data frames in the list using bind_rows()

Below is an example of this process for years 2010 to 2012.

library(rvest)
library(tidyverse)

scrape.draft <- function(year){

  url <- paste("https://www.eliteprospects.com/draft/nhl-entry-draft/", year, sep = "")

  out <- read_html(url) %>%
    html_table(header = TRUE) %>% .[[2]] %>%  # the draft results are the second table on the page
    filter(!grepl("ROUND", GP)) %>%           # drop the "ROUND" separator rows
    mutate(draftYear = year)

  return(out)
}

temp <- lapply(2010:2012, scrape.draft) %>%
  bind_rows()
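
For reference, the lapply()/bind_rows() step can also be written with purrr (already loaded by the tidyverse); a one-line sketch doing the same thing:

# map_dfr() applies scrape.draft() to each year and row-binds the results
temp <- purrr::map_dfr(2010:2012, scrape.draft)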

Rvest: Scrape multiple URLs

Here's one approach using purrr and rvest. The key idea is to save the parsed page, and then extract the bits you're interested in.

library(rvest)
library(purrr)

topmovies <- read_html("http://www.imdb.com/chart/top")

# collect an absolute link for each film in the chart
links <- topmovies %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  xml2::url_absolute("http://imdb.com") %>%
  .[1:5] # for testing

# parse each linked page once and keep the parsed documents
pages <- links %>% map(read_html)

# then pull each piece of information out of the saved pages
title <- pages %>%
  map_chr(. %>%
            html_nodes("h1") %>%
            html_text())

rating <- pages %>%
  map_dbl(. %>%
            html_nodes("strong span") %>%
            html_text() %>%
            as.numeric())
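
From there the pieces can be assembled into a single table; a minimal sketch, assuming the title and rating vectors stay in the same order as links:

library(tibble)

# one row per film, in the same order as the scraped links
top_films <- tibble(url = links, title = title, rating = rating)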

rvest: Collecting page numbers from a website

Using RSelenium, we can get the page numbers from the pagination bar:

library(stringr)
library(RSelenium)
library(dplyr)
library(rvest)

# launch the browser
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
url = "https://www.fotocasa.es/es/alquiler/casas/madrid-capital/todas-las-zonas/l"
remDr$navigate(url)

# accept the cookie banner
remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()

# scroll to the end of the page
webElem <- remDr$findElement("css", "html")
webElem$sendKeysToElement(list(key = "end"))

# use the up arrow to bring the pagination into view
webElem$sendKeysToElement(list(key = "up_arrow"))

# get the page urls from the pagination links
link = remDr$getPageSource()[[1]] %>% read_html() %>%
  html_nodes('.re-Pagination') %>% html_nodes('a') %>%
  html_attr('href')

# extract only the page numbers from the urls
str_extract(link, "[[:digit:]]+")
[1] NA "2" "3" "4" "200" "2"
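
If what you actually need is the total number of pages (200 here), one way is to take the largest number appearing in those links; a small follow-up sketch built on the link vector above:

# the largest number in the pagination hrefs is the last page
n_pages <- max(as.numeric(str_extract(link, "[[:digit:]]+")), na.rm = TRUE)
n_pages
[1] 200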

Scrape and Loop with Rvest

Try this:

library(tidyverse)
library(rvest)

# build the url for each results page (pages 0 to 2)
page <- 0:2
urls <- list()
for (i in 1:length(page)) {
  url <- paste0("https://www.mlssoccer.com/stats/season?page=", page[i])
  urls[[i]] <- url
}

### loop over the urls and get the table from each page
tbl <- list()
for (j in seq_along(urls)) {
  # tbl[[j]] stores the table scraped from the j-th url as an element of the tbl list
  tbl[[j]] <- urls[[j]] %>%
    read_html() %>%
    html_node("table") %>%
    html_table()
}

# convert the list to a data frame
tbl <- do.call(rbind, tbl)

The table[[j]] <- tbl assignment at the end of the for loop in your original code is not necessary, because each page's table is already assigned as an element of the tbl list by the tbl[[j]] <- urls[[j]] %>% ... step inside the loop. (A manual j <- j + 1 counter is also redundant: for (j in seq_along(urls)) already advances j on each iteration.)
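
For comparison, the two loops can be collapsed with purrr (loaded by the tidyverse); a sketch assuming every page carries the same table layout:

library(tidyverse)
library(rvest)

# build the page urls, read each one, grab its table, and row-bind the results
tbl <- paste0("https://www.mlssoccer.com/stats/season?page=", 0:2) %>%
  map(read_html) %>%
  map(html_node, "table") %>%
  map_dfr(html_table)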


