Harvest (rvest) Multiple HTML Pages from a List of URLs

Harvest (rvest) multiple HTML pages from a list of URLs

This will scrape all of the pages into a single data frame (one row per TOC entry); the tedious-but-straightforward "print/output" code is left to the OP:

library(rvest)
library(dplyr)

country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States",
         "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan",
         "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

# scrape each url once; one row per TOC entry
bind_rows(lapply(url, function(x) {

  data.frame(url = x,
             toc_entry = read_html(x) %>%
               html_nodes(".toctext") %>%
               html_text())

})) -> toc_entries

# attach the country labels back on by url
df <- toc_entries %>% left_join(df, by = "url")

df[sample(nrow(df), 10), ]

## Source: local data frame [10 x 3]
##
##                                            url                            toc_entry country
## 1           http://en.wikipedia.org/wiki/Japan                   Government finance   Japan
## 2          http://en.wikipedia.org/wiki/Canada        Cold War and civil rights era      US
## 3   http://en.wikipedia.org/wiki/United_States                                 Food  Canada
## 4           http://en.wikipedia.org/wiki/Japan                               Sports   Japan
## 5          http://en.wikipedia.org/wiki/Canada                             Religion      US
## 6           http://en.wikipedia.org/wiki/China        Cold War and civil rights era   China
## 7           http://en.wikipedia.org/wiki/Japan Literature, philosophy, and the arts   Japan
## 8   http://en.wikipedia.org/wiki/United_States                           Population  Canada
## 9           http://en.wikipedia.org/wiki/Japan                          Settlements   Japan
## 10         http://en.wikipedia.org/wiki/Canada                             Military      US
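
If you would rather attach the country label during the scrape instead of joining afterwards, the same approach can be written with Map() over the two columns of df. A minimal sketch, assuming each country in df is paired with its correct url:

library(rvest)
library(dplyr)

# walk the country/url pairs of df together, tagging each TOC entry as it is scraped
toc_entries <- bind_rows(Map(function(ctry, u) {
  data.frame(country = ctry,
             url = u,
             toc_entry = read_html(u) %>%
               html_nodes(".toctext") %>%
               html_text())
}, df$country, df$url))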

Loop across multiple URLs in R with rvest

You are attempting to vectorize a function that cannot take multiple items in one call: read_html() reads one page at a time and expects a single URL (a length-one character vector). Instead, loop through the site list with lapply() and then bind all of the data frames together:

library(rvest)

# offsets for the successive results pages (0, 100, ..., 800)
jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep = "")

# read each page and keep its first table
dfList <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, 'table')
  html_table(draft_table)[[1]]
})

finaldf <- do.call(rbind, dfList) # ASSUMING ALL DFs MAINTAIN SAME COLS
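
If the scraped tables ever differ slightly in their columns, do.call(rbind, ...) will fail; dplyr::bind_rows() is a more forgiving way to combine the list because it matches columns by name and fills anything missing with NA. A short sketch (the "page" id column is just illustrative):

library(dplyr)

# bind_rows() matches columns by name and fills gaps with NA;
# .id records which list element (i.e. which page) each row came from
finaldf <- bind_rows(dfList, .id = "page")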

How to scrape data from multiple pages by dynamically updating the url with rvest

I would just write a function that scrapes a given year, then bind the rows across years:

  1. Use paste() to build a dynamic URL from the base string and a variable for the year
  2. Write the scrape function for that URL (note: you don't have to use html_text() -- the data is stored as a table, so it can be extracted directly with html_table())
  3. Loop the function over the years using lapply()
  4. Combine the data frames in the list using bind_rows()

Below is an example of this process for years 2010 to 2012.

library(rvest)
library(tidyverse)

scrape.draft <- function(year){

  url <- paste("https://www.eliteprospects.com/draft/nhl-entry-draft/", year, sep = "")

  out <- read_html(url) %>%
    html_table(header = TRUE) %>% .[[2]] %>%  # the draft results are the second table on the page
    filter(!grepl("ROUND", GP)) %>%           # drop the "ROUND" separator rows
    mutate(draftYear = year)

  return(out)
}

temp <- lapply(2010:2012, scrape.draft) %>%
  bind_rows()
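
For reference, the lapply()/bind_rows() step can also be written with purrr (already loaded by the tidyverse); a one-line sketch doing the same thing:

# map_dfr() applies scrape.draft() to each year and row-binds the results
temp <- purrr::map_dfr(2010:2012, scrape.draft)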

Rvest: Scrape multiple URLs

Here's one approach using purrr and rvest. The key idea is to save the parsed page, and then extract the bits you're interested in.

library(rvest)
library(purrr)

topmovies <- read_html("http://www.imdb.com/chart/top")

# collect an absolute link for each film in the chart
links <- topmovies %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  xml2::url_absolute("http://imdb.com") %>%
  .[1:5] # for testing

# parse each linked page once and keep the parsed documents
pages <- links %>% map(read_html)

# then pull each piece of information out of the saved pages
title <- pages %>%
  map_chr(. %>%
            html_nodes("h1") %>%
            html_text())

rating <- pages %>%
  map_dbl(. %>%
            html_nodes("strong span") %>%
            html_text() %>%
            as.numeric())
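
From there the pieces can be assembled into a single table; a minimal sketch, assuming the title and rating vectors stay in the same order as links:

library(tibble)

# one row per film, in the same order as the scraped links
top_films <- tibble(url = links, title = title, rating = rating)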

rvest: Collecting page numbers from a website

Using RSelenium, we can get the page numbers from the pagination bar:

library(stringr)
library(RSelenium)
library(dplyr)
library(rvest)

# launch the browser
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
url = "https://www.fotocasa.es/es/alquiler/casas/madrid-capital/todas-las-zonas/l"
remDr$navigate(url)

# accept the cookie banner
remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()

# scroll to the end of the page
webElem <- remDr$findElement("css", "html")
webElem$sendKeysToElement(list(key = "end"))

# use the up arrow to bring the pagination into view
webElem$sendKeysToElement(list(key = "up_arrow"))

# get the page urls from the pagination links
link = remDr$getPageSource()[[1]] %>% read_html() %>%
  html_nodes('.re-Pagination') %>% html_nodes('a') %>%
  html_attr('href')

# extract only the page numbers from the urls
str_extract(link, "[[:digit:]]+")
[1] NA "2" "3" "4" "200" "2"
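
If what you actually need is the total number of pages (200 here), one way is to take the largest number appearing in those links; a small follow-up sketch built on the link vector above:

# the largest number in the pagination hrefs is the last page
n_pages <- max(as.numeric(str_extract(link, "[[:digit:]]+")), na.rm = TRUE)
n_pages
[1] 200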

Scrape and Loop with Rvest

Try this:

library(tidyverse)
library(rvest)

# build the url for each results page (pages 0 to 2)
page <- 0:2
urls <- list()
for (i in 1:length(page)) {
  url <- paste0("https://www.mlssoccer.com/stats/season?page=", page[i])
  urls[[i]] <- url
}

### loop over the urls and get the table from each page
tbl <- list()
for (j in seq_along(urls)) {
  # tbl[[j]] stores the table scraped from the j-th url as an element of the tbl list
  tbl[[j]] <- urls[[j]] %>%
    read_html() %>%
    html_node("table") %>%
    html_table()
}

# convert the list to a data frame
tbl <- do.call(rbind, tbl)

The table[[j]] <- tbl assignment at the end of the for loop in your original code is not necessary, because each page's table is already assigned as an element of the tbl list by the tbl[[j]] <- urls[[j]] %>% ... step inside the loop. (A manual j <- j + 1 counter is also redundant: for (j in seq_along(urls)) already advances j on each iteration.)
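
For comparison, the two loops can be collapsed with purrr (loaded by the tidyverse); a sketch assuming every page carries the same table layout:

library(tidyverse)
library(rvest)

# build the page urls, read each one, grab its table, and row-bind the results
tbl <- paste0("https://www.mlssoccer.com/stats/season?page=", 0:2) %>%
  map(read_html) %>%
  map(html_node, "table") %>%
  map_dfr(html_table)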


