Harvest (rvest) multiple HTML pages from a list of urls
This will scrape them into a single data frame (one row per TOC entry); the tedious-but-straightforward print/output code is left to the OP:
library(rvest)
library(dplyr)
country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States",
"http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan",
"http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
toc_entries <- bind_rows(lapply(url, function(x) {
  data.frame(url = x,
             toc_entry = read_html(x) %>%
               html_nodes(".toctext") %>%
               html_text(),
             stringsAsFactors = FALSE)
}))
df <- toc_entries %>% left_join(df, by = "url")
df[sample(nrow(df), 10),]
## Source: local data frame [10 x 3]
##
## url toc_entry country
## 1 http://en.wikipedia.org/wiki/Japan Government finance Japan
## 2 http://en.wikipedia.org/wiki/Canada Cold War and civil rights era US
## 3 http://en.wikipedia.org/wiki/United_States Food Canada
## 4 http://en.wikipedia.org/wiki/Japan Sports Japan
## 5 http://en.wikipedia.org/wiki/Canada Religion US
## 6 http://en.wikipedia.org/wiki/China Cold War and civil rights era China
## 7 http://en.wikipedia.org/wiki/Japan Literature, philosophy, and the arts Japan
## 8 http://en.wikipedia.org/wiki/United_States Population Canada
## 9 http://en.wikipedia.org/wiki/Japan Settlements Japan
## 10 http://en.wikipedia.org/wiki/Canada Military US
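Note that in the sample data above the country labels are not in the same order as the urls, so the join pairs, for example, the United_States page with "Canada". If each label should travel with its own page, here is a minimal sketch that scrapes and labels in one pass so no join is needed (it assumes the country and url vectors are supplied in matching order):
library(rvest)
library(dplyr)
# scrape each page once and carry its country label along
# (assumes the country and url vectors are given in matching order)
toc_entries <- bind_rows(Map(function(ctry, u) {
  data.frame(country   = ctry,
             url       = u,
             toc_entry = read_html(u) %>%
               html_nodes(".toctext") %>%
               html_text(),
             stringsAsFactors = FALSE)
}, country, url))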
loop across multiple urls in r with rvest
You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() requires one page per call: it reads web data one page at a time and expects a single (scalar) URL. Consider looping through the site list with lapply() and then binding all the data frames together:
jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
'&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
'&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
'&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
'&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
'&order_by_asc=&offset=', jump, sep="")
dfList <- lapply(site, function(i) {
  webpage     <- read_html(i)
  draft_table <- html_nodes(webpage, 'table')
  html_table(draft_table)[[1]]       # keep the first table on each results page
})
finaldf <- do.call(rbind, dfList) # ASSUMING ALL DFs MAINTAIN SAME COLS
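If any of the offsets fails to download, the plain lapply() above stops at the first error. A hedged variant of the same loop that simply skips failed pages (same selectors, just wrapped in tryCatch()):
# sketch with per-page error handling (not in the original answer): a failed
# request returns NULL instead of stopping the whole loop
dfList <- lapply(site, function(i) {
  tryCatch({
    webpage     <- read_html(i)
    draft_table <- html_nodes(webpage, 'table')
    html_table(draft_table)[[1]]
  }, error = function(e) NULL)       # NULL marks a page that could not be read
})
finaldf <- do.call(rbind, Filter(Negate(is.null), dfList))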
How to scrape data from multiple pages by dynamically updating the url with rvest
I would just create a function that scrapes for a given year, then bind the rows for that year.
- Use paste() to create a dynamic url from the fixed string and a variable for the year
- Write the scrape function for that url (note: you don't have to use html_text() -- the data is stored as a table, so it can be extracted directly with html_table())
- Loop the function over the years with lapply()
- Combine the data frames in the resulting list with bind_rows()
Below is an example of this process for years 2010 to 2012.
library(rvest);library(tidyverse)
scrape.draft = function(year){
  url = paste("https://www.eliteprospects.com/draft/nhl-entry-draft/", year, sep = "")
  out = read_html(url) %>%
    html_table(header = T) %>%
    .[[2]] %>%                        # the second table on the page holds the draft results
    filter(!grepl("ROUND", GP)) %>%   # drop the round-header rows mixed into the table
    mutate(draftYear = year)          # tag every row with the year that was scraped
  return(out)
}
temp = lapply(2010:2012,scrape.draft) %>%
bind_rows()
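A quick way to check that each year actually contributed rows after the bind (just a usage sketch with the objects defined above):
temp %>% count(draftYear)   # one row per draft year with the number of players scraped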
Rvest: Scrape multiple URLs
Here's one approach using purrr and rvest. The key idea is to save the parsed page, and then extract the bits you're interested in.
library(rvest)
library(purrr)
topmovies <- read_html("http://www.imdb.com/chart/top")
links <- topmovies %>%
html_nodes(".titleColumn") %>%
html_nodes("a") %>%
html_attr("href") %>%
xml2::url_absolute("http://imdb.com") %>%
.[1:5] # for testing
pages <- links %>% map(read_html)
title <- pages %>%
map_chr(. %>%
html_nodes("h1") %>%
html_text()
)
rating <- pages %>%
map_dbl(. %>%
html_nodes("strong span") %>%
html_text() %>%
as.numeric()
)
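The extracted vectors line up by position, so they can be collected into one data frame at the end (a small follow-up sketch, not part of the original answer):
# combine the per-page pieces into a single data frame
top5 <- data.frame(title = title, rating = rating, url = links,
                   stringsAsFactors = FALSE)
top5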
rvest: Collecting page numbers from a website
Using RSelenium, we can get the page numbers from the pagination as follows:
library(stringr)
library(RSelenium)
library(dplyr)
#launching browser
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
url = "https://www.fotocasa.es/es/alquiler/casas/madrid-capital/todas-las-zonas/l"
remDr$navigate(url)
#accept cookie
remDr$findElement(using = "xpath",'/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
#scroll to the end of page
webElem <- remDr$findElement("css", "html")
webElem$sendKeysToElement(list(key="end"))
#use the up_arrow to get pagination into view
webElem$sendKeysToElement(list(key = "up_arrow"))
#get page url from pagination
link = remDr$getPageSource()[[1]] %>% read_html() %>% html_nodes('.re-Pagination') %>% html_nodes('a') %>%
html_attr('href')
#extract only page numbers from urls
str_extract(link, "[[:digit:]]+")
[1] NA "2" "3" "4" "200" "2"
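If the goal is to visit every results page, one option is to take the largest number from that vector and rebuild the page urls. The "/<page number>" suffix below is an assumption about the site's url pattern based on the pagination links, not something verified here:
# derive the last page number from the pagination links
last_page <- max(as.numeric(str_extract(link, "[[:digit:]]+")), na.rm = TRUE)
# build one url per results page (assumed pattern: base url followed by /<page number>)
page_urls <- paste0(url, "/", seq_len(last_page))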
Scrape and Loop with Rvest
Try this:
library(tidyverse)
library(rvest)
page <- 0:2
urls <- list()
for (i in seq_along(page)) {
  urls[[i]] <- paste0("https://www.mlssoccer.com/stats/season?page=", page[i])
}
### loop over the urls and get the table from each page
tbl <- list()
for (j in seq_along(urls)) {
  tbl[[j]] <- urls[[j]] %>%   # store the table scraped from each url as an element of the tbl list
    read_html() %>%
    html_node("table") %>%
    html_table()
}
#convert list to data frame
tbl <- do.call(rbind, tbl)
The table[[j]] <- tbl at the end of the for loop in your original code is not necessary, because each scraped table is already assigned as an element of the tbl list via tbl[[j]] <- ...; likewise, for (j in seq_along(urls)) manages the counter itself, so there is no need to initialise or increment j by hand.
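For comparison, the same two loops can be collapsed into a single pipeline with no index bookkeeping at all; a compact sketch of the same scrape:
library(tidyverse)
library(rvest)
# build the three page urls, read the table from each, and bind the rows
urls <- paste0("https://www.mlssoccer.com/stats/season?page=", 0:2)
tbl <- urls %>%
  map(~ read_html(.x) %>% html_node("table") %>% html_table()) %>%
  bind_rows()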