Web-scraping using rvest package, how do I loop over a bunch of county FIPS codes?
You could first request the state's geo file to collect all the county codes and names for a given state. This can be done with a helper function. A second helper function can then tidy the HTML returned from each county page (where the URL is a base string joined with the county id/code) into a data frame containing the info of interest. Map that latter function with future_map_dfr, from furrr, to return a single combined data frame.
Notes:
Code is written with R 4.1.0+ syntax.
Credit to @hrbrmstr for the approach to handling br elements.
library(rvest)
library(tidyverse)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
library(furrr)
#> Loading required package: future
library(xml2)
state_county_codes <- \(state_code) {
  # The JS file embeds a JSON array of county records; extract and parse it
  read_html(sprintf("https://farm.ewg.org/ammap/maps/js/%sCounties.js", state_code)) |>
    html_text() |>
    stringr::str_match("(\\[.*\\])") |>
    (\(x) x[, 1])() |>
    jsonlite::parse_json(simplifyVector = TRUE) |>
    select(-d) |>
    mutate(
      id = substr(id, 2, 6),  # drop the leading character to leave the 5-digit FIPS code
      webpage = paste0("https://farm.ewg.org/regionsummary.php?fips=", id)
    ) |>
    as_tibble()
}
county_summary <- \(county_code) {
  page <- read_html(sprintf("https://farm.ewg.org/regionsummary.php?fips=%s", county_code))
  # Mark each <br> with a "#" sibling, then drop the <br> nodes, so the
  # stacked cell values can be split on "#" below (credit @hrbrmstr)
  xml_find_all(page, ".//br") |> xml_add_sibling("p", "#")
  xml_find_all(page, ".//br") |> xml_remove()
  t <- page |>
    html_element(".table") |>
    html_table()
  t <- t[-5] |> clean_names()
  data.frame(
    id = county_code,
    year = t$year |>
      stringi::stri_remove_empty() |>
      rep(4) |>
      (\(x) stringr::str_replace(x, "‡", ""))(),
    subsidy_category = stringr::str_split_fixed(t$subsidy_category, "#", 4) |> stringi::stri_remove_empty() |> as.vector(),
    amount = stringr::str_split_fixed(t$subsidy_category_2, "#", 4) |> stringi::stri_remove_empty() |> as.vector(),
    number = stringr::str_split_fixed(t$subsidy_category_3, "#", 4) |> stringi::stri_remove_empty() |> as.vector()
  )
}
state_code <- "co"
counties <- state_county_codes(state_code)
no_cores <- future::availableCores() - 1  # leave one core free for the main session
future::plan(future::multisession, workers = no_cores)
results <- future_map_dfr(counties$id, .f = county_summary)
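Once the mapping finishes, the worker sessions can be released by switching back to a sequential plan (optional housekeeping, not required for the join below):
future::plan(future::sequential)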
final <- dplyr::left_join(results, counties, by = "id") |>
select(title, everything()) |>
rename(county = title)
Created on 2021-11-03 by the reprex package (v2.0.1)
Online newspaper data scraping with R, 'rvest' package
The webpage is dynamically loaded: new articles are loaded as you scroll down. Thus you need RSelenium together with rvest to extract the required data.
Launch browser
library(rvest)
library(RSelenium)
url = 'https://en.trend.az/archive/2021-11-02'
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
#click outside in an empty space
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
# scroll to the end of the page repeatedly so all articles load
for (i in 1:17) {
  Sys.sleep(2)
  webElem$sendKeysToElement(list(key = "end"))
}
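The hard-coded 17 scrolls suits this particular archive page; a more robust sketch (assuming the page height stops growing once all articles are loaded) scrolls until document.body.scrollHeight stabilises:
old_height <- 0
repeat {
  webElem$sendKeysToElement(list(key = "end"))
  Sys.sleep(2)
  new_height <- remDr$executeScript("return document.body.scrollHeight")[[1]]
  if (new_height == old_height) break  # nothing new loaded
  old_height <- new_height
}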
Get Article Titles
remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
[1] "Chelsea defeats Malmö with minimum score"
[2] "Iran’s import of COVID-19 vaccine exceeds 146mn doses: IRICA"
[3] "Sadyr Zhaparov, Fumio Kishida discuss topical issues of Kyrgyz-Japanese relations"
[4] "We will definitely see new names at World Championships and World Age Group Competitions in Trampoline Gymnastics in Baku - Farid Gayibov"
[5] "Declaration on forest protection, land use adopted by 105 countries"
[6] "Russian Security Council's chief, CIA director meet in Moscow"
[7] "Israel to exhibit for 1st time at Dubai Airshow"
[8] "Azerbaijan's General Prosecutor's Office continues to take measures on appeal against Armenia"
[9] "Azerbaijani, Russian FMs discuss activity of working group for restoration of communications in South Caucasus"
[10] "Russia holds tenth meeting of joint Azerbaijani-Russian Demarcation Commission"
[11] "Only external reasons cause inflation in Azerbaijan - Gazprombank"
[12] "State Oil Fund of Azerbaijan launches tender for technical vendor support"
Get Article Links
lin = remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.category-news-wrapper') %>% html_nodes('.article-link')
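The lin object holds the anchor nodes themselves; to get the URLs, pull the href attribute (a follow-up sketch, assuming the hrefs may be site-relative):
links <- lin %>% html_attr('href')
# prepend the domain where the href is site-relative
links <- ifelse(startsWith(links, '/'), paste0('https://en.trend.az', links), links)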
Get Article category, date and time
remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-meta') %>%
html_text()
[1] "\n Other News\n 2 November 23:55\n "
[2] "\n Society\n 2 November 23:14\n "
[3] "\n Kyrgyzstan\n 2 November 22:55\n "
[4] "\n Society\n 2 November 22:51\n "
[5] "\n Other News\n 2 November 22:26\n "
[6] "\n Russia\n 2 November 21:50\n "
[7] "\n Israel\n 2 November 21:24\n "
[8] "\n Politics\n 2 November 20:50\n "
[9] "\n Politics\n 2 November 20:25\n "
[10] "\n Politics\n 2 November 20:16\n "
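The meta strings carry the page's whitespace; a small cleanup sketch (the regexes are assumptions based on the format shown above) splits them into category and datetime:
library(stringr)
meta <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.category-article') %>% html_nodes('.article-meta') %>%
  html_text() %>%
  str_squish()
category <- str_trim(str_extract(meta, "^\\D+"))  # leading non-digit run, e.g. "Other News"
datetime <- str_extract(meta, "\\d.*$")           # remainder, e.g. "2 November 23:55"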
Web scraping using R's rvest package and RSelenium
All the data is returned in a JSON file, so you can construct the tables from it. For example, the first table:
library(jsonlite)
library(data.table)
appData <- fromJSON("http://priceonomics.com/static/js/hotels/all_data.json")
# replicate table
myDf <- data.frame(
  City = names(appData),
  Price = sapply(appData, function(x) x$air$apt$p),
  stringsAsFactors = FALSE
)
setDT(myDf)
> myDf[order(Price, decreasing = TRUE)][1:10]
City Price
1: Boston, MA 185.0
2: New York, NY 180.0
3: San Francisco, CA 165.0
4: Cambridge, MA 155.0
5: Scottsdale, AZ 142.5
6: Charlotte, NC 139.5
7: Charleston, SC 139.5
8: Las Vegas, NV 135.0
9: Miami, FL 135.0
10: Chicago, IL 130.0
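A purrr-based equivalent of the same table, using the identical air$apt$p path (so no new assumptions about the JSON structure):
library(purrr)
library(dplyr)
appData %>%
  imap_dfr(~ tibble(City = .y, Price = .x$air$apt$p)) %>%
  arrange(desc(Price)) %>%
  head(10)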
Scrape webpage using Rvest
It is dynamically populated. If you don't mind some very minor differences, you can issue two requests: one to the initial URL to pick up a timestamp value, then an API request (as the page does) with that timestamp added in, so as to get predictions for the right period. Parse the response to reach the JSON holding the avisos.
library(httr)
library(rvest)
library(jsonlite)
headers = c('Referer' = 'https://www.aemet.es/es/eltiempo/prediccion/avisos?w=mna')
date_value <- read_html('https://www.aemet.es/es/eltiempo/prediccion/avisos?w=mna') %>%
  html_node('#fecha-seleccionada-origen') %>%
  html_attr('value')
data <- httr::GET(
  url = paste0('https://www.aemet.es/es/api-eltiempo/resumen-avisos-geojson/PB/', date_value, '/D+1'),
  httr::add_headers(.headers = headers)
)
avisos <- jsonlite::parse_json(
  read_html(data$content) %>% html_node('p') %>% html_text()
)$objects$Avisos$geometries
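The exact shape of avisos depends on AEMET's response, so it is worth inspecting before flattening it further:
str(avisos, max.level = 1)  # see what the parsed geometries list contains
length(avisos)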
How To Rotate Proxies and IP Addresses using R and rvest
Interesting question. The first thing to note is that, as mentioned on this GitHub issue, rvest and xml2 use httr for the connections. As such, I'm going to introduce httr into this answer.
Using a proxy with httr
The following code chunk shows how to use httr to query a url through a proxy and extract the HTML content.
page <- httr::content(
  httr::GET(
    url,
    httr::use_proxy(ip, port, username, password)  # route the request through the proxy
  )
)
If you are using IP authentication or don't need a username and password, you can simply exclude those values from the call.
In short, you can replace the page = read_html(site_url) line with the code chunk above.
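For instance, a small wrapper (hypothetical name read_html_via_proxy) packages this up so it can slot into existing rvest code:
read_html_via_proxy <- function(url, ip, port, username = NULL, password = NULL) {
  # GET through the proxy, then parse the body into an xml2 document
  httr::content(
    httr::GET(
      url,
      httr::use_proxy(ip, port, username, password)
    )
  )
}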
Rotating the Proxies
One big problem with using proxies is getting reliable ones. For this, I'm just going to assume that you have a reliable source. Since you haven't indicated otherwise, I'm going to assume that your proxies are stored in the following reasonable format, in an object named proxies:
ip | port
---|---
64.235.204.107 | 8080
167.71.190.253 | 80
185.156.172.122 | 3128
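From there, one way to rotate (a sketch, assuming proxies is a data frame with ip and port columns, and urls is a placeholder vector of your own target pages) is to cycle through the rows, one proxy per request, wrapping around when the list runs out:
urls <- c('https://example.com/page1', 'https://example.com/page2')  # placeholder targets
pages <- lapply(seq_along(urls), function(i) {
  p <- proxies[((i - 1) %% nrow(proxies)) + 1, ]  # next proxy, recycled
  read_html_via_proxy(urls[i], p$ip, p$port)
})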
R - Scraping with rvest package
You just need to add html_table at the end of the chain:
library(rvest)
url <- read_html("https://www.hockey-reference.com/teams/CGY/2010.html")
url %>%
html_node(xpath = "//*[@id='team_stats']") %>%
html_table()
Alternatively:
library(rvest)
url %>%
html_table() %>%
.[[1]]
Both solutions return:
Team AvAge GP W L OL PTS PTS% GF GA SRS SOS TG/G PP PPO PP% PPA PPOA PK% SH SHA S S% SA SV% PDO
1 Calgary Flames 28.8 82 40 32 10 90 0.549 201 203 -0.03 0.04 5.05 43 268 16.04 54 305 82.30 7 1 2350 8.6 2367 0.916 100.1
2 League Average 27.9 82 41 31 10 92 0.561 233 233 0.00 0.00 5.68 56 304 18.23 56 304 81.77 6 6 2486 9.1 2479 0.911 NA
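Either way the result is an ordinary data frame, so the usual cleanup applies; for example, snake_case headers via janitor (an optional extra, not part of the original answer):
library(janitor)
url %>%
  html_node(xpath = "//*[@id='team_stats']") %>%
  html_table() %>%
  clean_names()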