Package "Rvest" for Web Scraping Https Site with Proxy

Web-scraping using rvest package, how do I loop over a bunch of county FIPS codes?

You could first make a request to the geo file to collect all the county codes and names associated with a given state; this can be done with a helper function. You can then write a second helper that takes a county id/code, requests the corresponding webpage (the url is a base string joined with that code), and tidies the returned html into a data frame containing the info of interest. Map that latter function with future_map_dfr, from furrr, to return a single combined data frame.

Notes:

Code is written with R 4.1.0+ syntax.

Credit to @hrbrmstr for approach to handling br elements.



library(rvest)
library(tidyverse)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#>     flatten
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#>     chisq.test, fisher.test
library(furrr)
#> Loading required package: future
library(xml2)

# Collect the county ids and names for a given state and build each county's summary url
state_county_codes <- \(state_code) {
  read_html(sprintf("https://farm.ewg.org/ammap/maps/js/%sCounties.js", state_code)) |>
    html_text() |>
    stringr::str_match("(\\[.*\\])") |>
    {
      \(x) x[, 1]
    }() |>
    jsonlite::parse_json(simplifyVector = TRUE) |>
    select(-d) |>
    mutate(
      id = substr(id, 2, 6),
      webpage = paste0("https://farm.ewg.org/regionsummary.php?fips=", id)
    ) |>
    as_tibble()
}

# Scrape and tidy the subsidy summary table for a single county
county_summary <- \(county_code) {
  page <- read_html(sprintf("https://farm.ewg.org/regionsummary.php?fips=%s", county_code))

  # replace <br> elements with "#" markers so multi-line cells can be split later
  xml_find_all(page, ".//br") |> xml_add_sibling("p", "#")
  xml_find_all(page, ".//br") |> xml_remove()

  t <- page |>
    html_element(".table") |>
    html_table()

  t <- t[-c(5)] |> clean_names()

  data.frame(
    id = county_code,
    year = t$year |> stringi::stri_remove_empty() |> rep(4) |>
      {
        \(x) stringr::str_replace(x, "‡", "")
      }(),
    subsidy_category = stringr::str_split_fixed(t$subsidy_category, "#", 4) |>
      stringi::stri_remove_empty() |> as.vector(),
    amount = stringr::str_split_fixed(t$subsidy_category_2, "#", 4) |>
      stringi::stri_remove_empty() |> as.vector(),
    number = stringr::str_split_fixed(t$subsidy_category_3, "#", 4) |>
      stringi::stri_remove_empty() |> as.vector()
  )
}

state_code <- "co"

counties <- state_county_codes(state_code)

no_cores <- future::availableCores() - 1
future::plan(future::multisession, workers = no_cores)
results <- future_map_dfr(counties$id, .f = county_summary)

final <- dplyr::left_join(results, counties, by = "id") |>
  select(title, everything()) |>
  rename(county = title)

Created on 2021-11-03 by the reprex package (v2.0.1)
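
One optional postscript, assuming you don't need the parallel workers after the scrape: switch the future plan back to sequential so the multisession workers are shut down.

# release the multisession workers once the scrape has finished
future::plan(future::sequential)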

Online newspaper data scraping with R, 'rvest' package

The webpage is dynamically loaded: new articles appear as you scroll down. Thus you need RSelenium along with rvest to extract the required data.

Launch browser

library(rvest)
library(RSelenium)

url <- 'https://en.trend.az/archive/2021-11-02'
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
# click outside in an empty space
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()

webElem <- remDr$findElement("css", "body")
# scroll to the end of the webpage to load all articles
for (i in 1:17) {
  Sys.sleep(2)
  webElem$sendKeysToElement(list(key = "end"))
}
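
The fixed 17 scrolls are tuned to this particular archive page. If you don't know in advance how many scrolls are needed, a rough sketch (reusing remDr and webElem from above, plus the .category-article selector used in the next step) is to keep scrolling until the article count stops growing:

n_prev <- -1
repeat {
  webElem$sendKeysToElement(list(key = "end"))
  Sys.sleep(2)
  # count the articles currently rendered on the page
  n_now <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes('.category-article') %>%
    length()
  if (n_now == n_prev) break  # no new articles loaded, stop scrolling
  n_prev <- n_now
}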

Get Article Titles

remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.category-article') %>%
  html_nodes('.article-title') %>%
  html_text()
[1] "Chelsea defeats Malmö with minimum score"
[2] "Iran’s import of COVID-19 vaccine exceeds 146mn doses: IRICA"
[3] "Sadyr Zhaparov, Fumio Kishida discuss topical issues of Kyrgyz-Japanese relations"
[4] "We will definitely see new names at World Championships and World Age Group Competitions in Trampoline Gymnastics in Baku - Farid Gayibov"
[5] "Declaration on forest protection, land use adopted by 105 countries"
[6] "Russian Security Council's chief, CIA director meet in Moscow"
[7] "Israel to exhibit for 1st time at Dubai Airshow"
[8] "Azerbaijan's General Prosecutor's Office continues to take measures on appeal against Armenia"
[9] "Azerbaijani, Russian FMs discuss activity of working group for restoration of communications in South Caucasus"
[10] "Russia holds tenth meeting of joint Azerbaijani-Russian Demarcation Commission"
[11] "Only external reasons cause inflation in Azerbaijan - Gazprombank"
[12] "State Oil Fund of Azerbaijan launches tender for technical vendor support"


Get Links of articles

lin <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.category-news-wrapper') %>%
  html_nodes('.article-link')
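
Those nodes hold the urls in their href attribute; a short follow-up pulls them out and, in case they come back as relative paths, resolves them against the site root:

links <- lin %>% html_attr('href')
# resolve any relative paths against the site root
links <- xml2::url_absolute(links, 'https://en.trend.az')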

Get Article category, date and time

remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.category-article') %>%
  html_nodes('.article-meta') %>%
  html_text()
[1] "\n Other News\n 2 November 23:55\n "
[2] "\n Society\n 2 November 23:14\n "
[3] "\n Kyrgyzstan\n 2 November 22:55\n "
[4] "\n Society\n 2 November 22:51\n "
[5] "\n Other News\n 2 November 22:26\n "
[6] "\n Russia\n 2 November 21:50\n "
[7] "\n Israel\n 2 November 21:24\n "
[8] "\n Politics\n 2 November 20:50\n "
[9] "\n Politics\n 2 November 20:25\n "
[10] "\n Politics\n 2 November 20:16\n "

Web scraping using R's rvest package and RSelenium

All the data is returned in a JSON file, so you can construct the tables from it directly. For example, the first table:

library(jsonlite)
library(data.table)

appData <- fromJSON("http://priceonomics.com/static/js/hotels/all_data.json")

# replicate table
myDf <- data.frame(
  City = names(appData),
  Price = sapply(appData, function(x) x$air$apt$p),
  stringsAsFactors = FALSE
)
setDT(myDf)
> myDf[order(Price, decreasing = TRUE)][1:10]
City Price
1: Boston, MA 185.0
2: New York, NY 180.0
3: San Francisco, CA 165.0
4: Cambridge, MA 155.0
5: Scottsdale, AZ 142.5
6: Charlotte, NC 139.5
7: Charleston, SC 139.5
8: Las Vegas, NV 135.0
9: Miami, FL 135.0
10: Chicago, IL 130.0
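
To build the remaining tables, it helps to first see which other fields each city's entry carries; a quick exploratory check (nothing here depends on the exact field names):

# peek at one city's entry to find the other available fields
str(appData[[1]], max.level = 2)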

Scrape webpage using rvest

The page is dynamically populated. If you don't mind some very minor differences, you can issue two requests: one to the initial url to pick up a timestamp value, then an API request (as the page itself does), adding in the previously retrieved timestamp so as to get predictions for the right period. Parse the response to get at the JSON holding the avisos.

library(httr)
library(rvest)
library(jsonlite)

headers <- c('Referer' = 'https://www.aemet.es/es/eltiempo/prediccion/avisos?w=mna')

date_value <- read_html('https://www.aemet.es/es/eltiempo/prediccion/avisos?w=mna') %>%
  html_node('#fecha-seleccionada-origen') %>%
  html_attr('value')

data <- httr::GET(
  url = paste0('https://www.aemet.es/es/api-eltiempo/resumen-avisos-geojson/PB/', date_value, '/D+1'),
  httr::add_headers(.headers = headers)
)

avisos <- jsonlite::parse_json(
  read_html(data$content) %>% html_node('p') %>% html_text()
)$objects$Avisos$geometries
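
avisos comes back as a nested list; before tidying it further, a quick exploratory look at its size and at one element:

length(avisos)
str(avisos[[1]], max.level = 2)  # assumes at least one aviso was returned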

How To Rotate Proxies and IP Addresses using R and rvest

Interesting question. I think the first thing to note is that, as mentioned on this Github issue, rvest and xml2 use httr for the connections. As such, I'm going to introduce httr into this answer.

Using a proxy with httr

The following code chunk shows how to use httr to query a url using a proxy and extract the html content.

page <- httr::content(
  httr::GET(
    url,
    httr::use_proxy(ip, port, username, password)
  )
)

If you are using IP authentication or don't need a username and password, you can simply exclude those values from the call.

In short, you can replace the page = read_html(site_url) call in your code with the chunk above.
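
For example, a small drop-in helper (the name read_html_proxied is made up here) that behaves like read_html() but routes the request through the proxy:

library(httr)
library(rvest)

# hypothetical wrapper around the chunk above
read_html_proxied <- function(site_url, ip, port, username = NULL, password = NULL) {
  httr::content(
    httr::GET(site_url, httr::use_proxy(ip, port, username, password))
  )
}

# page <- read_html_proxied(site_url, ip, port)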

Rotating the Proxies

One big problem with using proxies is getting reliable ones. For this, I'm just going to assume that you have a reliable source. Since you haven't indicated otherwise, I'm going to assume that your proxies are stored in the following reasonable format with object name proxies:

ip                port
64.235.204.107    8080
167.71.190.253    80
185.156.172.122   3128
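
Given that layout, one possible way to rotate them (a sketch only; scrape_with_rotation is a made-up name) is a simple round-robin, where request i goes out through proxy row ((i - 1) mod nrow(proxies)) + 1:

library(httr)
library(rvest)
library(purrr)

# sketch: cycle through the rows of `proxies` so successive requests
# leave through different IP addresses
scrape_with_rotation <- function(urls, proxies) {
  map2(urls, seq_along(urls), function(site_url, i) {
    proxy <- proxies[((i - 1) %% nrow(proxies)) + 1, ]
    httr::content(
      httr::GET(site_url, httr::use_proxy(proxy$ip, as.integer(proxy$port)))
    )
  })
}

# usage: pages <- scrape_with_rotation(urls, proxies)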

R - Scraping with rvest package

You just need to add html_table at the end of the chain:

library(rvest)

url <- read_html("https://www.hockey-reference.com/teams/CGY/2010.html")

url %>%
  html_node(xpath = "//*[@id='team_stats']") %>%
  html_table()

Alternatively:

library(rvest)

url %>%
  html_table() %>%
  .[[1]]

Both solutions return:

            Team AvAge GP  W  L OL PTS  PTS%  GF  GA   SRS  SOS TG/G PP PPO   PP% PPA PPOA   PK% SH SHA    S  S%   SA   SV%   PDO
1 Calgary Flames 28.8 82 40 32 10 90 0.549 201 203 -0.03 0.04 5.05 43 268 16.04 54 305 82.30 7 1 2350 8.6 2367 0.916 100.1
2 League Average 27.9 82 41 31 10 92 0.561 233 233 0.00 0.00 5.68 56 304 18.23 56 304 81.77 6 6 2486 9.1 2479 0.911 NA
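
If you plan to hand the result to dplyr afterwards, a small optional follow-up (reusing janitor, already loaded in the first answer) converts the column names to snake_case and the result to a tibble:

library(dplyr)
library(janitor)

team_stats <- url %>%
  html_node(xpath = "//*[@id='team_stats']") %>%
  html_table() %>%
  clean_names() %>%
  as_tibble()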

