How do you scrape items together so you don't lose the index?
The problem you are facing is that not every child node is present in all of the parent nodes. The best way to handle this situation is to collect all of the parent nodes in a list/vector and then extract the desired information from each parent with the html_node function. html_node always returns exactly one result per node, even if that result is NA.
library(rvest)

# Read the page just once
base_url <- "https://www.uchealth.com/providers"
page <- read_html(base_url)

# Parse out the parent node for each provider
providers <- page %>% html_nodes('ul[id=providerlist]') %>% html_children()

# Parse out the requested information from each child
dept <- providers %>% html_node("[class^='department']") %>% html_text()
location <- providers %>% html_node('[class=locations]') %>% html_text()
The length of providers, dept and location should all be equal.
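The same collect-parents-first technique carries over to Python with BeautifulSoup, where select_one returns None (instead of NA) for a missing child. A minimal sketch with a made-up HTML fragment standing in for the provider list:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the provider list; the second entry
# lacks a department, just like the problematic rows on the real page.
html = """
<ul id="providerlist">
  <li><span class="department">Cardiology</span><span class="locations">Denver</span></li>
  <li><span class="locations">Aurora</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect every parent node first...
providers = soup.select("ul#providerlist > li")
# ...then query each parent individually; select_one returns None when
# the child is absent, so the rows stay aligned.
dept = [d.get_text() if (d := p.select_one("[class^=department]")) else None
        for p in providers]
location = [p.select_one(".locations").get_text() for p in providers]

assert len(providers) == len(dept) == len(location)
print(dept)      # ['Cardiology', None]
print(location)  # ['Denver', 'Aurora']
```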
rvest scraping data with different length
Inspection of the web page shows that the class is .price
when the price has a value, and .price-txt
when it does not. So one solution is to use an XPath expression in html_nodes()
and match classes that start with "price":
listed_price <- web_page %>%
  html_nodes(xpath = "//p[starts-with(@class, 'price')]") %>%
  html_text()
length(listed_price)
[1] 60
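The same starts-with trick works in Python via lxml's XPath engine. A small sketch with invented markup (the class names mirror the ones described above):

```python
from lxml import html as lxml_html

# Hypothetical snippet: the class is "price" when a value exists
# and "price-txt" when it does not.
doc = lxml_html.fromstring("""
<div>
  <p class="price">$10</p>
  <p class="price-txt">Call for price</p>
  <p class="other">skip me</p>
</div>
""")

# starts-with(@class, 'price') matches both variants, so every
# product contributes exactly one node to the result.
listed_price = doc.xpath("//p[starts-with(@class, 'price')]/text()")
print(listed_price)  # ['$10', 'Call for price']
```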
Scraping with rvest - complete with NAs when tag is not present
If the tag is not found, rvest returns character(0). So, assuming you will find at most one current and one regular price in each div.product_price, you can use this:
pacman::p_load("rvest", "dplyr")

get_prices <- function(node){
  r.precio.antes  <- html_nodes(node, 'p.normal_encontrado') %>% html_text()
  r.precio.actual <- html_nodes(node, 'div.price') %>% html_text()
  data.frame(
    precio.antes  = ifelse(length(r.precio.antes) == 0, NA, r.precio.antes),
    precio.actual = ifelse(length(r.precio.actual) == 0, NA, r.precio.actual),
    stringsAsFactors = FALSE
  )
}

doc <- read_html('test.html') %>% html_nodes("div.product_price")

# rbind_all has been removed from dplyr; bind_rows is its replacement.
lapply(doc, get_prices) %>%
  bind_rows()
Edited: I misunderstood the input data, so I changed the script to work with just a single HTML page.
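The same per-node extraction function translates naturally to Python with BeautifulSoup and pandas. A sketch under invented markup (the class names follow the answer above; missing tags simply become None, which pandas treats as missing):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical product markup: the second product lacks a "before" price.
html = """
<div class="product_price"><p class="normal_encontrado">$20</p><div class="price">$15</div></div>
<div class="product_price"><div class="price">$9</div></div>
"""

def get_prices(node):
    antes = node.select_one("p.normal_encontrado")
    actual = node.select_one("div.price")
    # A missing tag yields None rather than raising, one row per product.
    return {
        "precio.antes": antes.get_text() if antes else None,
        "precio.actual": actual.get_text() if actual else None,
    }

soup = BeautifulSoup(html, "html.parser")
df = pd.DataFrame([get_prices(n) for n in soup.select("div.product_price")])
print(df)
```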
Loop to scrape multiple elements on the same page while storing them separately
names = hxs.xpath('//td[@class="product_name"]/strong/text()')
imageurls = hxs.xpath('//tr/td[@align="center"]/a/img/@src')

for name, url in zip(names, imageurls):
    item = {}  # create a fresh item per product so earlier yields aren't overwritten
    item["productname"] = name
    item["imgurl"] = url
    yield item
This is the simplest way of doing it, since the order of the names and the image URLs corresponds when they are extracted.
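Outside of Scrapy, the pairing step itself is just zip over two equally ordered lists. A self-contained sketch with made-up values standing in for the two XPath results:

```python
# Hypothetical extracted lists; in the spider above these come from
# the two hxs.xpath() queries, which return nodes in document order.
names = ["Widget A", "Widget B"]
imageurls = ["/img/a.png", "/img/b.png"]

# zip pairs the i-th name with the i-th URL, relying on matching order.
items = [{"productname": n, "imgurl": u} for n, u in zip(names, imageurls)]
print(items)
```

Note that zip silently truncates to the shorter list, so this only works when every product has both fields; otherwise use the parent-node technique from the first answer.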
How to scrape headers as a different column from paragraphs with rvest, assuming they have different lengths?
One option to achieve your desired result would be to extract the title and the content as a dataframe using e.g. map_dfr.
To this end, first extract the nodes containing both the title and the content via the CSS selector section .article-list .level2.
To deal with the different lengths, you can put the content, which may contain multiple paragraphs, inside a list column and unnest it later on. Additionally, to keep only the ARTICULOS,
I had to add a filter
to drop the sections which are also extracted by the CSS selector.
library(rvest)
library(tidyverse)
url <- "https://www.constituteproject.org/constitution/Cuba_2018D?lang=es"
html <- read_html(url)
foo <- html %>%
  html_nodes('section .article-list .level2')

final <- map_dfr(foo, ~ tibble(
  titulo = html_nodes(.x, '.float-left') %>% html_text(),
  content = list(html_nodes(.x, "p") %>% html_text()))) %>%
  filter(!grepl("^SEC", titulo)) %>%
  unnest_longer(content)
final
#> # A tibble: 2,145 × 2
#> titulo content
#> <chr> <chr>
#> 1 ARTÍCULO 1 Cuba es un Estado socialista de derecho, democrático, independien…
#> 2 ARTÍCULO 2 El nombre del Estado cubano es República de Cuba, el idioma ofici…
#> 3 ARTÍCULO 3 La defensa de la patria socialista es el más grande honor y el de…
#> 4 ARTÍCULO 3 El socialismo y el sistema político y social revolucionario, esta…
#> 5 ARTÍCULO 3 Los ciudadanos tienen el derecho de combatir por todos los medios…
#> 6 ARTÍCULO 4 Los símbolos nacionales son la bandera de la estrella solitaria, …
#> 7 ARTÍCULO 4 La ley define los atributos que los identifican, sus característi…
#> 8 ARTÍCULO 5 El Partido Comunista de Cuba, único, martiano, fidelista y marxis…
#> 9 ARTÍCULO 6 La Unión de Jóvenes Comunistas, organización de la juventud cuba…
#> 10 ARTÍCULO 7 La Constitución es la norma suprema del Estado. Todos están oblig…
#> # … with 2,135 more rows
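The list-column-then-unnest pattern has a direct pandas analogue: keep the variable-length paragraphs as a Python list per row, then explode(), which plays the role of tidyr's unnest_longer(). A sketch against invented markup that mirrors the article structure:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sections: one title each, a variable number of <p> paragraphs.
html = """
<section class="level2"><div class="float-left">ARTICULO 1</div><p>uno</p></section>
<section class="level2"><div class="float-left">ARTICULO 2</div><p>dos</p><p>tres</p></section>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    {"titulo": s.select_one(".float-left").get_text(),
     # Keep the paragraphs as a list: the pandas equivalent of a list column.
     "content": [p.get_text() for p in s.select("p")]}
    for s in soup.select("section.level2")
]

# explode() repeats the title once per paragraph, like unnest_longer().
final = pd.DataFrame(rows).explode("content", ignore_index=True)
print(final)
```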
How to scrape a specific table with pandas without using its dataframe index?
The tables are, IMHO, actually static and I'd try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

soup = (
    BeautifulSoup(
        requests.get(
            "https://ciffc.net/en/ciffc/ext/member/sitrep/",
            headers=headers,
        ).text,
        "lxml",
    ).find("div", {"data-title": "E: Preparedness Levels"})
)

df = pd.read_html(str(soup), flavor="lxml")[0]
print(df)
This should consistently output:
Agency APL Comments
0 BC 1 NaN
1 YT 3 Yukon is at a level 3 prep level - but will tr...
2 AB 2 NaN
3 SK 1 NaN
4 MB 1 NaN
5 ON 1 NaN
6 QC 1 NaN
7 NL 2 NaN
8 NB 1 NaN
9 NS 1 NaN
10 PE 1 NaN
11 PC 1 NaN
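The core trick, locating the wanted div first and handing only that fragment to pd.read_html, can be shown without any network access. A sketch with an invented two-table page (wrapping the fragment in StringIO also avoids the FutureWarning newer pandas versions emit for literal HTML strings):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page with several tables; only the tagged div is wanted.
html = """
<div data-title="other"><table><tr><th>X</th></tr><tr><td>1</td></tr></table></div>
<div data-title="E: Preparedness Levels">
  <table><tr><th>Agency</th><th>APL</th></tr><tr><td>BC</td><td>1</td></tr></table>
</div>
"""

# Narrow down to the one container, then parse just that fragment.
target = BeautifulSoup(html, "html.parser").find(
    "div", {"data-title": "E: Preparedness Levels"}
)
df = pd.read_html(StringIO(str(target)), flavor="lxml")[0]
print(df)
```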
Equivalent of which in scraping?
Based on R Conditional evaluation when using the pipe operator %>%, you can do something like
page %>%
  html_nodes(xpath='//td[@class="id-tag"]') %>%
  {ifelse(is.na(html_node(., xpath="span")),
          html_text(.),
          {html_node(., xpath="span") %>% html_attr("title")}
  )}
It may be simpler to discard the pipe and save some of the objects created along the way:
nodes <- html_nodes(page, xpath='//td[@class="id-tag"]')
text <- html_text(nodes)
title <- html_attr(html_node(nodes,xpath='span'),"title")
value <- ifelse(is.na(html_node(nodes, xpath="span")), text ,title)
An XPath-only approach might be:
page %>%
html_nodes(xpath='//td[@class="id-tag"]/span/@title|//td[@class="id-tag"][not(.//span)]') %>%
html_text()
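The same union XPath works in Python through lxml, with /text() added on the second branch since there is no html_text() doing that implicitly. A sketch against invented table cells (some wrap the value in a titled span, some hold plain text):

```python
from lxml import html as lxml_html

# Hypothetical cells: the first wraps its value in <span title="...">,
# the second holds plain text with no span.
doc = lxml_html.fromstring("""
<table><tr>
  <td class="id-tag"><span title="Alpha">A</span></td>
  <td class="id-tag">plain text</td>
</tr></table>
""")

# The union takes the span's @title where present, else the cell's own
# text; results come back in document order, keeping rows aligned.
values = doc.xpath(
    '//td[@class="id-tag"]/span/@title'
    ' | //td[@class="id-tag"][not(.//span)]/text()'
)
print(values)  # ['Alpha', 'plain text']
```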
Scrape An Entire Website
Consider HTTrack. It's a free and easy-to-use offline browser utility.
It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.