How to Scrape Items Together So You Don't Lose the Index

How do you scrape items together so you don't lose the index?

The problem you are facing is that not every child node is present in all of the parent nodes. The best way to handle these situations is to collect all parent nodes in a list/vector and then extract the desired information from each parent using the html_node function. Unlike html_nodes, html_node always returns exactly one result per node, returning NA when nothing matches, so positions are preserved.

library(rvest)

# read the page just once
base_url <- "https://www.uchealth.com/providers"
page <- read_html(base_url)

# parse out the parent node for each provider
providers <- page %>% html_nodes('ul[id=providerlist]') %>% html_children()

# parse out the requested information from each child
dept <- providers %>% html_node("[class^='department']") %>% html_text()
location <- providers %>% html_node('[class=locations]') %>% html_text()

The lengths of providers, dept, and location will all be equal, so the vectors stay aligned by provider.
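Because every column is derived from the same providers nodeset, the vectors can be combined directly into one row per provider. Here's a minimal sketch (the result name is mine, and it assumes the selectors above still match the live page):

# one row per provider; missing departments or locations appear as NA
result <- data.frame(dept = dept, location = location,
                     stringsAsFactors = FALSE)
head(result)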

rvest: scraping data with different lengths

Inspection of the web page shows that the class is .price when price has a value, and .price-txt when it does not. So one solution is to use an XPath expression in html_nodes() and match classes that start with "price":

listed_price <- web_page %>%
  html_nodes(xpath = "//p[starts-with(@class, 'price')]") %>%
  html_text()

length(listed_price)
[1] 60
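The same match can also be written with a CSS attribute selector, since ^= is the CSS starts-with operator. A sketch equivalent to the XPath version above, assuming every price class begins with "price":

listed_price <- web_page %>%
  html_nodes("p[class^='price']") %>%
  html_text()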

Scraping with rvest - complete with NAs when tag is not present

If the tag is not found, rvest returns character(0), a zero-length vector that would silently drop out of the result. So, assuming you will find at most one current and one regular price in each div.product_price, you can use this:

pacman::p_load("rvest", "dplyr")

get_prices <- function(node) {
  r.precio.antes  <- html_nodes(node, 'p.normal_encontrado') %>% html_text()
  r.precio.actual <- html_nodes(node, 'div.price') %>% html_text()

  # pad zero-length results with NA so every node yields exactly one row
  data.frame(
    precio.antes  = ifelse(length(r.precio.antes) == 0, NA, r.precio.antes),
    precio.actual = ifelse(length(r.precio.actual) == 0, NA, r.precio.actual),
    stringsAsFactors = FALSE
  )
}

doc <- read_html('test.html') %>% html_nodes("div.product_price")
lapply(doc, get_prices) %>%
  bind_rows()  # bind_rows() replaces the long-deprecated rbind_all()

Edit: I misunderstood the input data, so I changed the script to work with just a single HTML page.
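With purrr, the lapply-and-bind step can be collapsed into a single call. A small variation on the code above (map_dfr() is my substitution, not part of the original answer):

purrr::map_dfr(doc, get_prices)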

Loop to scrape multiple elements on the same page while storing them separately

names = hxs.xpath('//td[@class="product_name"]/strong/text()').extract()
imageurls = hxs.xpath('//tr/td[@align="center"]/a/img/@src').extract()
for name, url in zip(names, imageurls):
    item = ProductItem()  # hypothetical Item subclass; create a fresh item per pair
    item["productname"] = name
    item["imgurl"] = url
    yield item

This is the simplest way of doing it, since the names and image URLs are extracted in the same document order and therefore correspond pairwise.

How to scrape headers as a separate column from paragraphs with rvest, assuming they have different lengths?

One option to achieve your desired result is to extract the title and the content as a dataframe using e.g. map_dfr. To this end I first extract the nodes containing both the title and the content via the CSS selector section .article-list .level2. To deal with the different lengths, the content, which may contain multiple paragraphs, goes inside a list column that can be unnested later on. Additionally, to keep only the ARTÍCULOS, I had to add a filter that drops the section headers which the CSS selector also matches.

library(rvest)
library(tidyverse)

url <- "https://www.constituteproject.org/constitution/Cuba_2018D?lang=es"

html <- read_html(url)

foo <- html %>%
  html_nodes('section .article-list .level2')

final <- map_dfr(foo, ~ tibble(
  titulo = html_nodes(.x, '.float-left') %>% html_text(),
  content = list(html_nodes(.x, "p") %>% html_text())
)) %>%
  filter(!grepl("^SEC", titulo)) %>%
  unnest_longer(content)

final
#> # A tibble: 2,145 × 2
#>    titulo     content
#>    <chr>      <chr>
#>  1 ARTÍCULO 1 Cuba es un Estado socialista de derecho, democrático, independien…
#>  2 ARTÍCULO 2 El nombre del Estado cubano es República de Cuba, el idioma ofici…
#>  3 ARTÍCULO 3 La defensa de la patria socialista es el más grande honor y el de…
#>  4 ARTÍCULO 3 El socialismo y el sistema político y social revolucionario, esta…
#>  5 ARTÍCULO 3 Los ciudadanos tienen el derecho de combatir por todos los medios…
#>  6 ARTÍCULO 4 Los símbolos nacionales son la bandera de la estrella solitaria, …
#>  7 ARTÍCULO 4 La ley define los atributos que los identifican, sus característi…
#>  8 ARTÍCULO 5 El Partido Comunista de Cuba, único, martiano, fidelista y marxis…
#>  9 ARTÍCULO 6 La Unión de Jóvenes Comunistas, organización de la juventud cuba…
#> 10 ARTÍCULO 7 La Constitución es la norma suprema del Estado. Todos están oblig…
#> # … with 2,135 more rows
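If you would rather have one row per article than one row per paragraph, a small variation (my sketch, not from the original answer) pastes the paragraphs together instead of unnesting them:

final_onerow <- map_dfr(foo, ~ tibble(
  titulo = html_nodes(.x, '.float-left') %>% html_text(),
  content = html_nodes(.x, "p") %>% html_text() %>% paste(collapse = " ")
)) %>%
  filter(!grepl("^SEC", titulo))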

How to scrape a specific table with pandas without using its dataframe index?

The tables are, IMHO, actually static and I'd try this:

import requests
from bs4 import BeautifulSoup

import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

soup = BeautifulSoup(
    requests.get(
        "https://ciffc.net/en/ciffc/ext/member/sitrep/",
        headers=headers,
    ).text,
    "lxml",
).find("div", {"data-title": "E: Preparedness Levels"})

df = pd.read_html(str(soup), flavor="lxml")[0]
print(df)

This should consistently output:

   Agency  APL                                           Comments
0      BC    1                                                NaN
1      YT    3  Yukon is at a level 3 prep level - but will tr...
2      AB    2                                                NaN
3      SK    1                                                NaN
4      MB    1                                                NaN
5      ON    1                                                NaN
6      QC    1                                                NaN
7      NL    2                                                NaN
8      NB    1                                                NaN
9      NS    1                                                NaN
10     PE    1                                                NaN
11     PC    1                                                NaN

Equivalent of which in scraping?

Based on "R Conditional evaluation when using the pipe operator %>%", you can do something like:

page %>%
  html_nodes(xpath = '//td[@class="id-tag"]') %>%
  {ifelse(is.na(html_node(., xpath = "span")),
          html_text(.),
          html_node(., xpath = "span") %>% html_attr("title"))}

It may be simpler to discard the pipe and save some of the objects created along the way:

nodes <- html_nodes(page, xpath = '//td[@class="id-tag"]')
text  <- html_text(nodes)
title <- html_attr(html_node(nodes, xpath = "span"), "title")
value <- ifelse(is.na(html_node(nodes, xpath = "span")), text, title)

An XPath-only approach might be:

page %>%
  html_nodes(xpath = '//td[@class="id-tag"]/span/@title|//td[@class="id-tag"][not(.//span)]') %>%
  html_text()

Scrape An Entire Website

Consider HTTrack. It's a free and easy-to-use offline browser utility.

It allows you to download a website from the Internet to a local directory, recursively building all directories and fetching the HTML, images, and other files from the server to your computer.


