Equivalent of which in scraping?
Building on R conditional evaluation with the pipe operator %>%, you can do something like:
page %>%
  html_nodes(xpath = '//td[@class="id-tag"]') %>%
  {ifelse(is.na(html_node(., xpath = "span")),
          html_text(.),
          {html_node(., xpath = "span") %>% html_attr("title")})}
It may be simpler to discard the pipe and save some of the objects created along the way:
nodes <- html_nodes(page, xpath = '//td[@class="id-tag"]')
text  <- html_text(nodes)
title <- html_attr(html_node(nodes, xpath = "span"), "title")
value <- ifelse(is.na(html_node(nodes, xpath = "span")), text, title)
An XPath-only approach might be:
page %>%
html_nodes(xpath='//td[@class="id-tag"]/span/@title|//td[@class="id-tag"][not(.//span)]') %>%
html_text()
How do you scrape items together so you don't lose the index?
The problem you are facing is that not every child node is present in all of the parent nodes. The best way to handle these situations is to collect all parent nodes in a list/vector, then extract the desired information from each parent using the html_node function. html_node always returns exactly one result per node, even if that result is NA.
library(rvest)

# read the page just once
base_url <- "https://www.uchealth.com/providers"
page <- read_html(base_url)

# parse out the parent node for each provider
providers <- page %>% html_nodes('ul[id=providerlist]') %>% html_children()

# parse out the requested information from each child
dept <- providers %>% html_node("[class^='department']") %>% html_text()
location <- providers %>% html_node('[class=locations]') %>% html_text()
The lengths of providers, dept, and location should all be equal.
Web scraping with Scheme
I've used a combination of the Racket net/url library, the html-parsing package, and SXML (especially sxpath, for XPath queries). Actually, I wrote some wrappers around net/url that make it slightly easier to use, IMO.
When I've needed to handle cookies, I've called out to the curl command instead of using net/url.
Is there a Python equivalent for the Perl module Term::VT102?
Pexpect, which has support for VT100, might be of help to you.
Web scraping with Python without loading the whole page
Reverse engineer the API calls.
Analyze the network tab for the incoming and outgoing requests and view the response for each request. Alternatively, you can copy the request as curl and use Postman to analyze it. Postman has a feature that generates Python code for the requests and urllib libraries. Most sites return a json response, but sometimes you may get an html response instead.
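The pattern above can be sketched with the standard library alone. The endpoint URL, headers, and response shape below are all assumptions for illustration, not a real API: you would substitute the URL and headers copied from the network tab, send the request, and decode the JSON body.

```python
import json
import urllib.request

# Hypothetical endpoint discovered in the browser's network tab
# (URL and headers are illustrative assumptions).
api_url = "https://example.com/api/providers?page=1"

# Mimic the browser request copied from the network tab
req = urllib.request.Request(
    api_url,
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
    },
)
# In real use you would send it with:
#   with urllib.request.urlopen(req) as resp:
#       body = resp.read().decode()

# A JSON body of the kind such an endpoint might return
sample_body = '{"providers": [{"name": "A"}, {"name": "B"}]}'
data = json.loads(sample_body)
names = [p["name"] for p in data["providers"]]
```

Scraping the JSON endpoint directly is usually faster and more robust than parsing the rendered HTML, since the data arrives already structured.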
Some sites do not allow scraping.
Make sure to check robots.txt for the website you will be scraping. You can find it at www.sitename.com/robots.txt.
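Python's standard library can evaluate robots.txt rules for you. A minimal sketch using urllib.robotparser (the rules, paths, and user-agent name below are made up for illustration; normally you would point set_url at the site's real robots.txt and call read):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; the rules here are hypothetical.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# Check whether a given user agent may fetch a given URL
allowed = rp.can_fetch("MyScraper", "https://www.sitename.com/providers")   # True
blocked = rp.can_fetch("MyScraper", "https://www.sitename.com/private/x")   # False
```

Checking can_fetch before each request keeps the scraper within the site's stated crawling policy.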
For more info - https://www.youtube.com/watch?v=LPU08ZfP-II&list=PLL2hlSFBmWwwvFk4bBqaPRV4GP19CgZug