Equivalent of which in scraping?

Based on R: Conditional evaluation when using the pipe operator %>%, you can do something like:

page %>% 
  html_nodes(xpath = '//td[@class="id-tag"]') %>%
  {ifelse(is.na(html_node(., xpath = "span")),
          html_text(.),
          {html_node(., xpath = "span") %>% html_attr("title")})}

I think it is possibly simpler to discard the pipe and save some of the objects created along the way:

nodes <- html_nodes(page, xpath = '//td[@class="id-tag"]')
text  <- html_text(nodes)
title <- html_attr(html_node(nodes, xpath = "span"), "title")
value <- ifelse(is.na(html_node(nodes, xpath = "span")), text, title)

An XPath-only approach might be:

page %>% 
html_nodes(xpath='//td[@class="id-tag"]/span/@title|//td[@class="id-tag"][not(.//span)]') %>%
html_text()
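The same fall-back logic (use the span's title attribute when a span is present, otherwise the cell's own text) can be sketched in Python with the standard library's ElementTree; the HTML snippet here is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet of the page; td cells may or may not contain a span.
html = """<table><tr>
  <td class="id-tag"><span title="Full Identifier A">A1</span></td>
  <td class="id-tag">plain-id-42</td>
</tr></table>"""

root = ET.fromstring(html)
values = []
for td in root.findall(".//td[@class='id-tag']"):
    span = td.find("span")
    # Prefer the span's title attribute; fall back to the cell's own text.
    values.append(span.get("title") if span is not None else td.text)

print(values)  # ['Full Identifier A', 'plain-id-42']
```

Looping over the cells one at a time is the Python analogue of the ifelse call: every cell produces exactly one value, so the output stays aligned with the input.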

How do you scrape items together so you don't lose the index?

The problem you are facing is that not every child node is present in all of the parent nodes. The best way to handle these situations is to collect all parent nodes in a list/vector and then extract the desired information from each parent using the html_node function. html_node will always return one result for every node, even if it is NA.

library(rvest)

# read the page just once
base_url <- "https://www.uchealth.com/providers"
page <- read_html(base_url)

# parse out the parent node for each provider
providers <- page %>% html_nodes('ul[id=providerlist]') %>% html_children()

# parse out the requested information from each child
dept <- providers %>% html_node("[class^='department']") %>% html_text()
location <- providers %>% html_node('[class=locations]') %>% html_text()

The lengths of providers, dept and location should all be equal.
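The same index-preserving pattern can be sketched in Python with the standard library's ElementTree: query each parent individually and emit None when a child is missing, so the result vectors stay aligned. The markup below is invented to mirror the provider list, with the second entry missing its department:

```python
import xml.etree.ElementTree as ET

# Invented markup mirroring the provider list; the second <li> has no department.
html = """<ul id="providerlist">
  <li><span class="department">Cardiology</span><span class="locations">Denver</span></li>
  <li><span class="locations">Aurora</span></li>
</ul>"""

providers = list(ET.fromstring(html))  # one element per <li>

def first_text(parent, cls):
    # Like rvest's html_node: exactly one result per parent, None when absent.
    node = parent.find(f"span[@class='{cls}']")
    return node.text if node is not None else None

dept = [first_text(li, "department") for li in providers]
location = [first_text(li, "locations") for li in providers]

print(dept)      # ['Cardiology', None]
print(location)  # ['Denver', 'Aurora']
```

Because each parent contributes exactly one slot per field, dept and location have the same length as providers and the None marks which provider lacked a department.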

Web scraping with Scheme

I've used a combination of the Racket net/url library, the html-parsing package, and SXML (especially sxpath, for XPath queries). Actually, I wrote some wrappers around net/url that make it slightly easier to use, IMO.

When I've needed to handle cookies, I've called out to the curl command instead of using net/url.

Is there a Python equivalent for the Perl module Term::VT102?

Pexpect, which has support for VT100, might be of help to you.

Web scraping with Python without loading the whole page

Reverse engineer the API calls.

You should analyze the network tab for the incoming and outgoing requests and view the response for each request. Alternatively, you can copy the request as curl and use Postman to analyze it. Postman has a handy feature that generates Python code for the requests library and urllib. Most sites return a JSON response, but sometimes you may get an HTML response.
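Once you have found the underlying endpoint in the network tab, you can call it directly and parse the JSON instead of rendering the whole page. A minimal sketch with the standard library, using an invented response body (the endpoint URL in the comment is hypothetical):

```python
import json

# Hypothetical response body, as copied from the network tab's response view.
body = '{"results": [{"name": "Dr. A"}, {"name": "Dr. B"}]}'

# With a real endpoint you would fetch it first, e.g.:
#   from urllib.request import urlopen
#   body = urlopen("https://www.sitename.com/api/providers").read()
data = json.loads(body)
names = [r["name"] for r in data["results"]]
print(names)  # ['Dr. A', 'Dr. B']
```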

Some sites do not allow scraping.
Make sure to check robots.txt for the website you will be scraping. You can find it at www.sitename.com/robots.txt.
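The standard library can evaluate robots.txt rules for you. The file contents below are invented for illustration; with a real site you would point the parser at www.sitename.com/robots.txt instead:

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt contents; for a real site, use
# rp.set_url("https://www.sitename.com/robots.txt") and rp.read() instead.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("*", "https://www.sitename.com/providers"))  # True
print(rp.can_fetch("*", "https://www.sitename.com/private/x"))  # False
```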

For more info - https://www.youtube.com/watch?v=LPU08ZfP-II&list=PLL2hlSFBmWwwvFk4bBqaPRV4GP19CgZug
