Inputting NA where there are missing values when scraping with rvest

The simplest way is to select a node that encloses both of the nodes you want for each row, then iterate over those enclosing nodes, pulling out both pieces at once. purrr::map_df is handy here: it not only iterates but also combines the results into a tibble:

library(rvest)
library(purrr)

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14"

page <- read_html(url)

df <- page %>%
  html_nodes('article') %>% # select enclosing nodes
  # iterate over each, pulling out desired parts and coerce to data.frame
  map_df(~list(title = html_nodes(.x, 'h3 a') %>%
                 html_text() %>%
                 {if(length(.) == 0) NA else .}, # replace length-0 elements with NA
               length = html_nodes(.x, '.tile .caption') %>%
                 html_text() %>%
                 {if(length(.) == 0) NA else .}))

df
#> # A tibble: 12 x 2
#> title length
#> <chr> <chr>
#> 1 Introduction to Natural Language Processing with R II 01:15:00
#> 2 Introduction to Natural Language Processing with R 01:22:13
#> 3 Solving iteration problems with purrr II 01:22:49
#> 4 Solving iteration problems with purrr 01:32:23
#> 5 Markov-Switching GARCH Models in R: The MSGARCH Package 15:55
#> 6 Interactive bullwhip effect exploration using SCperf and Shiny 16:02
#> 7 Actuarial and statistical aspects of reinsurance in R 14:15
#> 8 Transformation Forests 16:19
#> 9 Room 2.02 Lightning Talks 50:35
#> 10 R and Haskell: Combining the best of two worlds 14:45
#> 11 *GNU R* on a Programmable Logic Controller (PLC) in an Embedded-Linux Environment <NA>
#> 12 Performance Benchmarking of the R Programming Environment on Knight's Landing 19:32
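The `{if(length(.) == 0) NA else .}` idiom appears twice above; it can be factored into a tiny helper. The name `or_na` is made up for this sketch, not part of rvest:

```r
# Replace a zero-length character vector (the text of a missing node) with NA.
or_na <- function(x) if (length(x) == 0) NA_character_ else x

or_na(character(0))  # NA
or_na("01:15:00")    # "01:15:00"
```
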

Rvest scraping child nodes but filling missing values with NA

If you use httr, you can pass in a valid User-Agent header. Rewriting the code to use a data.frame call instead of list means NA is returned where a value is not present.

Swap out html_elements for html_element, so that a missing child node yields NA instead of being dropped.

You also need to amend your xpaths to avoid getting the first node value repeated for each row.

library(tidyverse)
library(httr)

headers <- c("User-Agent" = "Safari/537.36")

r <- httr::GET(url = "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml", httr::add_headers(.headers = headers))

r %>%
  content() %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # (not the complete list of columns)
  map_df(
    ~ data.frame(
      name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>%
        html_text(),
      title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>%
        html_text(),
      put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>%
        html_text()
    )
  )
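A minimal illustration of why html_element matters here, using a made-up two-article document:

```r
library(rvest)

# Toy document: the second article has no link, mimicking a missing value.
doc <- minimal_html('
  <article><h3><a>First title</a></h3></article>
  <article><h3></h3></article>
')

arts <- html_elements(doc, "article")

# html_element() returns exactly one (possibly missing) match per article,
# so the missing link shows up as NA ...
html_element(arts, "a") %>% html_text()
#> [1] "First title" NA

# ... whereas html_elements() silently drops it, misaligning the rows.
html_elements(arts, "a") %>% html_text()
#> [1] "First title"
```
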

Placeholder NA for missing value with rvest - with XPath

Your xpath should be header/h3/a, not /header/h3/a. The leading slash would imply you want to start at the root of the tree again, not the current node.

xx <- read_html("https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14")
xx %>% html_nodes(xpath = "/html/body/main/section[2]/div/article") %>%
  map_df(~list(title = html_nodes(.x, xpath = 'header/h3/a') %>%
                 html_text() %>% {if(length(.) == 0) NA else .}, # replace length-0 elements with NA
               length = html_nodes(.x, xpath = 'a/time') %>%
                 html_text() %>% {if(length(.) == 0) NA else .}))

# title length
# <chr> <chr>
# 1 " Introduction to Natural Language Processing with R II" 01:15:00
# 2 " Introduction to Natural Language Processing with R" 01:22:13
# 3 " Solving iteration problems with purrr II" 01:22:49
# 4 " Solving iteration problems with purrr" 01:32:23
# 5 Markov-Switching GARCH Models in R: The MSGARCH Package 15:55
# 6 Interactive bullwhip effect exploration using SCperf and Shiny 16:02
# 7 Actuarial and statistical aspects of reinsurance in R 14:15
# 8 Transformation Forests 16:19
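The effect of the leading slash can be seen on a made-up fragment: a relative xpath is evaluated from each current node, while one starting with `/` restarts at the document root.

```r
library(rvest)

doc <- minimal_html('
  <article><header><h3><a>First</a></h3></header></article>
  <article><header><h3><a>Second</a></h3></header></article>
')

arts <- html_elements(doc, xpath = "//article")

# Relative xpath: evaluated from each <article> node.
html_element(arts, xpath = "header/h3/a") %>% html_text()
#> [1] "First"  "Second"

# "/header/h3/a" restarts at the root (which is <html>), so nothing matches.
html_element(arts, xpath = "/header/h3/a") %>% html_text()
#> [1] NA NA
```
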

How to write NA for missing results in rvest if there was no content in node (within loop) further how to merge variable with results

I solved the first problem: empty nodes (when "i" is not found on the Yahoo page) are now displayed as NA.

Here is the code:

library(rvest)

# companies
firmen <- c(read.table("Mappe1.txt"))

# init
df <- NULL

# loop
for (i in firmen) {
  # build url
  url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
  page <- read_html(url)
  # grab ticker from yahoo finance; turn empty strings and missing nodes into NA
  table <- page %>%
    html_nodes(xpath = "//*[@id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
    html_text(trim = TRUE) %>%
    replace(!nzchar(.), NA) %>%
    {if (length(.) == 0) NA_character_ else .} %>%
    as.data.frame()

  # bind to dataframe
  df <- rbind(df, table)
}

Now there is just one question left: how can I merge "df" and "firmen" into one table with the columns "tickers" = df and "firmen" = firmen?

df has just one column (named ".") with the results, while firmen contains a number of companies spread over many columns but just one row. Basically I need to transform the list "firmen", but I don't know how.

Thank you for the help
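One way to do the merge, sketched with toy stand-ins for firmen (the one-row, many-column result of read.table) and df (the one-column frame built in the loop); the real data would replace the made-up values:

```r
library(tibble)

# Toy stand-ins (assumptions, not the real scraped data):
firmen <- list(V1 = "Apple", V2 = "Microsoft")
df <- data.frame(. = c("AAPL", "MSFT"))

# Flatten the one-row list of columns into a plain character vector,
# then pair it with the scraped ticker column of df.
result <- tibble(
  firmen  = unlist(firmen, use.names = FALSE),
  tickers = df[[1]]
)
result
```
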

Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?

You need to change your fourth line. You want metascore to have as many elements as title, with NA for those titles that don't have a metascore listed. The way to do this is to extract the item-content nodes, and then, from each of these, to select the ratings-metascore node if it exists, or NA if it doesn't. See ?html_nodes for the difference between html_node and html_nodes. I've also added span to ensure that just the number is returned, without the following word 'metascore'.

imdb <- read_html("http://www.imdb.com/search/title?genres=horror&genres=mystery&sort=moviemeter,asc&view=advanced")
title <- html_nodes(imdb, '.lister-item-header a')
title <- html_text(title)
metascore <- html_node(html_nodes(imdb, '.lister-item-content'), '.ratings-metascore span')
metascore <- html_text(metascore)
df <- data.frame(Title = title, Metascore = metascore)

head(df,10)
Title Metascore
1 Mother! <NA>
2 Annabelle: Creation 62
3 Stranger Things <NA>
4 Supernatural <NA>
5 It <NA>
6 The Vampire Diaries <NA>
7 Get Out 84
8 The Originals <NA>
9 Annabelle 37
10 Grimm <NA>

How to get rvest or sapply to skip NA values?

Looking at the target links, you can try the following approach. First, scrape all the links from https://ideas.repec.org/e/ and build the complete URLs. Then check whether each link exists. (There are about 26,000 links for this URL, and I did not have time to check them all, so I used just 100 URLs in the demonstration below.) Finally, extract the links that exist.

library(rvest)
library(httr)
library(tidyverse)

# Get all possible links from this webpage. There are 26665 links.

read_html("https://ideas.repec.org/e/") %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  .[grepl(x = ., pattern = "html")] -> x

# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")

# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]

# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_samples, http_error)

# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]

Then, for each link, I tried to extract the affiliation information. There are several spots with h3, so I specifically targeted the h3 nodes inside the element with id = "affiliation". If there is no affiliation information, R returns character(0); when enframe() and unnest() are applied, these elements are removed. For instance, pab127 does not have any affiliation information, so there is no entry for that link.

lapply(urls, function(x){

  read_html(x, encoding = "UTF-8") %>%
    html_nodes(xpath = '//*[@id="affiliation"]') %>%
    html_nodes("h3") %>%
    html_text() %>%
    trimws() -> foo

  return(foo)}) -> mylist

Then, I assigned names to mylist with the links and created a data frame.

names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")

enframe(mylist) %>%
  unnest(value)

name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
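The dropping of empty entries described above can be reproduced with a toy list standing in for mylist (the names and affiliation strings here are made up):

```r
library(tidyverse)

# Toy stand-in: one author page had no affiliation block,
# so its element is character(0).
mylist <- list(
  paa1   = c("Affiliation A", "Affiliation B"),
  pab127 = character(0),
  paa2   = "Affiliation C"
)

# pab127 contributes zero rows, so it vanishes after unnest().
enframe(mylist) %>%
  unnest(value)
```
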

