Inputting NA Where There Are Missing Values When Scraping with Rvest

Inputting NA where there are missing values when scraping with rvest

The simplest way is to select a node that encloses both of the nodes you want for each row, then iterate over those enclosing nodes, pulling out both values at once. purrr::map_df is handy not only for iterating but also for combining the results into a tibble:

library(rvest)
library(purrr)

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14"

page <- read_html(url)

df <- page %>% 
    html_nodes('article') %>%    # select enclosing nodes
    # iterate over each, pulling out desired parts and coerce to data.frame
    map_df(~list(title = html_nodes(.x, 'h3 a') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
                 length = html_nodes(.x, '.tile .caption') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .}))

df
#> # A tibble: 12 x 2
#>                                                                                title   length
#>                                                                                <chr>    <chr>
#>  1                             Introduction to Natural Language Processing with R II 01:15:00
#>  2                                Introduction to Natural Language Processing with R 01:22:13
#>  3                                          Solving iteration problems with purrr II 01:22:49
#>  4                                             Solving iteration problems with purrr 01:32:23
#>  5                           Markov-Switching GARCH Models in R: The MSGARCH Package    15:55
#>  6                    Interactive bullwhip effect exploration using SCperf and Shiny    16:02
#>  7                             Actuarial and statistical aspects of reinsurance in R    14:15
#>  8                                                            Transformation Forests    16:19
#>  9                                                         Room 2.02 Lightning Talks    50:35
#> 10                                   R and Haskell: Combining the best of two worlds    14:45
#> 11 *GNU R* on a Programmable Logic Controller (PLC) in an Embedded-Linux Environment     <NA>
#> 12     Performance Benchmarking of the R Programming Environment on Knight's Landing    19:32
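
The length-0 check used twice in the pipeline above could also be pulled out into a small helper. This is only a sketch: the name na_if_empty is hypothetical and is not part of rvest or purrr.

```r
# Hypothetical helper (not from rvest or purrr): turn a zero-length result
# into NA so every row contributes exactly one value per column.
na_if_empty <- function(x) {
  if (length(x) == 0) NA_character_ else x
}

na_if_empty(character(0))  # an empty match becomes NA
na_if_empty("15:55")       # a found value passes through unchanged
```

Inside map_df, the helper would take the place of each {if(length(.) == 0) NA else .} block.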

Rvest scraping child nodes but filling missing values with NA

If I simply use httr, I can pass in a valid User-Agent header and rewrite your code to use a data.frame() call instead of list(), so that NA is returned where a value is not present.

Swap html_elements for html_element inside the loop, so that each parent node yields exactly one result per field.

You also need to make your XPaths relative (start them with .//) to avoid getting the first node's value repeated for each row.

library(tidyverse)
library(rvest)   # html_elements() / html_element(); not attached by library(tidyverse)
library(httr)

headers <- c("User-Agent" = "Safari/537.36")

r <- httr::GET(url = "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml", httr::add_headers(.headers = headers))

r %>%
  content() %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(
    ~ data.frame(
      name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>%
        html_text(),
      title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>%
        html_text(),
      put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>%
        html_text()
    )
  )
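
The same pattern can be tried offline on a small inline document. The sketch below assumes rvest 1.0.0 or later (for html_element) and uses made-up HTML in which the second article has no caption; the object name talks is only illustrative.

```r
library(rvest)
library(purrr)

# Made-up HTML: the second article has no .caption node
page <- minimal_html('
  <article><h3><a>Talk A</a></h3><div class="caption">15:55</div></article>
  <article><h3><a>Talk B</a></h3></article>
')

articles <- html_elements(page, "article")

# html_element() returns exactly one result per parent (a missing node
# becomes NA), so each data.frame() call yields exactly one row
talks <- map_df(articles, ~ data.frame(
  title  = html_text(html_element(.x, "h3 a")),
  length = html_text(html_element(.x, ".caption"))
))
talks
```

Because the missing caption comes back as NA rather than as a zero-length vector, no explicit length check is needed here.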

Placeholder NA for missing value with rvest - with XPath

Your XPath should be header/h3/a, not /header/h3/a. A leading slash means the path starts again at the root of the document rather than at the current node.

xx <- read_html("https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14")   
xx %>% html_nodes(xpath = "/html/body/main/section[2]/div/article") %>%  
  map_df(~list(title = html_nodes(.x, xpath = 'header/h3/a') %>% 
                 html_text() %>% {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
               length = html_nodes(.x, xpath = 'a/time') %>% 
                 html_text() %>%  {if(length(.) == 0) NA else .}))

#   title                                                                        length  
#   <chr>                                                                        <chr>   
# 1 " Introduction to Natural Language Processing with R II"                     01:15:00
# 2 " Introduction to Natural Language Processing with R"                        01:22:13
# 3 " Solving iteration problems with purrr II"                                  01:22:49
# 4 " Solving iteration problems with purrr"                                     01:32:23
# 5 Markov-Switching GARCH Models in R: The MSGARCH Package                      15:55   
# 6 Interactive bullwhip effect exploration using SCperf and Shiny               16:02   
# 7 Actuarial and statistical aspects of reinsurance in R                        14:15   
# 8 Transformation Forests                                                       16:19
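
The difference between the relative and the absolute path can be checked on a small inline document. This is a sketch with made-up HTML, assuming rvest 1.0.0 or later; the variable names are illustrative.

```r
library(rvest)

# Made-up HTML with the same header/h3/a nesting as the page above
page <- minimal_html('
  <article><header><h3><a>Talk A</a></h3></header></article>
  <article><header><h3><a>Talk B</a></h3></header></article>
')

first_article <- html_elements(page, "article")[[1]]

# Relative path: evaluated from the current <article> node
relative <- html_elements(first_article, xpath = "header/h3/a")

# Leading slash: evaluated from the document root, whose top-level
# element is <html>, so nothing matches
absolute <- html_elements(first_article, xpath = "/header/h3/a")

length(relative)
length(absolute)
```

The relative query finds the one link inside the first article, while the absolute query comes back empty even though it was called on the article node.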

How to write NA for missing results in rvest if there was no content in node (within loop) further how to merge variable with results

I solved the first problem: empty nodes (when "i" is not found on the Yahoo page) are now shown as NA.

Here is the code:

library(rvest)

# teams
firmen <- c(read.table("Mappe1.txt"))

# init
df <- NULL
table <- NULL

# loop
for(i in firmen){
  # find url
  url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
  page <- read_html(url)
  # grab ticker from yahoo finance; html_node() yields NA when nothing matches
  table <- page %>%
    html_node(xpath = "//*[@id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
    html_text(trim = TRUE) %>% replace(!nzchar(.), NA) %>%
    as.data.frame()
  
  # bind to dataframe

  df <- rbind(df,table)
}

Now there is just one question left.

How can I merge "df" and "firmen" into one table with the columns

"tickers" = df and "firmen" = firmen?

df has just one column, named ".", holding the results, while the list firmen contains the companies spread across many columns in a single row.

Basically, I need to transform the list "firmen", but I don't know how.

Thank you for the help.
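
One possible way to combine the two, sketched with made-up stand-in values rather than the real contents of Mappe1.txt: flatten the list with unlist() and put it next to the single column of df. This assumes df ends up with exactly one row per company, in the same order as firmen; the name result is only illustrative.

```r
# Stand-ins for the objects above (made-up values): a one-row list of
# company names and a one-column data frame of scraped tickers
firmen <- list(V1 = "Apple", V2 = "Siemens", V3 = "Unknown Corp")
df <- data.frame(x = c("AAPL", "SIE.DE", NA), stringsAsFactors = FALSE)
names(df) <- "."   # same column name that as.data.frame() produced in the loop

# Flatten the list into a plain character vector and pair it with the
# ticker column; this assumes both have the same length and order
result <- data.frame(
  firmen  = as.character(unlist(firmen, use.names = FALSE)),
  tickers = df[[1]],
  stringsAsFactors = FALSE
)
result
```

If the loop skipped a company instead of recording NA for it, the two columns would have different lengths and data.frame() would stop with an error, which is a useful check in itself.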

Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?

You need to change your fourth line. You want metascore to have as many elements as title, with NA for those titles that don't have a metascore listed. The way to do this is to extract the item-content nodes, and then, from each of these, to select the ratings-metascore node if it exists, or NA if it doesn't. See ?html_nodes for the difference between html_node and html_nodes. I've also added span to ensure that just the number is returned, without the following word 'metascore'.

library(rvest)

imdb <- read_html("http://www.imdb.com/search/title?genres=horror&genres=mystery&sort=moviemeter,asc&view=advanced")
title <- html_nodes(imdb, '.lister-item-header a')
title <- html_text(title)
metascore <- html_node(html_nodes(imdb, '.lister-item-content'), '.ratings-metascore span')
metascore <- html_text(metascore)
df <- data.frame(Title = title, Metascore = metascore)

head(df,10)
                 Title  Metascore
1              Mother!       <NA>
2  Annabelle: Creation 62        
3      Stranger Things       <NA>
4         Supernatural       <NA>
5                   It       <NA>
6  The Vampire Diaries       <NA>
7              Get Out 84        
8        The Originals       <NA>
9            Annabelle 37        
10               Grimm       <NA>
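
The length mismatch that html_nodes causes, and that html_node avoids, can be seen on a tiny inline document. This is a sketch with made-up HTML and class names, not the real IMDb markup.

```r
library(rvest)

# Made-up HTML: only the first item has a score
page <- minimal_html('
  <div class="item"><a class="title">Film A</a><span class="score">62</span></div>
  <div class="item"><a class="title">Film B</a></div>
')

items <- html_nodes(page, ".item")

# html_nodes() returns only the matches, so the missing score vanishes
found_only <- html_text(html_nodes(page, ".score"))

# html_node() on a node set returns one result per item, NA where absent
aligned <- html_text(html_node(items, ".score"))

length(found_only)  # 1
length(aligned)     # 2
```

Only the aligned vector has one entry per item, so only it can be placed next to the titles in a data frame.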

How to get rvest or sapply to skip NA values?

Judging from the target links, you can try the following approach. First, scrape all links from https://ideas.repec.org/e/ and build the complete URLs. Then check whether each link exists. (There are about 26,000 links under this URL, and I did not have time to check them all, so I used only 100 URLs in the following demonstration.) Finally, extract the existing links.

library(rvest)
library(httr)
library(tidyverse)

# Get all possible links from this webpage. There are 26665 links.

read_html("https://ideas.repec.org/e/") %>% 
html_nodes("td") %>% 
html_nodes("a") %>% 
html_attr("href") %>% 
.[grepl(x = ., pattern = "html")] -> x

# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")

# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]

# Check whether each URL returns an HTTP error. FALSE means the link exists.
foo <- sapply(mylinks_samples, http_error)

# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]

Then, for each link, I extracted the affiliation information. Each page has several h3 elements, so I specifically targeted the h3 elements under the node with id = "affiliation". If there is no affiliation information, R returns character(0). When the list is passed through enframe() and unnest(), these empty elements are dropped. For instance, pab127 does not have any affiliation information, so there is no entry for this link.

lapply(urls, function(x){

    read_html(x, encoding = "UTF-8") %>% 
    html_nodes(xpath = '//*[@id="affiliation"]') %>%
    html_nodes("h3") %>% 
    html_text() %>% 
    trimws() -> foo

     return(foo)}) -> mylist

Then, I assigned names to mylist with the links and created a data frame.

names(mylist) <- sub(x = basename(urls), pattern = "\\.html$", replacement = "")

enframe(mylist) %>% 
unnest(value)

   name  value                                                                                                       
  <chr> <chr>                                                                                                       
 1 paa1  "(80%) Institutt for ØkonomiUniversitetet i Bergen"                                                         
 2 paa1  "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"                                 
 3 paa2  "Department of EconomicsCollege of BusinessUniversity of Wyoming"                                           
 4 paa6  "Statistisk SentralbyråGovernment of Norway"                                                                
 5 paa8  "Centraal Planbureau (CPB)Government of the Netherlands"                                                    
 6 paa9  "(79%) Economic StudiesBrookings Institution"                                                               
 7 paa9  "(21%) Brookings Institution"                                                                               
 8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
 9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
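
If the links without affiliation should stay in the result with NA instead of disappearing, unnest() has a keep_empty argument for that. The sketch below uses made-up data in place of the scraped list, and the names dropped and kept are only illustrative.

```r
library(tibble)
library(tidyr)

# Made-up results: the second author has no affiliation
mylist <- list(paa1 = c("Dept A", "Dept B"), pab127 = character(0))

# Default behaviour: the empty element is dropped
dropped <- unnest(enframe(mylist), value)

# keep_empty = TRUE keeps the row and fills the value with NA
kept <- unnest(enframe(mylist), value, keep_empty = TRUE)

dropped
kept
```

In kept, pab127 appears once with a missing value, so every scraped link has at least one row.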
