Inputting NA Where There Are Missing Values When Scraping with rvest

The simplest way is to select a node that encloses both of the nodes you want for each row, then iterate over them, pulling out both of the nodes you want at once. purrr::map_df is handy for not only iterating, but even combining the results into a nice tibble:


library(rvest)
library(purrr)

url <- ""

page <- read_html(url)

df <- page %>%
  html_nodes('article') %>%    # select enclosing nodes
  # iterate over each, pulling out desired parts and coerce to data.frame
  map_df(~list(title = html_nodes(.x, 'h3 a') %>%
                 html_text() %>%
                 {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
               length = html_nodes(.x, '.tile .caption') %>%
                 html_text() %>%
                 {if(length(.) == 0) NA else .}))

#> # A tibble: 12 x 2
#>    title                                                                             length
#>    <chr>                                                                             <chr>
#>  1 Introduction to Natural Language Processing with R II                             01:15:00
#>  2 Introduction to Natural Language Processing with R                                01:22:13
#>  3 Solving iteration problems with purrr II                                          01:22:49
#>  4 Solving iteration problems with purrr                                             01:32:23
#>  5 Markov-Switching GARCH Models in R: The MSGARCH Package                           15:55
#>  6 Interactive bullwhip effect exploration using SCperf and Shiny                    16:02
#>  7 Actuarial and statistical aspects of reinsurance in R                             14:15
#>  8 Transformation Forests                                                            16:19
#>  9 Room 2.02 Lightning Talks                                                         50:35
#> 10 R and Haskell: Combining the best of two worlds                                   14:45
#> 11 *GNU R* on a Programmable Logic Controller (PLC) in an Embedded-Linux Environment <NA>
#> 12 Performance Benchmarking of the R Programming Environment on Knight's Landing     19:32
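The same pattern can be tried without a network connection. The following is only a minimal sketch on made-up HTML: the markup, the titles and the variable names are assumptions, not the page scraped above. It uses NA_character_ rather than a bare NA so that the column type stays character whichever rows are missing.

```r
library(rvest)
library(purrr)

# Hypothetical markup: the second article has no caption
page <- read_html('<html><body>
  <article><h3><a>Talk A</a></h3><div class="tile"><span class="caption">10:00</span></div></article>
  <article><h3><a>Talk B</a></h3></article>
</body></html>')

df <- page %>%
  html_nodes("article") %>%    # one enclosing node per row
  map_df(~list(
    title  = html_nodes(.x, "h3 a") %>%
      html_text() %>%
      {if (length(.) == 0) NA_character_ else .},
    length = html_nodes(.x, ".tile .caption") %>%
      html_text() %>%
      {if (length(.) == 0) NA_character_ else .}
  ))

df
```

Because each article is processed on its own, a missing caption produces an NA in that row instead of shifting the later values up.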

Rvest scraping child nodes but filling missing values with NA

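If I use httr, I can pass in a valid User-Agent header. Rewriting your code to use a data.frame call instead of list then lets it return NA where a value is not present.

Inside the iteration, swap out html_elements for html_element, which returns exactly one result per parent node (a missing node, and therefore NA from html_text, when nothing matches).

You also need to amend your XPaths so that they start with .// and are evaluated relative to the current node; otherwise the first node's value is repeated for each row.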


library(rvest)
library(purrr)

headers <- c("User-Agent" = "Safari/537.36")

r <- httr::GET(url = "", httr::add_headers(.headers = headers))

r %>%
  httr::content() %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(~ data.frame(
    name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>% html_text(),
    title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>% html_text(),
    put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>% html_text()
  ))
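To see the effect of html_element and the relative .// XPaths in isolation, here is a small sketch on an inline XML string. The document, its namespace URI and the issuer names are invented for illustration; only the element names are taken from the code above.

```r
library(rvest)
library(purrr)

# Hypothetical XML with a default namespace; the second entry has no putCall
doc <- xml2::read_xml('<informationTable xmlns="http://example.com/ns">
  <infoTable><nameOfIssuer>ACME</nameOfIssuer><putCall>Put</putCall></infoTable>
  <infoTable><nameOfIssuer>Globex</nameOfIssuer></infoTable>
</informationTable>')

res <- doc %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>%    # enclosing nodes
  map_df(~ data.frame(
    # ".//" keeps the search inside the current infoTable node
    name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>% html_text(),
    put_or_call    = html_element(.x, xpath = ".//*[local-name()='putCall']") %>% html_text()
  ))

res
```

Matching on local-name() sidesteps the namespace, and the missing putCall in the second entry comes back as NA rather than dropping the row.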

Placeholder NA for missing value with rvest - with XPath

Your xpath should be header/h3/a, not /header/h3/a. The leading slash would imply you want to start at the root of the tree again, not the current node.

library(rvest)
library(purrr)

xx <- read_html("")
xx %>% html_nodes(xpath = "/html/body/main/section[2]/div/article") %>%
  map_df(~list(title = html_nodes(.x, xpath = 'header/h3/a') %>%
                 html_text() %>% {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
               length = html_nodes(.x, xpath = 'a/time') %>%
                 html_text() %>% {if(length(.) == 0) NA else .}))

#   title                                                          length
#   <chr>                                                          <chr>
# 1 " Introduction to Natural Language Processing with R II"       01:15:00
# 2 " Introduction to Natural Language Processing with R"          01:22:13
# 3 " Solving iteration problems with purrr II"                    01:22:49
# 4 " Solving iteration problems with purrr"                       01:32:23
# 5 Markov-Switching GARCH Models in R: The MSGARCH Package        15:55
# 6 Interactive bullwhip effect exploration using SCperf and Shiny 16:02
# 7 Actuarial and statistical aspects of reinsurance in R          14:15
# 8 Transformation Forests                                         16:19
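The difference between a relative and an absolute XPath can be checked on a toy document. This is a sketch with invented markup; the object names (second, relative, absolute, anywhere) are only for illustration.

```r
library(rvest)

# Hypothetical page with two articles
page <- read_html('<html><body>
  <article><header><h3><a>First</a></h3></header></article>
  <article><header><h3><a>Second</a></h3></header></article>
</body></html>')

second <- html_nodes(page, "article")[[2]]

# Relative path: evaluated from the current article
relative <- html_nodes(second, xpath = "header/h3/a") %>% html_text()

# Leading slash: evaluated from the document root, where there is no <header>
absolute <- html_nodes(second, xpath = "/header/h3/a") %>% html_text()

# Leading double slash: searches the whole document, not just this article
anywhere <- html_nodes(second, xpath = "//h3/a") %>% html_text()
```

Here relative holds only the second title, absolute is empty, and anywhere contains both titles even though the search started from the second article.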

How to write NA for missing results in rvest if there was no content in node (within loop), and how to merge a variable with the results

I solved the first problem: empty nodes (when "i" is not found on the Yahoo page) are now displayed as "NA".

Here is the code:


library(rvest)

# teams
firmen <- c(read.table("Mappe1.txt"))

# init
df <- NULL
table <- NULL

# loop
for(i in firmen){
  # find url
  url <- paste0("", i, "/")
  page <- read_html(url)
  # grab ticker from yahoo finance
  table <- page %>%
    html_nodes(xpath = "//*[@id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
    html_text(trim=TRUE) %>%
    {if(length(.) == 0 || !nzchar(.[1])) NA else .[1]}    # NA when the node is missing or empty

  # bind to dataframe
  df <- rbind(df, table)
}

Now there is just one question left:

How can I merge "df" and "firmen" into one table which has the columns

"tickers" = df and "firmen" = firmen?

df has just one column, named ".", with the results, while the list firmen contains a number of companies placed in many columns but with just one row.

Basically I need to transform the list "firmen", but I don't know how.

Thank you for the help.
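One possible way to combine the two objects is to flatten both into plain vectors and build a data frame from them. This is a sketch under assumptions: the company names and tickers are invented stand-ins, and it relies on the loop having produced exactly one ticker (or NA) per company, in the same order as firmen.

```r
# Hypothetical stand-ins for the objects described above
firmen <- list(V1 = "Company A", V2 = "Company B", V3 = "Company C")  # one row, many columns
df <- rbind(NULL, "AAA", "BBB", NA)                                    # one column of results

result <- data.frame(
  firmen  = as.character(unlist(firmen, use.names = FALSE)),  # list -> character vector
  tickers = as.vector(df),                                    # 1-column matrix -> vector
  stringsAsFactors = FALSE
)

result
```

unlist() turns the one-row list into a vector with one element per company, which lines up with the rows that rbind() accumulated in the loop.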

Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?

You need to change your fourth line. You want metascore to have as many elements as title, with NA for those titles that don't have a metascore listed. The way to do this is to extract the item-content nodes, and then, from each of these, to select the ratings-metascore node if it exists, or NA if it doesn't. See ?html_nodes for the difference between html_node and html_nodes. I've also added span to ensure that just the number is returned, without the following word 'metascore'.

library(rvest)

imdb <- read_html(",asc&view=advanced")
title <- html_nodes(imdb, '.lister-item-header a')
title <- html_text(title)
metascore <- html_node(html_nodes(imdb, '.lister-item-content'), '.ratings-metascore span')
metascore <- html_text(metascore)
df <- data.frame(Title = title, Metascore = metascore)

                 Title Metascore
1              Mother!      <NA>
2  Annabelle: Creation        62
3      Stranger Things      <NA>
4         Supernatural      <NA>
5                   It      <NA>
6  The Vampire Diaries      <NA>
7              Get Out        84
8        The Originals      <NA>
9            Annabelle        37
10               Grimm      <NA>
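The difference between html_node and html_nodes on a set of parent nodes can be seen on a small made-up page. The markup below only imitates the class names used above; the film titles and scores are assumptions for illustration.

```r
library(rvest)

# Hypothetical listing: the second item has no metascore
page <- read_html('<html><body>
  <div class="lister-item-content"><h3 class="lister-item-header"><a>Film A</a></h3>
    <div class="ratings-metascore"><span>62</span> Metascore</div></div>
  <div class="lister-item-content"><h3 class="lister-item-header"><a>Film B</a></h3></div>
</body></html>')

items <- html_nodes(page, ".lister-item-content")

# html_nodes() pools all matches, so the result is shorter than the item list
scores_all <- html_nodes(items, ".ratings-metascore span") %>% html_text()

# html_node() returns one result per item, NA where nothing matched
scores_each <- html_node(items, ".ratings-metascore span") %>% html_text()

length(scores_all)   # fewer entries than items
length(scores_each)  # same length as items
```

Only scores_each can be placed next to the titles in a data frame, because its length always equals the number of items.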

How to get rvest or sapply to skip NA values?

Looking at the target links, you can try the following. First, scrape all links from the page and build the complete URLs. Then check whether each link exists. (There are about 26000 links at this URL, and I did not have time to check them all, so I used only 100 URLs in the following demonstration.) Finally, extract the links that exist.


library(rvest)
library(httr)    # http_error()
library(tibble)  # enframe()

# Get all possible links from this webpage. There are 26665 links.

read_html("") %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  .[grepl(x = ., pattern = "html")] -> x

# Create complete URLs.
mylinks1 <- paste("", x, sep = "")

# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]

# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_samples, http_error)

# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]

Then, for each link, I tried to extract the affiliation information. There are several spots with h3, so I targeted only the h3 elements inside the node with id = "affiliation". If there is no affiliation information, R returns character(0). When the list is turned into a long data frame, these empty elements are dropped. For instance, pab127 does not have any affiliation information, so there is no entry for this link.

lapply(urls, function(x){

  read_html(x, encoding = "UTF-8") %>%
    html_nodes(xpath = '//*[@id="affiliation"]') %>%
    html_nodes("h3") %>%
    html_text() %>%
    trimws() -> foo

  return(foo)}) -> mylist

Then, I assigned names to mylist with the links and created a data frame.

names(mylist) <- sub(x = basename(urls), pattern = "\\.html$", replacement = "")

enframe(mylist) %>% tidyr::unnest(value)

   name  value
   <chr> <chr>
 1 paa1  "(80%) Institutt for ØkonomiUniversitetet i Bergen"
 2 paa1  "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
 3 paa2  "Department of EconomicsCollege of BusinessUniversity of Wyoming"
 4 paa6  "Statistisk SentralbyråGovernment of Norway"
 5 paa8  "Centraal Planbureau (CPB)Government of the Netherlands"
 6 paa9  "(79%) Economic StudiesBrookings Institution"
 7 paa9  "(21%) Brookings Institution"
 8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
 9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
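If the links without affiliation information should appear as NA rather than disappear, tidyr's unnest() has a keep_empty argument that might help. The following is a sketch on a hand-made list; the list contents are invented and only the pab127 name is taken from the text above.

```r
library(tibble)
library(tidyr)

# Hypothetical result list: the second page had no affiliation
mylist <- list(paa1 = c("Dept A", "Dept B"), pab127 = character(0))

# Default behaviour: the empty element is dropped
dropped <- unnest(enframe(mylist), value)

# keep_empty = TRUE keeps it as a single row with a missing value
kept <- unnest(enframe(mylist), value, keep_empty = TRUE)

kept
```

With keep_empty = TRUE every link keeps at least one row, so the result can still be joined back to the full list of URLs.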

