Inputting NA where there are missing values when scraping with rvest
The simplest way is to select a node that encloses both of the nodes you want for each row, then iterate over those enclosing nodes, pulling out both of the nodes you want at once. purrr::map_df is handy not only for iterating, but also for combining the results into a nice tibble:
library(rvest)
library(purrr)
url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14"
page <- read_html(url)
df <- page %>%
    html_nodes('article') %>%    # select enclosing nodes
    # iterate over each, pulling out desired parts and coerce to data.frame
    map_df(~list(title = html_nodes(.x, 'h3 a') %>%
                     html_text() %>%
                     {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
                 length = html_nodes(.x, '.tile .caption') %>%
                     html_text() %>%
                     {if(length(.) == 0) NA else .}))
df
#> # A tibble: 12 x 2
#> title length
#> <chr> <chr>
#> 1 Introduction to Natural Language Processing with R II 01:15:00
#> 2 Introduction to Natural Language Processing with R 01:22:13
#> 3 Solving iteration problems with purrr II 01:22:49
#> 4 Solving iteration problems with purrr 01:32:23
#> 5 Markov-Switching GARCH Models in R: The MSGARCH Package 15:55
#> 6 Interactive bullwhip effect exploration using SCperf and Shiny 16:02
#> 7 Actuarial and statistical aspects of reinsurance in R 14:15
#> 8 Transformation Forests 16:19
#> 9 Room 2.02 Lightning Talks 50:35
#> 10 R and Haskell: Combining the best of two worlds 14:45
#> 11 *GNU R* on a Programmable Logic Controller (PLC) in an Embedded-Linux Environment <NA>
#> 12 Performance Benchmarking of the R Programming Environment on Knight's Landing 19:32
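The same pattern can be tried without a network connection. The sketch below is only an illustration: the HTML string and the `talks` name are made up, and it uses html_node (singular) instead of the length check, relying on the fact that html_node returns a missing node when nothing matches and html_text turns that into NA.

```r
library(rvest)
library(purrr)

# Hypothetical page fragment: the second <article> has no caption
page <- read_html('<article><h3><a>Talk A</a></h3><div class="tile"><span class="caption">15:55</span></div></article>
  <article><h3><a>Talk B</a></h3></article>
  <article><h3><a>Talk C</a></h3><div class="tile"><span class="caption">19:32</span></div></article>')

talks <- page %>%
    html_nodes("article") %>%    # select enclosing nodes
    map_df(~ list(
        # html_node() yields exactly one result per article, missing or not
        title  = html_node(.x, "h3 a") %>% html_text(),
        length = html_node(.x, ".tile .caption") %>% html_text()
    ))

talks
```

The article without a caption keeps its row, with NA in the length column.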
Rvest scraping child nodes but filling missing values with NA
If I simply use httr, I can pass in a valid User-Agent header. I can then rewrite your code to use a data.frame call instead of list, so that NA is returned where a value is not present.
Swap out html_elements for html_element inside the iteration, so that each parent node yields exactly one result per column.
You also need to amend your XPaths so they are relative to the current node (starting with .//); otherwise the first node's value is repeated for each row.
library(tidyverse)
library(rvest)    # not attached by library(tidyverse); provides html_element(s)
library(httr)
headers <- c("User-Agent" = "Safari/537.36")
r <- httr::GET(url = "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml", httr::add_headers(.headers = headers))
r %>%
    content() %>%
    html_elements(xpath = "//*[local-name()='infoTable']") %>%    # select enclosing nodes
    # iterate over each parent node, pulling out desired parts and coerce to data.frame
    # not the complete list
    map_df(
        ~ data.frame(
            name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>%
                html_text(),
            title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>%
                html_text(),
            put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>%
                html_text()
        )
    )
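To see why the relative .// prefix matters, here is a small offline sketch. The XML string, its namespace URL and the `holdings` and `repeated` names are invented for illustration; only the element names come from the code above.

```r
library(rvest)
library(xml2)
library(purrr)

# Hypothetical namespaced document: the second infoTable has no putCall
doc <- read_xml('<informationTable xmlns="http://example.com/ns">
  <infoTable><nameOfIssuer>Issuer A</nameOfIssuer><putCall>Put</putCall></infoTable>
  <infoTable><nameOfIssuer>Issuer B</nameOfIssuer></infoTable>
</informationTable>')

rows <- html_elements(doc, xpath = "//*[local-name()='infoTable']")

# Relative XPath (.//): searches inside each parent node only
holdings <- map_df(rows, ~ data.frame(
    name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>% html_text(),
    put_or_call    = html_element(.x, xpath = ".//*[local-name()='putCall']") %>% html_text()
))

# Absolute XPath (//): searches the whole document from every parent,
# so the first issuer is returned for each row
repeated <- map_chr(rows, ~ html_element(.x, xpath = "//*[local-name()='nameOfIssuer']") %>% html_text())
```

Here `holdings` has one row per infoTable with NA for the missing putCall, while `repeated` contains the first issuer twice.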
Placeholder NA for missing value with rvest - with XPath
Your xpath should be header/h3/a, not /header/h3/a. The leading slash would imply you want to start at the root of the tree again, not the current node.
xx <- read_html("https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14")
xx %>% html_nodes(xpath = "/html/body/main/section[2]/div/article") %>%
    map_df(~list(title = html_nodes(.x, xpath = 'header/h3/a') %>%
                     html_text() %>% {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
                 length = html_nodes(.x, xpath = 'a/time') %>%
                     html_text() %>% {if(length(.) == 0) NA else .}))
# title length
# <chr> <chr>
# 1 " Introduction to Natural Language Processing with R II" 01:15:00
# 2 " Introduction to Natural Language Processing with R" 01:22:13
# 3 " Solving iteration problems with purrr II" 01:22:49
# 4 " Solving iteration problems with purrr" 01:32:23
# 5 Markov-Switching GARCH Models in R: The MSGARCH Package 15:55
# 6 Interactive bullwhip effect exploration using SCperf and Shiny 16:02
# 7 Actuarial and statistical aspects of reinsurance in R 14:15
# 8 Transformation Forests 16:19
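The effect of the leading slash can be checked on a tiny made-up fragment (the HTML string and variable names here are assumptions for illustration, not taken from the real page):

```r
library(rvest)

# Hypothetical fragment with a single article
page <- read_html('<main><article><header><h3><a>Talk A</a></h3></header></article></main>')
article <- html_node(page, "article")

# Relative path: evaluated from the current <article> node
relative <- html_nodes(article, xpath = "header/h3/a") %>% html_text()

# Absolute path: evaluated from the document root, where the top-level
# element is <html>, so nothing matches
absolute <- html_nodes(article, xpath = "/header/h3/a") %>% html_text()
```

`relative` holds the title, whereas `absolute` is empty, which is what triggers the NA replacement for every row.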
How to write NA for missing results in rvest if there was no content in node (within loop) further how to merge variable with results
I solved the first problem: empty nodes (when "i" has not been found on the Yahoo page) are now stored as NA.
Here is the code:
library(rvest)
# teams
firmen <- c(read.table("Mappe1.txt"))
# init
df <- NULL
table <- NULL
# loop
for(i in firmen){
    # find url
    url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
    page <- read_html(url, as = "text")
    # grab ticker from yahoo finance
    table <- page %>%
        # html_node() (singular) returns a missing node when nothing matches,
        # so every company contributes exactly one row
        html_node(xpath = "//*[@id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
        html_text(trim = TRUE) %>%            # a missing node becomes NA
        {replace(., !nzchar(.), NA)} %>%      # blank strings also become NA
        as.data.frame()
    # bind to dataframe
    df <- rbind(df, table)
}
Now there is just one question left:
How can I merge df and firmen into one table with the columns "tickers" (= df) and "firmen" (= firmen)?
df has just one column, named ".", holding the results, while the list firmen contains the companies spread across many columns but in just one row.
Basically I need to transform the list firmen, but I don't know how.
Thank you for the help.
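One possible way to combine the two, sketched with made-up stand-ins for firmen and df (the company names and tickers below are hypothetical). It assumes df ended up with exactly one row per company, in the same order as firmen, which only holds if every loop iteration adds a row (NA when nothing was found).

```r
# Hypothetical stand-ins for the objects built in the loop above
firmen <- list(V1 = "Adidas", V2 = "BMW", V3 = "Unknown Corp")
df <- setNames(data.frame(c("ADS.DE", "BMW.DE", NA), stringsAsFactors = FALSE), ".")

# Flatten the one-row list into a plain character vector,
# then pair it with the single result column of df
result <- data.frame(
    firmen  = as.character(unlist(firmen, use.names = FALSE)),
    tickers = df[[1]],
    stringsAsFactors = FALSE
)

result
```

If the two lengths differ, data.frame() will either fail or silently recycle the shorter vector, so it is worth checking that nrow(df) equals length(unlist(firmen)) first.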
Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?
You need to change your fourth line. You want metascore to have as many elements as title, with NA for those titles that don't have a metascore listed. The way to do this is to extract the item-content nodes, and then, from each of these, to select the ratings-metascore node if it exists, or NA if it doesn't. See ?html_nodes for the difference between html_node and html_nodes. I've also added span to ensure that just the number is returned, without the following word 'metascore'.
imdb <- read_html("http://www.imdb.com/search/title?genres=horror&genres=mystery&sort=moviemeter,asc&view=advanced")
title <- html_nodes(imdb, '.lister-item-header a')
title <- html_text(title)
metascore <- html_node(html_nodes(imdb, '.lister-item-content'), '.ratings-metascore span')
metascore <- html_text(metascore)
df <- data.frame(Title = title, Metascore = metascore)
head(df,10)
Title Metascore
1 Mother! <NA>
2 Annabelle: Creation 62
3 Stranger Things <NA>
4 Supernatural <NA>
5 It <NA>
6 The Vampire Diaries <NA>
7 Get Out 84
8 The Originals <NA>
9 Annabelle 37
10 Grimm <NA>
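The difference between the two functions is easiest to see on a small made-up fragment. The HTML below is hypothetical and only borrows the class names used above:

```r
library(rvest)

# Hypothetical listing: the second item has no metascore
page <- read_html('<div class="lister-item-content"><h3 class="lister-item-header"><a>Film A</a></h3><div class="ratings-metascore"><span>62</span> Metascore</div></div>
  <div class="lister-item-content"><h3 class="lister-item-header"><a>Film B</a></h3></div>
  <div class="lister-item-content"><h3 class="lister-item-header"><a>Film C</a></h3><div class="ratings-metascore"><span>84</span> Metascore</div></div>')

items <- html_nodes(page, ".lister-item-content")

# html_nodes() returns only the matches that exist, so the result is too short
all_scores <- html_nodes(page, ".ratings-metascore span") %>% html_text()

# html_node() returns one result per item, with NA where nothing matched
per_item <- html_node(items, ".ratings-metascore span") %>% html_text()
```

`all_scores` has two elements and cannot be lined up with three titles, while `per_item` has three, with NA in the second position.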
How to get rvest or sapply to skip NA values?
As far as I can see from the target links, you can try the following. First, scrape all links from https://ideas.repec.org/e/ and build the complete URLs. Then check whether each link exists. (There are about 26,000 links under this URL, and I did not have time to check them all, so I used only 100 URLs in the following demonstration.) Finally, extract all existing links.
library(rvest)
library(httr)
library(tidyverse)
# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
    html_nodes("td") %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    .[grepl(x = ., pattern = "html")] -> x
# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")
# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]
# Check if each URL exists or not. http_error() returns FALSE when a link exists.
foo <- sapply(mylinks_samples, http_error)
# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract affiliation information. There are several spots with h3, so I specifically targeted the h3 elements inside the node with id = "affiliation". If there is no affiliation information, R returns character(0). When enframe() and unnest() are applied, these empty elements are dropped (by unnest()). For instance, pab127 does not have any affiliation information, so there is no entry for this link.
lapply(urls, function(x){
    read_html(x, encoding = "UTF-8") %>%
        html_nodes(xpath = '//*[@id="affiliation"]') %>%
        html_nodes("h3") %>%
        html_text() %>%
        trimws() -> foo
    return(foo)}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
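If you would rather keep the links without affiliation as NA rows instead of dropping them, tidyr's unnest() has a keep_empty argument. A minimal sketch with a made-up stand-in for mylist (the affiliation strings are hypothetical):

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for mylist: pab127 has no affiliation
mylist <- list(paa1 = c("Dept A", "Dept B"), pab127 = character(0), paa2 = "Dept C")

# Default behaviour: the empty element disappears
dropped <- unnest(enframe(mylist), value)

# keep_empty = TRUE keeps one row for it, filled with NA
kept <- unnest(enframe(mylist), value, keep_empty = TRUE)

kept
```

With keep_empty = TRUE, pab127 appears once with NA in the value column.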