Web Scraping of Key Stats in Yahoo! Finance with R

Web scraping of key stats in Yahoo! Finance with R

I gave up on Excel a long time ago. R is definitely the way to go for things like this.

library(XML)

stocks <- c("AXP","BA","CAT","CSCO")

for (s in stocks) {
url <- paste0("http://finviz.com/quote.ashx?t=", s)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")

# ASSIGN TO STOCK NAMED DFS
assign(s, readHTMLTable(tableNodes[[9]],
header= c("data1", "data2", "data3", "data4", "data5", "data6",
"data7", "data8", "data9", "data10", "data11", "data12")))

# ADD COLUMN TO IDENTIFY STOCK
df <- get(s)
df['stock'] <- s
assign(s, df)
}

# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:ncol(stockdata)-1)]

# SAVE TO CSV
write.table(stockdata, "C:/Users/your_path_here/Desktop/MyData.csv", sep=",",
row.names=FALSE, col.names=FALSE)

# REMOVE TEMP OBJECTS
rm(df, stockdatalist)

Webscraping with Yahoo Finance

I rewrote the answer to get only the fundamental data. First of, instead of a loop I put your scrape request into a function. Next I wrote an error catcher function loosely based on possibly function from purrr. This to be able to return a function instead of a default value. Then you can use map_df to loop over all the ticker symbols. Whenever there is an error, the data will be NA but will show the ticker and fill in an error column.

If speed is an issue, you might look into the furrr package to be able to run all of this in parallel.

library(rvest)
library(purrr)
library(dplyr)

get_stats <- function(symbol) {
url <- paste0("https://finance.yahoo.com/quote/",symbol,"/key-statistics?p=", symbol)
df <- url %>%
read_html() %>%
html_table(header = FALSE) %>%
map_df(bind_cols) %>%
as_tibble()

names(df) <- c("valuation_measures", "value")
df["stock"] <- symbol

return(df)
}

catch_error <- function(.f, otherwise=NULL) {
function(...) {
tryCatch({
.f(...)
}, error = function(e) otherwise(...))
}
}

tickers <- c("xxxxxx", "AAPL")

out <- map_df(tickers, catch_error(get_stats, otherwise = function(x) tibble(valuation_measures = NA_character_, value = NA_character_, stock = x, error = "error in getting data")))

# A tibble: 60 x 4
valuation_measures value stock error
<chr> <chr> <chr> <chr>
1 NA NA xxxxxx error in getting data
2 Market Cap (intraday) 5 1.22T AAPL NA
3 Enterprise Value 3 1.23T AAPL NA
4 Trailing P/E 22.07 AAPL NA
5 Forward P/E 1 17.81 AAPL NA
6 PEG Ratio (5 yr expected) 1 1.52 AAPL NA
7 Price/Sales (ttm) 4.54 AAPL NA
8 Price/Book (mrq) 13.61 AAPL NA
9 Enterprise Value/Revenue 3 4.58 AAPL NA
10 Enterprise Value/EBITDA 6 15.69 AAPL NA
# ... with 50 more rows

R: web scraping yahoo.finance after 2019 change

As mentioned in the comment above, here is an alternative that tries to deal with the different table sizes published. I have worked on this and have had help from a friend.

library(rvest)
library(tidyverse)

url <- https://finance.yahoo.com/quote/AAPL/financials?p=AAPL

# Download the data
raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")

number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()

if(number_of_columns > 1){
# Create empty data frame with the required dimentions
df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
stringsAsFactors = F)

# Fill the table looping through rows
for (i in 1:length(raw_table)) {
# Find the row name and set it.
df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
# Now grab the values
row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
for (j in 1:(number_of_columns - 1)) {
df[i, j+1] <- row_values[j] %>% html_text()
}
}
view(df)

In R Studio how would I pull a specific piece of data from Yahoo Finance

You can try this.

If you try to call Book Value with an additional value, it is working. However, when you call only Book Value, the function is giving an error. You can post this bug to the package provider's GitHub. Here is the link

The following producer might work for you for now.


getQuote("AAPL",what=yahooQF(c("Open","Book Value")))

>
Trade Time Open Book Value
AAPL 2021-04-30 16:00:04 131.78 4.146

I just requested additional Open price information. With that, API worked.

Extracting data from Yahoo Finance

There is actually a link to a downloadable csv on the page. You can specify the company and time period etc, in R prior to downloading it.

The only problem is that the url needs to include a parameter called "crumbs" which is effectively your session ID, and you'll need to harvest this from the page to construct the url.

I have written a function to do all this. All you need to do is supply the company name and it will give you all financial records in the last 15 years or so as a data frame.

  get_financial_data <- function(company)
{
tmp <- tempfile()
yahoo <- httr::handle('https://finance.yahoo.com/quote/')

'https://finance.yahoo.com/quote/' %>%
paste0(company) %>%
paste0('/history?p=') %>%
paste0(company) %>%
httr::GET(handle = yahoo) %>%
httr::content("text") -> raw_html

strsplit(raw_html, "crumb\":\"") %>%
unlist %>%
strsplit("\"") %>%
lapply(`[`, 1) %>%
unlist %>%
table %>%
sort %>%
rev %>%
names %>%
`[`(2) %>%
paste0("https://query1.finance.yahoo.com/v7/finance/download/", company,
"?period1=1079873500&period2=1579873500&interval=1d&events=history&crumb=", .) %>%
httr::GET(handle = yahoo) %>%
httr::content("text", encoding = "UTF-8") %>%
writeBin(tmp, useBytes = T)

suppressWarnings(read.csv(tmp) %>% tibble::as_tibble())
}

Now you can just do:

get_financial_data("SAP")
#> # A tibble: 3,988 x 7
#> Date Open High Low Close Adj.Close Volume
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 2004-03-22 37.4 37.5 36.7 37.3 29.3 1717600
#> 2 2004-03-23 37.8 37.9 37.4 37.5 29.5 1417500
#> 3 2004-03-24 37.4 38.1 37.2 37.5 29.5 1682500
#> 4 2004-03-25 37.9 38.9 37.9 38.8 30.4 2233800
#> 5 2004-03-26 37.9 38.2 37.5 37.9 29.8 3343500
#> 6 2004-03-29 38.6 39.3 38.6 38.8 30.5 1886800
#> 7 2004-03-30 38.8 39.3 38.8 39.1 30.7 997900
#> 8 2004-03-31 39.5 39.8 39.3 39.3 30.9 1818600
#> 9 2004-04-01 39.8 40.6 39.8 40.6 31.9 2570600
#> 10 2004-04-02 40.9 41.3 40.6 41.1 32.3 1708200
#> # ... with 3,978 more rows

Adjusting Yahoo Stock Data Web Scraping to Loop over Dates

Consider mapply or its non-simplified wrapper, Map, to iterate elementwise through the pairings of start and end dates and corresponding symbols. Also, avoid the use of assign and get and build list of data frames for final rbind at the end:

library(XML)
...
dateGroup <- data.frame(
start = c(1509519600, 1518159600, 1526799600, 1535439600, 1544079600),
end = c(1518073200, 1526713200, 1535353200, 1543993200, 1550732400)
)

# CROSS JOIN ALL SYMBOLS WITH EACH DATE PAIRING
dt_grp_sym <- merge(dateGroup, data.frame(symbols))

# DEFINED METHOD FOR HTML PROCESSING
proc_html <- function(sym, sd, ed) {
url <- paste0('https://finance.yahoo.com/quote/', sym, '/history?period1=',
sd, '&period2=', ed, '&interval=1d&filter=history&frequency=1d')
print(url)

webpage <- readLines(url, warn=FALSE)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")

html_df <- transform(readHTMLTable(tableNodes[[1]],
header=c("Date", "Open", "High", "Low",
"Close", "Adj. Close", "Volume")),
symbol = sym)
return(html_df)
}

# ITERATE ELEMENTWISE THROUGH EVERY ROW of dt_grp_sym
df_list <- Map(proc_html, dt_grp_sym$symbols, dt_grp_sym$start, dt_grp_sym$end)

final_df <- do.call(rbind, df_list)

To demonstrate using the Class 1 U.S. railroads:

symbols <- c("UNP", "CSX", "NSC", "CNI", "KSU")

dateGroup <- data.frame(
start = c(1509519600, 1518159600, 1526799600, 1535439600, 1544079600),
end = c(1518073200, 1526713200, 1535353200, 1543993200, 1550732400)
)
dt_grp_sym <- merge(dateGroup, data.frame(symbols))

# CALLING SAME ABOVE FUNCTION
df_list <- with(dt_grp_sym, Map(proc_html, symbols, start, end))
final_df <- do.call(rbind, df_list)

Output

by(final_df, final_df$symbol, head)

# final_df$symbol: CNI
# Date Open High Low Close Adj..Close Volume symbol
# 998 Feb 08, 2018 76.08 76.16 74.11 74.45 72.79 1,508,100 CNI
# 999 Feb 07, 2018 76.86 77.23 76.01 76.17 74.48 1,645,400 CNI
# 1000 Feb 06, 2018 76.21 77.42 74.81 77.14 75.42 2,293,300 CNI
# 1001 Feb 05, 2018 78.00 78.70 77.12 77.17 75.45 1,711,000 CNI
# 1002 Feb 02, 2018 79.17 79.24 78.17 78.46 76.71 1,331,400 CNI
# 1003 Feb 01, 2018 79.91 80.54 79.24 79.82 78.04 1,231,500 CNI
# ------------------------------------------------------------------------------
# final_df$symbol: CSX
# Date Open High Low Close Adj..Close Volume symbol
# 333 Feb 08, 2018 52.91 53.16 50.46 50.47 49.80 7,798,100 CSX
# 334 Feb 07, 2018 53.38 54.36 52.94 52.97 52.26 6,496,200 CSX
# 335 Feb 06, 2018 51.27 54.00 50.12 53.82 53.10 10,563,700 CSX
# 336 Feb 05, 2018 54.89 55.04 51.96 51.99 51.30 9,070,200 CSX
# 337 Feb 02, 2018 56.19 56.35 55.20 55.25 54.51 9,275,800 CSX
# 338 Feb 01, 2018 56.10 57.10 56.04 56.58 55.83 4,079,100 CSX
# ------------------------------------------------------------------------------
# final_df$symbol: KSU
# Date Open High Low Close Adj..Close Volume symbol
# 1330 Feb 08, 2018 107.17 107.64 103.50 103.53 102.15 1,434,600 KSU
# 1331 Feb 07, 2018 106.59 108.27 106.59 107.10 105.67 1,326,800 KSU
# 1332 Feb 06, 2018 103.11 108.02 102.07 107.32 105.89 1,459,400 KSU
# 1333 Feb 05, 2018 109.73 110.44 105.12 105.18 103.77 1,272,100 KSU
# 1334 Feb 02, 2018 112.06 112.85 110.03 110.15 108.68 1,051,900 KSU
# 1335 Feb 01, 2018 112.80 114.00 112.17 112.87 111.36 1,011,200 KSU
# ------------------------------------------------------------------------------
# final_df$symbol: NSC
# Date Open High Low Close Adj..Close Volume symbol
# 665 Feb 08, 2018 142.62 143.27 136.87 136.89 134.22 2,657,200 NSC
# 666 Feb 07, 2018 142.09 144.45 141.37 142.68 139.89 1,464,500 NSC
# 667 Feb 06, 2018 136.99 143.45 134.55 143.05 140.26 2,455,000 NSC
# 668 Feb 05, 2018 144.74 146.73 138.18 138.61 135.90 2,508,900 NSC
# 669 Feb 02, 2018 147.15 147.85 144.61 145.03 142.20 1,774,600 NSC
# 670 Feb 01, 2018 149.28 150.35 147.90 148.47 145.57 1,427,000 NSC
# ------------------------------------------------------------------------------
# final_df$symbol: UNP
# Date Open High Low Close Adj..Close Volume symbol
# 1 Feb 08, 2018 128.70 128.70 124.81 124.86 122.27 6,325,100 UNP
# 2 Feb 07, 2018 130.34 131.82 128.94 128.96 126.29 5,053,000 UNP
# 3 Feb 06, 2018 122.28 131.50 121.50 131.15 128.43 15,734,300 UNP
# 4 Feb 05, 2018 128.59 131.78 124.13 124.14 121.57 6,744,400 UNP
# 5 Feb 02, 2018 131.66 131.73 127.22 129.36 126.68 8,181,200 UNP
# 6 Feb 01, 2018 132.51 133.74 131.86 132.38 129.64 5,597,600 UNP


Related Topics



Leave a reply



Submit