Extracting html table from a website in R
Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):
library(RSelenium)
library(rvest)
# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()
# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]
# clean up
rd$close()
ptm$stop()
# parse with rvest
df <- html %>% read_html() %>%
  html_node('#ismr-event-history table.ism-table') %>%
  html_table() %>%
  setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>% # clean column names
  setNames(gsub('\\s', '_', names(.)))
str(df)
## 'data.frame': 20 obs. of 10 variables:
## $ Gameweek : chr "GW1" "GW2" "GW3" "GW4" ...
## $ Gameweek_Points : int 34 47 53 51 66 66 65 63 48 90 ...
## $ Points_Bench : int 1 6 9 7 14 2 9 3 8 2 ...
## $ Gameweek_Rank : chr "2,406,373" "2,659,789" "541,258" "905,524" ...
## $ Transfers_Made : int 0 0 2 0 3 2 2 0 2 0 ...
## $ Transfers_Cost : int 0 0 0 0 4 4 4 0 0 0 ...
## $ Overall_Points : chr "34" "81" "134" "185" ...
## $ Overall_Rank : chr "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
## $ Value : chr "£100.0" "£100.0" "£99.9" "£100.0" ...
## $ Change_Previous_Gameweek: logi NA NA NA NA NA NA ...
As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows are images, which is why the last column is all NA, but you can calculate those values anyway.
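For instance, a minimal cleaning sketch along those lines, using the df built above (the sign convention for the change column is my assumption):
library(dplyr)
library(readr)
df <- df %>%
  mutate_if(is.character, parse_number) %>% # strips "GW", commas, and "£"
  # recompute the arrow column: +1 if overall rank improved, -1 if it worsened
  mutate(Change_Previous_Gameweek = sign(lag(Overall_Rank) - Overall_Rank))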
Extract table from webpage using R
This will get all of the tables into a single data frame:
library(tidyverse)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
df <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  lapply(function(x) setNames(x, c("Rank", "Company",
                                   "Direct_premiums_written", "Market_share")))
tables <- data.frame()
for (i in seq(2, 18, 2)) {
  temp <- df[[i]]
  tables <- bind_rows(tables, temp)
}
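Equivalently, since bind_rows() accepts a list of data frames, the loop collapses to one line (a sketch using the df list built above):
tables <- bind_rows(df[seq(2, 18, 2)])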
You can then subset this however you want. For example, let's extract the information from the third table, which represents 2009:
table_2009 <- tables[21:30,] %>%
  mutate(Year = 2009)
To add all the years at once:
years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)
tables <- tables %>%
  mutate(Year = rep(years, each = 10))
Hope this helps.
How can I scrape a table from a website in R
This is a dynamic page, with the table generated by JavaScript, so rvest alone will not suffice. Nonetheless, you can get the source content from the JSON API.
library(tidyverse)
library(rvest)
library(lubridate)
library(jsonlite)
# Read static html. It won't create the table, but it holds the API key
# we need to retrieve the source JSON.
htm_obj <-
read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')
# Retrieve the API key. It is stored in a script node with JavaScript content;
# the embedded state escapes double quotes as "&q;", hence the pattern below.
str_apikey <-
  html_node(htm_obj, xpath = '//script[@id="app-root-state"]') %>%
  html_text() %>%
  gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", .)
# Create a URI pointing to the API, with the API key as the first
# key-value pair of the query
url_apijson <- paste0(
  "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
  str_apikey,
  "&units=e&startDate=20110101&endDate=20110101")
# Capture the JSON
json_obj <- fromJSON(txt = url_apijson)
# Wrangle the JSON's contents into the table you need
tbl_daily <- json_obj$observations %>%
  as_tibble() %>%
  mutate(valid_time_gmt = as_datetime(valid_time_gmt) %>%
           with_tz("America/New_York")) %>% # the timezone this airport (KDCA) is located in
  select(valid_time_gmt, temp, dewPt, rh, wdir_cardinal,
         gust, pressure, precip_hrly) # the equivalent variables of your html table
Results: A nice table
# A tibble: 34 x 8
valid_time_gmt temp dewPt rh wdir_cardinal gust pressure precip_hrly
<dttm> <int> <int> <int> <chr> <lgl> <dbl> <dbl>
1 2010-12-31 23:52:00 38 NA 79 CALM NA 30.1 NA
2 2011-01-01 00:52:00 35 31 85 CALM NA 30.1 NA
3 2011-01-01 01:52:00 36 31 82 CALM NA 30.1 NA
4 2011-01-01 02:52:00 37 31 79 CALM NA 30.1 NA
5 2011-01-01 03:52:00 36 30 79 CALM NA 30.1 NA
6 2011-01-01 04:52:00 37 30 76 NNE NA 30.1 NA
7 2011-01-01 05:52:00 36 30 79 CALM NA 30.1 NA
8 2011-01-01 06:52:00 34 30 85 CALM NA 30.1 NA
9 2011-01-01 07:52:00 37 31 79 CALM NA 30.1 NA
10 2011-01-01 08:52:00 44 38 79 CALM NA 30.1 NA
# ... with 24 more rows
Read HTML table using R directly from a website
You can't scrape the table with rvest because the data is requested from this link: https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32 with an api-key attached.
pg <- httr::GET(
  "https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32",
  config = httr::add_headers(`api-key` = "480d0aeb78bd0064d45ef6b2254be9b3")
)
data <- httr::content(pg)$data
I don't know whether the api-key will keep working in the future, but it works for now as far as I can see.
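httr::content() parses the JSON into nested lists. A minimal sketch to flatten data into a data frame, assuming each element is one flat record (verify against the actual response):
library(dplyr)
# each element of `data` is assumed to be a named list of scalar fields
df <- bind_rows(data)
str(df)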
How can I extract a specific table from a website that has multiple tables in R?
The table is commented out in the page source. You can grab the comments with XPath, then parse out the table you want:
library(rvest)
page <- read_html('https://www.basketball-reference.com/leagues/NBA_2018.html')
df <- page %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('#team-stats-per_game') %>%
  html_table()
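If you need a different table, the same trick will list every table id hidden inside the comments (a sketch reusing the page object from above):
page %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_nodes('table') %>%
  html_attr('id')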
Scraping html tables into R data frames using the XML package
…or, for a shorter approach:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
# the picked table is the longest one on the page
tables[[which.max(n.rows)]]
R rvest: extracting html table that are loaded dynamically
As suggested by @AllanCameron, we can extract the table using RSelenium and rvest. Here's a script that worked for me:
library(RSelenium)
library(rvest)
library(magrittr)
URL <- "https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20170101&end=20201113"
# Open firefox and extract source
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(URL)
html <- remDr$getPageSource()[[1]]
# Extract table from source
DF <- read_html(html) %>%
  html_nodes("table") %>%
  `[[`(3) %>%
  html_table() %>%
  data.frame()
# Close connection and stop the background Selenium server
remDr$close()
rD$server$stop()
How to write rscript to extract URL from HTML table
The links come from an additional GET request made by the browser, which returns XML. You can still go with rvest: grab the key nodes, then complete the URLs.
library(rvest)
base_url <- "https://divvy-tripdata.s3.amazonaws.com"
files <- read_html(base_url) |>
  html_elements('key') |>
  html_text() |>
  url_absolute(base_url)
For older R versions, swap |> for %>% and add library(magrittr) to the imports.
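As a usage example, here's a sketch that downloads the first listed archive into the working directory (assuming, as with this bucket, that the keys point at downloadable zip files):
# `files` comes from the snippet above
download.file(files[1], destfile = basename(files[1]), mode = "wb")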
Scrape HTML table with R
This gets you the table you are looking for:
library(rvest)
# mrkt_snpshot is the parsed page, e.g. mrkt_snpshot <- read_html(url)
mrkt_snpshot %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  .[[24]]