Extracting HTML Table from a Website in R

Extracting html table from a website in R

Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):

library(RSelenium)
library(rvest)

# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()

# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]

# clean up
rd$close()
ptm$stop()

# parse with rvest
df <- html %>%
  read_html() %>%
  html_node('#ismr-event-history table.ism-table') %>%
  html_table() %>%
  setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%  # clean column names
  setNames(gsub('\\s', '_', names(.)))

str(df)
## 'data.frame':  20 obs. of  10 variables:
##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...

As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows in the last column are images, which is why it comes through as all NA, but you can calculate those values anyway.
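For instance, a minimal sketch of that cleanup. readr's parse_number() strips commas and currency symbols; note it also reduces "GW1" to 1, so set Gameweek aside first if you want to keep the original labels:

library(tidyverse)

df_clean <- df %>%
  mutate_if(is.character, parse_number)
# "2,406,373" -> 2406373, "£100.0" -> 100, "GW1" -> 1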

Extract table from webpage using R

This will get all of the tables into a single data frame:

library(tidyverse)
library(rvest)

url <- "https://www.iii.org/table-archive/21110"

df <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  lapply(., function(x) setNames(x, c("Rank", "Company",
                                      "Direct_premiums_written", "Market_share")))

tables <- data.frame()

for (i in seq(2, 18, 2)) {
  temp <- df[[i]]
  tables <- bind_rows(tables, temp)
}
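As a side note, since df is already a list of identically named data frames, the loop above can be collapsed into a single call:

# equivalent to the loop above: bind the even-numbered tables in one go
tables <- bind_rows(df[seq(2, 18, 2)])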

You can then subset this however you want. For example, let's extract the rows from the third table, which represents 2009:

table_2009 <- tables[21:30, ] %>%
  mutate(Year = 2009)

To add all the years at once:

years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)
tables <- tables %>%
  mutate(Year = rep(years, each = 10))

Hope this helps.

How can I scrape a table from a website in R

This is a dynamic page, with the table generated by JavaScript, so rvest alone will not suffice. Nonetheless, you can get the source content from the JSON API.

library(tidyverse)
library(rvest)
library(lubridate)
library(jsonlite)

# Read the static html. It won't contain the rendered table, but it holds
# the API key we need to retrieve the source JSON.
htm_obj <-
  read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')

# Retrieve the API key. This key is stored in a node with javascript content.
str_apikey <-
  html_node(htm_obj, xpath = '//script[@id="app-root-state"]') %>%
  html_text() %>%
  gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", .)

# Create a URI pointing to the API, with the API key as the first
# key-value pair of the query
url_apijson <- paste0(
  "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
  str_apikey,
  "&units=e&startDate=20110101&endDate=20110101")

# Capture the JSON
json_obj <- fromJSON(txt = url_apijson)

# Wrangle the JSON's contents into the table you need
tbl_daily <-
  json_obj$observations %>%
  as_tibble() %>%
  mutate(valid_time_gmt = as_datetime(valid_time_gmt) %>%
           with_tz("America/New_York")) %>%  # the timezone this airport (KDCA) is in
  select(valid_time_gmt, temp, dewPt, rh, wdir_cardinal, gust,
         pressure, precip_hrly)  # the equivalent variables of your html table

Results: A nice table

# A tibble: 34 x 8
   valid_time_gmt       temp dewPt    rh wdir_cardinal gust  pressure precip_hrly
   <dttm>              <int> <int> <int> <chr>         <lgl>    <dbl>       <dbl>
 1 2010-12-31 23:52:00    38    NA    79 CALM          NA        30.1          NA
 2 2011-01-01 00:52:00    35    31    85 CALM          NA        30.1          NA
 3 2011-01-01 01:52:00    36    31    82 CALM          NA        30.1          NA
 4 2011-01-01 02:52:00    37    31    79 CALM          NA        30.1          NA
 5 2011-01-01 03:52:00    36    30    79 CALM          NA        30.1          NA
 6 2011-01-01 04:52:00    37    30    76 NNE           NA        30.1          NA
 7 2011-01-01 05:52:00    36    30    79 CALM          NA        30.1          NA
 8 2011-01-01 06:52:00    34    30    85 CALM          NA        30.1          NA
 9 2011-01-01 07:52:00    37    31    79 CALM          NA        30.1          NA
10 2011-01-01 08:52:00    44    38    79 CALM          NA        30.1          NA
# ... with 24 more rows
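If you need more than a single day, the same request pattern parameterizes cleanly. Here's a sketch, where get_daily_obs() is a hypothetical helper that assumes the endpoint accepts any YYYYMMDD range, reusing str_apikey from above:

# hypothetical helper: fetch one day's observations for KDCA
get_daily_obs <- function(date, apikey) {
  d <- format(as.Date(date), "%Y%m%d")
  url <- paste0(
    "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json",
    "?apiKey=", apikey, "&units=e&startDate=", d, "&endDate=", d)
  fromJSON(txt = url)$observations %>% as_tibble()
}

# e.g. get_daily_obs("2011-01-02", str_apikey)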

Read HTML table using R directly from a website

You can't scrape the table with rvest because the data is requested from this endpoint:
https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32, with an api-key attached.

pg <- httr::GET(
  "https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32",
  config = httr::add_headers(`api-key` = "480d0aeb78bd0064d45ef6b2254be9b3")
)
data <- httr::content(pg)$data

I don't know whether the api-key will keep working in the future, but it works for now.
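If it ever stops working, it's better to fail loudly than to silently parse an error payload; a minimal guard to run before pulling content(pg):

# stop early if the key no longer works
if (httr::status_code(pg) != 200) {
  stop("request failed with HTTP status ", httr::status_code(pg))
}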

How can I extract a specific table from a website that has multiple tables in R?

The table is commented out in the page source. You can grab the comments with XPath, then parse out the table you want:

library(rvest)

page <- read_html('https://www.basketball-reference.com/leagues/NBA_2018.html')

df <- page %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('#team-stats-per_game') %>%
  html_table()
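If you're not sure of the id, the same comment-parsing trick will list every table hidden this way; a quick sketch reusing page from above:

# list the ids of all tables embedded in HTML comments
page %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_nodes('table') %>%
  html_attr('id')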

Scraping html tables into R data frames using the XML package

…or a shorter try:

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",
                 .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

The picked table is the longest one on the page:

tables[[which.max(n.rows)]]
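Picking the longest table is only a heuristic; if you know a header the table must contain, selecting by name is more robust. A sketch, where "Opponent" is a hypothetical placeholder for whatever header you expect:

# pick the first table that has a known column header
idx <- which(sapply(tables, function(t) "Opponent" %in% names(t)))[1]
tables[[idx]]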

R rvest: extracting html table that are loaded dynamically

As suggested by @AllanCameron, we can extract the table using RSelenium and rvest. Here's a script that worked for me:

library(RSelenium)
library(rvest)
library(magrittr)

URL <- "https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20170101&end=20201113"

# Open firefox and grab the page source
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(URL)
html <- remDr$getPageSource()[[1]]

# Extract the table from the source
DF <- read_html(html) %>%
  html_nodes("table") %>%
  `[[`(3) %>%
  html_table() %>%
  data.frame()

# Close the connection and stop the selenium server
remDr$close()
rD$server$stop()

How to write an R script to extract URLs from an HTML table

The links come from an additional GET request made by the browser, which returns XML. You can still go with rvest: grab the Key nodes, then complete the URLs.

library(rvest)

base_url <- "https://divvy-tripdata.s3.amazonaws.com"
files <- read_html(base_url) |>
  html_elements('key') |>
  html_text() |>
  url_absolute(base_url)

For older R versions, swap |> for %>% and add library(magrittr) as an import.
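Once you have the URLs, fetching the files is a one-liner per file; a minimal sketch:

# download each file into the working directory
for (f in files) {
  download.file(f, destfile = basename(f), mode = "wb")
}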

Scrape HTML table with R

This gets you the table you are looking for:

library(rvest)

# mrkt_snpshot is the parsed page from the question,
# e.g. mrkt_snpshot <- read_html(url)
mrkt_snpshot %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  .[[24]]
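Hard-coded indexes like .[[24]] break when the page changes; you can rediscover the right slot by inspecting each table's dimensions. A quick sketch, assuming mrkt_snpshot as above:

tbls <- mrkt_snpshot %>% html_nodes("table") %>% html_table(fill = TRUE)
sapply(tbls, dim)  # find the table whose shape matches what you expect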

