Scraping a Complex HTML Table into a Data.Frame in R

Maybe like this, using rvest with xml2 to strip the hidden sort-key spans that duplicate each name:

library(rvest)
library(xml2)

html <- read_html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges <- html_table(html_nodes(html, "table")[[2]])
head(judges[[2]])
# [1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"
# [3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
# [5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"

# the doubled names come from hidden sort-key <span>s inside each cell;
# remove those spans, then re-parse the table
xml_remove(xml_find_all(html, "//table//tr/td[2]/span"))
judges <- html_table(html_nodes(html, "table")[[2]])
head(judges[[2]])
# [1] "James Wilson"    "John Jay†"       "William Cushing" "John Blair, Jr." "John Rutledge"   "James Iredell"

R: scrape nested html table with links (table within cell)

Yes, the tables nested within the rows of the parent table do make this more difficult. The key for this one is to find the 27 rows of the table and then parse each row individually.

library(rvest)
library(stringr)
library(dplyr)

#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
tables <- html %>% html_nodes("table")
table <- tables[[9]]

# find all of the table's rows
trows <- table %>% html_nodes("tr")
# find the left column
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()
# find the right column (str_squish() trims the ends and collapses internal whitespace)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish()
# get the links
links <- trows %>% html_node("td a") %>% html_attr("href")

answer <- data.frame(leftside, rightside, links)

One will need to use paste0("https://www.accessdata.fda.gov", answer$links) on some of the links to obtain the full web address.
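For example, a minimal sketch of that completion step (assuming the relative links begin with a "/"; the full_link column name is just for illustration):

# prepend the site root only where the link is relative
# (an assumption: relative links here start with "/")
answer$full_link <- ifelse(
  is.na(answer$links) | grepl("^https?://", answer$links),
  answer$links,
  paste0("https://www.accessdata.fda.gov", answer$links)
)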

The final data frame does have several cells containing NA; these can be removed, and the table can be cleaned up further depending on the final requirements. See tidyr::fill() as a good starting point.

Update

To reduce the answer down to the desired 19 original rows:

library(tidyr)
# replace NA with blanks
answer$links <- replace_na(answer$links, "")
# fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")

# create the final results
finalanswer <- answer %>%
  group_by(leftside) %>%
  summarize(info = paste(rightside, collapse = " "), link = first(links))

Scrape a table from a non-HTML website with R yet examples shown are for HTML

Actually, that's pretty straightforward (based on @lukeA's answer):

library(rvest)

url <- "https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp"

page <- read_html(url)
nodes <- html_nodes(page, "table") # you can use SelectorGadget to identify the node
table <- html_table(nodes[[1]]) # each element of the nodes list is one table that can be extracted
head(table)
Code Main Specialty Title
1 Surgical Specialties Surgical Specialties Surgical Specialties
2 100 GENERAL SURGERY
3 101 UROLOGY
4 110 TRAUMA & ORTHOPAEDICS
5 120 ENT
6 130 OPHTHALMOLOGY

SelectorGadget can be installed by following Hadley Wickham's "SelectorGadget" vignette for rvest.

How to scrape HTML table in R using rvest for table with href in all columns?

As I mentioned in the comments, this particular page uses invalid HTML. It has nothing to do with the href in every column: each row of the table is missing its opening <tr> tag, which makes the page very difficult for automated tools to parse. Here's a hack that extracts the cells, manually organizes them into 8 columns, and then parses the data into a data.frame.

url <- "http://modules.ussquash.com/ssm/pages/leagues/list_scorecard.asp?id=105252"
page <- url %>%
read_html()
table <- page %>% html_nodes(xpath='//*[@id="corebody"]/table[4]')
cells <- table %>%
html_nodes(xpath='.//td[not(@class="Line")]') %>%
html_text()
headers <- table %>%
html_nodes(xpath='.//th') %>%
html_text()

Ncol <- 8
# stitch every consecutive run of 8 cells back into one tab-separated row,
# then let read.table() rebuild the data frame
rows <- sapply(split(cells, rep(1:(length(cells)/Ncol), each=Ncol)), paste, collapse="\t")
dd <- read.table(text=rows, header=FALSE, sep="\t", col.names = headers)
head(dd)
# X. Bowdoin.College Rating Bates.College Rating.1
# 1 1S Squiers, Ian Lapsed-member Yousry, Mahmoud 5.724126
# 2 2S Butler, Satya P Lapsed-member Bonnell, Graham 5.845270
# 3 3S Cooley, George W Lapsed-member Attia, Omar 5.725947
# 4 4S Leech, Gannon 5.362050 Nambiar, Anirudh Lapsed-member
# 5 5S Shonrock, Tyler 5.576421 Abbott, McLeod Lapsed-member
# 6 6S Khanna, Uday 5.208504 McComish, Benni 5.448778
# Winner.s.Score Winner Status
# 1 11-8,11-4,11-6 Bates College C
# 2 11-4,11-5,11-5 Bates College C
# 3 12-10,11-9,11-9 Bates College C
# 4 11-7,11-6,11-4 Bates College C
# 5 4-11,11-6,11-5,11-7 Bates College C
# 6 11-3,11-8,5-11,11-9 Bates College C

rvest: Scraping table from webpage

That site is blocked for me, so I can't see the page itself, but broadly it should be done like this.

The html_nodes() function extracts every element matching a CSS selector (or XPath expression), returning a node set that you can filter further or convert into R data structures.

library(rvest)

## Loading required package: xml2

# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"

scistarter_html <- read_html(URL)
scistarter_html

## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n \n \n <svg style="position: absolute; width: 0; he ...

We’re able to retrieve the same HTML code we saw in our browser. That isn't useful by itself, but it confirms the download worked. Now we can begin filtering through the HTML to find the data we’re after.

The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.

This grabs all the anchor (<a>) nodes, i.e. the links:

scistarter_html %>%
  html_nodes("a") %>%
  head()

## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] <a href="/dashboard">My Account</a>
## [3] <a href="/finder" class="is-active">Project Finder</a>
## [4] <a href="/events">Event Finder</a>
## [5] <a href="/people-finder">People Finder</a>
## [6] <a href="#dialog-login" rel="modal:open">log in</a>

In a more complex example, we could use these links to “crawl” across pages; a minimal sketch of the idea follows, though a full treatment is for another day.
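Purely as an illustration (the two-second pause and the three-page cap are arbitrary choices, not part of the original answer):

library(purrr)

# collect every href on the page
hrefs <- scistarter_html %>%
  html_nodes("a") %>%
  html_attr("href")

# keep only absolute http(s) URLs, then fetch a few of them politely
pages <- hrefs %>%
  keep(~ grepl("^https?://", .x)) %>%
  head(3) %>%
  map(~ { Sys.sleep(2); read_html(.x) })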

Every div on the page:

scistarter_html %>%
  html_nodes("div") %>%
  head()

## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n < ...
## [3] <div class="nav-tools">\n <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n <div class="field">\n ...
## [5] <div class="field">\n <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n <d ...

… the nav-tools div. This selects by CSS class, matching elements with class="nav-tools":

scistarter_html %>%
  html_nodes("div.nav-tools") %>%
  head()

## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n <div class="nav-tools__search"> ...

We can call the nodes by id as follows.

scistarter_html %>%
  html_nodes("div#project-listing") %>%
  head()

## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n \n ...

We can select all the tables as follows:

scistarter_html %>%
  html_nodes("table") %>%
  head()

## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
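
From here, any of these nodes can be converted to a data frame with html_table(), just as in the earlier answers (a sketch; picking the first table is arbitrary):

projects <- scistarter_html %>%
  html_nodes("table") %>%
  `[[`(1) %>%
  html_table()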

See the related link below for more info:

https://rpubs.com/Radcliffe/superbowl

R rvest: extracting html tables that are loaded dynamically

As suggested by @AllanCameron, we can extract the table using RSelenium and rvest: the table is rendered client-side by JavaScript, so a plain read_html() on the URL never sees it. Here's a script that worked for me:

library(RSelenium)
library(rvest)
library(magrittr)

URL <- "https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20170101&end=20201113"

# Open Firefox, navigate to the page, and grab the rendered source
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(URL)
html <- remDr$getPageSource()[[1]]

# Extract the third table from the rendered source
DF <- read_html(html) %>%
  html_nodes("table") %>%
  `[[`(3) %>%
  html_table() %>%
  data.frame()

# Close the browser and stop the Selenium server
remDr$close()
rD$server$stop()

Extracting multiple tables from webpage containing hyperlinks with R

This is more of a blog post or tutorial than an SO answer, but I can appreciate the desire to learn. I'm also working on a book on this very topic, and this seems like a good example.

library(rvest)
library(xml2)   # for xml_find_first() / xml_remove() used below
library(tidyverse)

We'll start with the top-level page:

pg <- read_html("https://www.eia.gov/naturalgas/archive/petrosystem/petrosysog.html")

Now, we'll use an XPath expression that only gets us table rows containing state data. Compare the XPath expression to the tags in the HTML and this should make sense: find all <tr>s without colspan attributes, and of those keep only the <tr>s that have both the right class and a link to a state:

states <- html_nodes(pg, xpath=".//tr[td[not(@colspan) and 
contains(@class, 'links_normal') and a[@name]]]")

tibble(
  state = html_text(html_nodes(states, xpath=".//td[1]")),
  link  = html_attr(html_nodes(states, xpath=".//td[2]/a"), "href")
) -> state_tab

It's in a data frame to keep it tidy and handy.

You'll need to put the next bit below the function that comes after it, but I need to explain the iteration before showing the function.

We need to iterate over each link. In each iteration, we:

  • pause since your needs aren't more important than EIA's server load
  • find all "branch" <div>s since they hold two pieces of information we need (the state+year and the data table for said state+year).
  • wrap it all up in a nice data frame

Rather than clutter up the anonymous function, we'll put that functionality in another function (again, which needs to be defined before this iterator will work):

pb <- progress_estimated(nrow(state_tab))

map_df(state_tab$link, ~{

  pb$tick()$print()

  pg <- read_html(sprintf("https://www.eia.gov/naturalgas/archive/petrosystem/%s", .x))

  Sys.sleep(5) # scrape responsibly

  html_nodes(pg, xpath=".//div[@class='branch']") %>%
    map_df(extract_table)

}) -> og_df

This is the hard worker of the bunch: we need to find all the State + Year labels on the page (each is in a <table>), then find the tables with data in them. I take the liberty of removing the explanatory blurb at the bottom of each and also turn each into a tibble (but that's just my class preference):

extract_table <- function(pg) {

  # the state + year caption row, and the data table itself
  t1 <- html_nodes(pg, xpath=".//../tr[td[contains(@class, 'SystemTitle')]][1]")
  t2 <- html_nodes(pg, xpath=".//table[contains(@summary, 'Report')]")

  state_year <- (html_text(t1, trim=TRUE) %>% strsplit(" "))[[1]]

  # drop the explanatory blurb spanning the bottom of the table
  xml_find_first(t2, "td[@colspan]") %>% xml_remove()

  html_table(t2, header=FALSE)[[1]] %>%
    mutate(state=state_year[1], year=state_year[2]) %>%
    as_tibble()

}

To reiterate: the iteration block shown above has to come after this function definition, so define extract_table() first and then run the map_df() loop.


And, it works (you said you'd do the final cleanup separately):

glimpse(og_df)
## Observations: 14,028
## Variables: 19
## $ X1 <chr> "", "Prod.RateBracket(BOE/Day)", "0 - 1", "1 - 2", "2 - 4", "4 - 6", "...
## $ X2 <chr> "", "||||", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|"...
## $ X3 <chr> "Oil Wells", "# ofOilWells", "26", "19", "61", "61", "47", "36", "250"...
## $ X4 <chr> "Oil Wells", "% ofOilWells", "5.2", "3.8", "12.1", "12.1", "9.3", "7.1...
## $ X5 <chr> "Oil Wells", "AnnualOilProd.(Mbbl)", "4.1", "7.8", "61.6", "104.9", "1...
## $ X6 <chr> "Oil Wells", "% ofOilProd.", "0.1", "0.2", "1.2", "2.1", "2.2", "2.3",...
## $ X7 <chr> "Oil Wells", "OilRateper Well(bbl/Day)", "0.5", "1.4", "3.0", "4.9", "...
## $ X8 <chr> "Oil Wells", "AnnualGasProd.(MMcf)", "1.5", "3.5", "16.5", "19.9", "9....
## $ X9 <chr> "Oil Wells", "GasRateper Well(Mcf/Day)", "0.2", "0.6", "0.8", "0.9", "...
## $ X10 <chr> "", "||||", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|"...
## $ X11 <chr> "Gas Wells", "# ofGasWells", "365", "331", "988", "948", "867", "674",...
## $ X12 <chr> "Gas Wells", "% ofGasWells", "5.9", "5.4", "16.0", "15.4", "14.1", "10...
## $ X13 <chr> "Gas Wells", "AnnualGasProd.(MMcf)", "257.6", "1,044.3", "6,360.6", "1...
## $ X14 <chr> "Gas Wells", "% ofGasProd.", "0.1", "0.4", "2.6", "4.2", "5.3", "5.4",...
## $ X15 <chr> "Gas Wells", "GasRateper Well(Mcf/Day)", "2.2", "9.2", "18.1", "30.0",...
## $ X16 <chr> "Gas Wells", "AnnualOilProd.(Mbbl)", "0.2", "0.6", "1.6", "2.0", "2.4"...
## $ X17 <chr> "Gas Wells", "OilRateper Well(bbl/Day)", "0.0", "0.0", "0.0", "0.0", "...
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala...
## $ year <chr> "2009", "2009", "2009", "2009", "2009", "2009", "2009", "2009", "2009"...

Web Scraping in R with loop from data.frame

The problem is in how you're structuring your for loop, but it's much easier not to use one in the first place: R has great support for iterating over lists, e.g. lapply and purrr::map. One version of how you could structure your data:

library(tidyverse)
library(rvest)

base_url <- "https://www.whatmobile.com.pk/"

models <- tibble(model = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"),
                 link  = paste0(base_url, model),
                 page  = map(link, read_html))

model_specs <- models %>%
  mutate(node  = map(page, html_node, '.specs'),
         specs = map(node, html_table, header = TRUE, fill = TRUE),
         specs = map(specs, set_names, c('var1', 'var2', 'val1', 'val2'))) %>%
  select(model, specs) %>%
  unnest(specs)

model_specs
#> # A tibble: 119 x 5
#> model var1 var2
#> <chr> <chr> <chr>
#> 1 Qmobile_Noir-M6 Build OS
#> 2 Qmobile_Noir-M6 Build Dimensions
#> 3 Qmobile_Noir-M6 Build Weight
#> 4 Qmobile_Noir-M6 Build SIM
#> 5 Qmobile_Noir-M6 Build Colors
#> 6 Qmobile_Noir-M6 Frequency 2G Band
#> 7 Qmobile_Noir-M6 Frequency 3G Band
#> 8 Qmobile_Noir-M6 Frequency 4G Band
#> 9 Qmobile_Noir-M6 Processor CPU
#> 10 Qmobile_Noir-M6 Processor Chipset
#> # ... with 109 more rows, and 2 more variables: val1 <chr>, val2 <chr>

The data is still pretty messy, but at least it's all there.
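
As one hedged example of a first cleanup pass (assuming, from the output above, that var1 is the spec category, var2 the spec name, and val1 the primary value; the column names below are just for illustration, and dplyr is already attached via tidyverse):

# reshape into one row per spec, dropping empty values
clean_specs <- model_specs %>%
  filter(!is.na(val1), val1 != "") %>%
  transmute(model, category = var1, spec = var2, value = val1)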


