How to Scrape Tables Inside a Comment Tag in HTML with R

How to scrape tables inside a comment tag in HTML with R?

OK, got it:

library(stringi)
library(knitr)
library(rvest)


any_version_html <- function(x){
  # parse a character vector of HTML with the XML package
  XML::htmlParse(x)
}

a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
# read the raw page source; tables hidden inside comment tags are still present as text
b <- readLines(a)
c <- paste0(b, collapse = "")
# pull out every <table>...</table> fragment, including those inside comments
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = T, simplify = T)))

# parse the fragments and convert them to data frames
e <- html_table(any_version_html(d))


> kable(summary(e),'rst')
====== ========== ====
Length Class Mode
====== ========== ====
9 data.frame list
2 data.frame list
24 data.frame list
21 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
3 data.frame list
====== ========== ====


kable(e[[1]],'rst')


=== ================ === ==== === ================== === === =================================
No. Player Pos Ht Wt Birth Date  Exp College
=== ================ === ==== === ================== === === =================================
41 Cameron Bairstow PF 6-9 250 December 7, 1990 au R University of New Mexico
0 Aaron Brooks PG 6-0 161 January 14, 1985 us 6 University of Oregon
21 Jimmy Butler SG 6-7 220 September 14, 1989 us 3 Marquette University
34 Mike Dunleavy SF 6-9 230 September 15, 1980 us 12 Duke University
16 Pau Gasol PF 7-0 250 July 6, 1980 es 13
22 Taj Gibson PF 6-9 225 June 24, 1985 us 5 University of Southern California
12 Kirk Hinrich SG 6-4 190 January 2, 1981 us 11 University of Kansas
3 Doug McDermott SF 6-8 225 January 3, 1992 us R Creighton University


## Realized we should index with some names... but this is somewhat cheating, as we know the start and end indexes for the table titles. I prefer to parse in the dark.

# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = T)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')

=== ============== ===========
Rk Player Salary
=== ============== ===========
1 Derrick Rose $18,862,875
2 Carlos Boozer $13,550,000
3 Joakim Noah $12,200,000
4 Taj Gibson $8,000,000
5 Pau Gasol $7,128,000
6 Nikola Mirotic $5,305,000
=== ============== ===========

Parsing table data from an HTML comment with BeautifulSoup

The desired table data is inside an HTML comment, so you can use BeautifulSoup's built-in Comment class together with a lambda function to grab it.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from bs4 import Comment

url = 'https://www.baseball-reference.com/register/team.cgi?id=9995d2a1'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')

# find the comment node that contains the pitching table and let pandas parse it
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
df = pd.read_html([x for x in comments if 'id="div_team_pitching"' in x][0])[0]
print(df)

Output:

 Rk                      Name   Age  W  L   W-L%  ...    H9   HR9   BB9   SO9  SO/W  Notes
0 1.0 Logan Bursick-Harrington 21.0 0 2 0.000 ... 4.5 0.0 15.8 15.8 1.00 NaN
1 2.0 Cylis Cox* 19.0 1 0 1.000 ... 23.1 0.0 7.7 11.6 1.50 NaN
2 3.0 Travis Densmore* 21.0 0 1 0.000 ... 7.2 0.0 1.8 14.4 8.00 NaN
3 4.0 Dylan Freeman 22.0 1 0 1.000 ... 13.5 1.1 3.4 14.6 4.33 NaN
4 5.0 Zach Hopman* 22.0 0 1 0.000 ... 12.8 0.0 9.9 11.4 1.14 NaN
5 6.0 Eamon Horwedel 22.0 1 0 1.000 ... 9.0 0.0 6.4 6.4 1.00 NaN
6 7.0 Tyler Johnson 19.0 0 0 NaN ... 5.4 0.0 2.7 10.8 4.00 NaN
7 8.0 Trent Jones 20.0 0 0 NaN ... 14.6 1.1 2.3 12.4 5.50 NaN
8 9.0 Tanner Knapp 21.0 1 1 0.500 ... 11.6 0.0 7.7 4.8 0.63 NaN
9 10.0 Mason Majors 22.0 1 0 1.000 ... 4.9 0.0 7.4 12.3 1.67 NaN
10 11.0 Mason Meeks 21.0 0 1 0.000 ... 6.3 0.9 3.6 5.4 1.50 NaN
11 12.0 Sam Nagelvoort 19.0 0 1 0.000 ... 18.0 2.3 22.5 9.0 0.40 NaN
12 13.0 Tyler Nichol 20.0 0 0 NaN ... 27.0 0.0 27.0 0.0 0.00 NaN
13 14.0 Cole Russo 19.0 0 0 NaN ... 27.0 13.5 0.0 0.0 NaN NaN
14 15.0 Kyle Salley* 22.0 0 1 0.000 ... 9.0 2.3 22.5 9.0 0.40 NaN
15 16.0 Noah Stants 21.0 0 0 NaN ... 4.3 1.4 7.1 11.4 1.60 NaN
16 17.0 Quinn Waterhouse* 21.0 0 0 NaN ... 4.5 0.0 4.5 18.0 4.00 NaN
17 18.0 Nick Weyrich 19.0 0 0 NaN ... 6.4 1.3 7.7 11.6 1.50 NaN
18 19.0 Adam Wheaton 23.0 0 1 0.000 ... 11.7 1.8 4.5 12.6 2.80 NaN
19 NaN 19 Players 20.9 5 9 0.357 ... 9.2 0.8 6.9 10.7 1.55 NaN

[20 rows x 32 columns]

R: scrape nested html table with links (table within cell)

Yes, the tables nested within the rows of the parent table do make this more difficult. The key here is to find the 27 rows of the parent table and then parse each row individually.

library(rvest)
library(stringr)
library(dplyr)

#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
tables <- html %>% html_nodes("table")
table <- tables[[9]]


#find all of the table's rows
trows <- table %>% html_nodes("tr")
#find the left column
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()
#find the right column (remove whitespace at the end and in the middle)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish() %>% trimws()
#get links
links <- trows %>% html_node("td a") %>% html_attr("href")

answer <- data.frame(leftside, rightside, links)

One will need to use paste0("https://www.accessdata.fda.gov/", answer$links) on some of the links to obtain the full web address.
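
For example, a minimal sketch in base R (it assumes the relative links in answer$links are site-relative paths; full_link is just an illustrative column name):

# prepend the site root only to links that are not already absolute URLs
rel <- !is.na(answer$links) & !grepl("^https?://", answer$links)
answer$full_link <- ifelse(rel, paste0("https://www.accessdata.fda.gov/", answer$links), answer$links)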

The final data frame does have several cells containing NA; these can be removed and the table cleaned up further depending on the final requirements. See tidyr::fill() as a good starting point.

Update

To reduce the answer down to the desired 19 original rows:

library(tidyr)
#replace NA with blanks
answer$links <- replace_na(answer$links, "")
#fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")

#Create the final results
finalanswer <- answer %>%
  group_by(leftside) %>%
  summarize(info = paste(rightside, collapse = " "), link = first(links))

How can I extract a specific table from a website that has multiple tables in R?

It is commented out. You can grab the comments with XPath and then pull out the table you want.

library(rvest)

page <- read_html('https://www.basketball-reference.com/leagues/NBA_2018.html')

df <- page %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('#team-stats-per_game') %>%
  html_table()

R web scraping packages failing to read in all tables of url

If you use a browser like Chrome, you can go into settings and disable JavaScript. You will then see that only a few tables are present; the rest require JavaScript to run in order to load. Those tables, as displayed in the browser, are not being loaded when you use your current method. Possible solutions are:

  1. Use a method like RSelenium which will allow JavaScript to run (see the sketch after this list)
  2. Inspect the HTML of the page to see if the info is stored elsewhere and can be obtained from there. Sometimes info is retrieved from script tags, for example, where it is stored as a json/javascript object
  3. Monitor network traffic when refreshing the page (F12 to open dev tools, then the Network tab) and see if you can find the source the additional content is being loaded from. You may find other endpoints you can use.
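
For option 1, a minimal sketch with RSelenium might look like the following (it assumes a compatible browser driver is installed; the port number is arbitrary and the boxscore URL is the one used in the example further down):

library(RSelenium)
library(rvest)

# start a browser session (assumes a working driver is available)
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client

remDr$navigate("https://www.pro-football-reference.com/boxscores/201209050nyg.htm")

# the rendered source now contains the javascript-revealed tables as ordinary HTML
rendered <- read_html(remDr$getPageSource()[[1]])
tables <- rendered %>% html_nodes("table") %>% html_table()

remDr$close()
rD$server$stop()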

Looking at the page, it seems that at least two of those missing tables (likely all of them) are actually stored in comments in the returned HTML, associated with divs having class placeholder, and that you need to either remove the comment marks or use a method that allows for parsing comments. Presumably, when JavaScript runs, these comments are converted to displayed content.
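
For the first of those two approaches, removing the comment marks, a minimal sketch is the following (the #game_info selector is the same one used in the example below):

library(rvest)

raw <- readLines('https://www.pro-football-reference.com/boxscores/201209050nyg.htm', warn = FALSE)
# strip the comment delimiters so the hidden tables become ordinary HTML
cleaned <- gsub("<!--|-->", "", paste(raw, collapse = "\n"))

df <- read_html(cleaned) %>%
  html_node('#game_info') %>%
  html_table()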

Following this answer by @alistaire, one method is as follows (shown for a single example table):

library(rvest)

h <- read_html('https://www.pro-football-reference.com/boxscores/201209050nyg.htm')

df <- h %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('#game_info') %>%
  html_table()

Using BeautifulSoup to scrape tables within comment tags

Here you go. You can get any table from that page just by changing the index number.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm').text

soup = BeautifulSoup(page,'lxml')
table = soup.find_all('table')[1]  # index of the table on the page; change it to get a different table
tab_data = [[celldata.text for celldata in rowdata.find_all(["th", "td"])]
            for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))

As the other tables, apart from the first two, only show up once JavaScript runs, you need Selenium to render the page before parsing it. You will then be able to access any table from that page. Here is the modified version.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
table = soup.find_all('table')[7]  # index of the table on the page; change it to get a different table
tab_data = [[celldata.text for celldata in rowdata.find_all(["th", "td"])]
            for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))

Scrape a URL with several tables with Rvest

The "Advanced" table is hidden under comments, hence it isn't directly accessible. We can get all the comments together using xpath and then parse the table from it.

library(rvest)
url = "https://www.basketball-reference.com/players/l/leonaka01.html"

url %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  toString() %>%
  read_html() %>%
  html_node('table#advanced') %>%
  html_table()

# Season Age Tm Lg Pos G MP PER TS% 3PAr FTr ORB% ...
#1 2011-12 20 SAS NBA SF 64 1534 16.6 0.573 0.270 0.218 7.9 ...
#2 2012-13 21 SAS NBA SF 58 1810 16.4 0.592 0.331 0.240 4.3 ...
#3 2013-14 22 SAS NBA SF 66 1923 19.4 0.602 0.282 0.195 4.6 ...
#4 2014-15 23 SAS NBA SF 64 2033 22.0 0.567 0.234 0.307 4.8 ...
#5 2015-16 24 SAS NBA SF 72 2380 26.0 0.616 0.267 0.306 4.7 ...
#6 2016-17 25 SAS NBA SF 74 2474 27.6 0.610 0.295 0.406 3.7 ...
#7 2017-18 26 SAS NBA SF 9 210 26.0 0.572 0.315 0.342 3.1 ...
#8 2018-19 27 TOR NBA SF 60 2040 25.8 0.606 0.267 0.377 4.2 ...
#9 2019-20 28 LAC NBA SF 6 183 35.1 0.572 0.230 0.319 5.5 ...
#10 Career NA NBA 473 14587 22.8 0.599 0.276 0.318 4.8 ...
#11 NA NA NA NA NA NA NA NA ...
#12 7 seasons NA SAS NBA 407 12364 22.1 0.597 0.279 0.305 4.8 ...
#13 1 season NA TOR NBA 60 2040 25.8 0.606 0.267 0.377 4.2 ...
#14 1 season NA LAC NBA 6 183 35.1 0.572 0.230 0.319 5.5 ...

rvest - Scraping a Table

When you look at the raw list rw.list read in from html_table(), there are three cases to be handled differently (see the if/else branches below).

library(rvest)

path <- 'https://services2.hdb.gov.sg/webapp/AA16RMSBusinessDirectory/AA16SLevelmap?SearchOption=1&BLK=166&STREET=WOODLANDS+STREET+13++++++++++++++++++++++++++++++++++++++++++++++++++%EF%BF%BD&pcode=730166&STREETLIST=--&MAIN_TRADE_CODE=0000Please+Select+Category%24&Forward=&FROMHOME=true&Slvl=1&SEARCHPANEL=1&MAIN_TRADE_DESC'

# Parsing the HTML Code from Website
rw <- read_html(path)
rw.list <- html_table(rw)[-1]
names(rw.list) <- lapply(rw.list, function(x)  # attribute clean names
  unique(gsub("\\n|\\r|\\t|\\s+(More Information)?", "", x[1, ])))

l1 <- lapply(rw.list, function(x) t(x[-(1:2), ]))

l1 <- lapply(1:length(l1), function(x) {
  d <- as.data.frame(l1[[x]], stringsAsFactors = FALSE)
  names(d) <- d[1, ]
  if (length(d) == 10 | length(d) == 6)
    out <- matrix(unlist(d[3, grep("Category|Trade|(Tel No)", names(d))]),
                  ncol = 2,
                  dimnames = list(NULL, d[1, 1:2]))
  else if (length(d) == 8)
    out <- matrix(unlist(t(d[3, grep("Category|Trade|(Tel No)", names(d))])),
                  ncol = 3, byrow = TRUE, dimnames = list(NULL, d[1, 1:3]))
  else
    out <- d[3, ]
  return(cbind(id = names(l1)[x], out))
})

The cleaned list can then be merged with Reduce().

result <- Reduce(function(...) merge(..., all=TRUE), l1)

Result

head(result, 3)
# id Category Trade Tel No
# 1 1.GREENEMERALDAQUARIA Pets Aquarium Fish (freshwater/marine) And Accessories 68160208
# 2 2.SEEMRALICIOUS Beauty Beauty Salon 66357994
# 3 3.MORRISONOPTICALPTELTD Shopping Optical Goods & Eyewear 63666300

