rvest: select and scrape html table(s) after a specific (title) string
library(rvest)
library(stringr)
doc <- read_html("https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d")
# extract all the nodes that have the title class (class = "tytul") or are a table
# the CSS selector "," acts like a boolean OR.
nodes <- doc %>% html_nodes(".tytul,table")
# loop through each node
signal <- FALSE
my_tables <- list()
j <- 0
for (i in seq_along(nodes)) {
  # if the title signal was previously set and this is a table tag
  if (signal && html_name(nodes[i]) == "table") {
    cat("Match..\n")
    # get the table (data frame)
    this_table <- html_table(nodes[i], fill = TRUE, header = TRUE)[[1]]
    # append to list
    j <- j + 1
    my_tables[[j]] <- this_table
    # and reset the signal so we search for the next one
    signal <- FALSE
  }
  # if the signal is clear, look for a matching title
  if (!signal) {
    signal <- nodes[i] %>% html_text() %>% str_detect("Tabela.+ICD 9")
  }
}
my_tables[[1]][1:5,]
my_tables[[2]][1:5,]
# > my_tables[[1]][1:5,]
# ICD 9 Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 2.051 Założenie płytki sztucznej do czaszki 168 32,31 7
# 2 1.247 Kraniotomia z usunięciem krwiaka podtwardówkowego 55 10,58 20
# 3 2.022 Odbarczenie złamania czaszki 43 8,27 6
# 4 2.040 Przeszczep kostny do kości czaszki 35 6,73 8
# 5 1.093 Inne aspiracje w zakresie czaszki 33 6,35 5
# > my_tables[[2]][1:5,]
# ICD 9 Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 O35 Sód (Na) 239 45,96 8
# 2 89.00 Porada lekarska, konsultacja, asysta 230 44,23 9
# 3 N45 Potas (K) 217 41,73 8
# 4 87.030 TK głowy bez kontrastu 214 41,15 9
# 5 89.04 Opieka pielęgniarki lub położnej 202 38,85 8
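The signal/flag walk above is language-independent; here is a minimal Python stand-in using only the standard library, with a made-up HTML fragment and a plain substring test in place of the str_detect regex:

```python
from html.parser import HTMLParser

# Hypothetical fragment mimicking the page layout: titled headings
# followed by tables, in document order.
html = """
<p class="tytul">Tabela 1 ICD 9</p>
<table><tr><td>A</td></tr></table>
<p class="tytul">Other heading</p>
<table><tr><td>B</td></tr></table>
"""

class TitleTableScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.signal = False        # armed once a matching title is seen
        self.in_title = False
        self.capture_table = False
        self.tables = []           # text content of captured tables
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "tytul") in attrs:
            self.in_title = True
            self._buf = []
        elif tag == "table" and self.signal:
            self.capture_table = True
            self._buf = []

    def handle_data(self, data):
        if self.in_title or self.capture_table:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_title:
            self.in_title = False
            if "Tabela" in "".join(self._buf):  # stand-in for the regex
                self.signal = True
        elif tag == "table" and self.capture_table:
            self.capture_table = False
            self.signal = False    # reset, so we search for the next title
            self.tables.append("".join(self._buf).strip())

scanner = TitleTableScanner()
scanner.feed(html)
print(scanner.tables)  # only the table after the matching title
```

Only the table immediately following a matching title is kept; the second table is skipped because its heading fails the test.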
Web Scraping, extract table of a page
The page makes an XHR request to another resource, and that response is what actually builds the table.
library(rvest)
library(dplyr)
pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")
html_nodes(pg, "table") %>%
html_table() %>%
.[[1]] %>%
tbl_df() %>%
select(1:2)
## # A tibble: 36 × 2
## R.U.T. Entidad
## <chr> <chr>
## 1 99588060-1 ACE SEGUROS DE VIDA S.A.
## 2 76511423-3 ALEMANA SEGUROS S.A.
## 3 96917990-3 BANCHILE SEGUROS DE VIDA S.A.
## 4 96933770-3 BBVA SEGUROS DE VIDA S.A.
## 5 96573600-K BCI SEGUROS VIDA S.A.
## 6 96656410-5 BICE VIDA COMPAÑIA DE SEGUROS S.A.
## 7 96837630-6 BNP PARIBAS CARDIF SEGUROS DE VIDA S.A.
## 8 76418751-2 BTG PACTUAL CHILE S.A. COMPAÑIA DE SEGUROS DE VIDA
## 9 76477116-8 CF SEGUROS DE VIDA S.A.
## 10 99185000-7 CHILENA CONSOLIDADA SEGUROS DE VIDA S.A.
## # ... with 26 more rows
You can use Developer Tools in any modern browser to monitor the Network requests to find that URL.
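Once you've found such an XHR URL in the Network tab, it can help to pull its query string apart to see which parameters drive the table; a standard-library Python sketch using the URL above:

```python
from urllib.parse import urlparse, parse_qs

# Dissect the XHR URL found via the Network tab into its query
# parameters (parse_qs returns a list of values per key).
url = ("http://www.svs.cl/institucional/mercados/consulta.php"
       "?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")
params = parse_qs(urlparse(url).query)
print(params["mercado"], params["Estado"], params["consulta"])
```

From here you can vary individual parameters to see which ones change the returned table.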
How to scrape all rows from a dynamic table in html using python
You can read it directly using the pandas.read_html() function, which parses each table on the page into a DataFrame for you.
import pandas as pd

def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(df)

main("http://5000best.com/websites/Games/{}/")
Sample of output:
CSV edit:
import pandas as pd

def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(f"Saving Page {item}")
        df.to_csv(f"page{item}.csv", index=False)

main("http://5000best.com/websites/Games/{}/")
Code updated for a single DataFrame:
import pandas as pd

def main(url):
    goal = []
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        goal.append(df)
    final = pd.concat(goal)
    print(final)

main("http://5000best.com/websites/Games/{}/")
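The same accumulate-then-concatenate pattern can be sketched without pandas or the network, using made-up per-page rows in place of the pd.read_html results:

```python
# Stand-in rows for what pd.read_html(url.format(item))[1] would return
# on each page; the data here is made up for illustration.
base = "http://5000best.com/websites/Games/{}/"
fake_pages = {1: [["a", 10]], 2: [["b", 20]], 3: [["c", 30]]}

goal = []
for item in range(1, 4):
    url = base.format(item)        # same URL templating as above
    goal.extend(fake_pages[item])  # analogue of goal.append + pd.concat

print(url)   # last page URL built
print(goal)  # all rows pooled into one structure
```

The `{}` placeholder in the base URL is what `url.format(item)` fills with the page number on each pass.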
How to find a table with Web::Scraper based on cell values?
You need to use an XPath expression to look at the text content of the nodes.
This should do the trick
my $table = scraper {
    process '//table[tr[1]/th[1][normalize-space(text())="Content"]]/tr', 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};
It may look complex, but it's OK if you break it down. It selects all <tr> elements that are children of any <table> element beneath the root for which the first <th> element of the first <tr> element contains text equal to "Content" when normalized (leading and trailing spaces stripped).
Output:
---
rows:
  - cols:
      - col-1
      - col-n
    header: Content
  - cols:
      - 2012
      - 2001
    header: Date
  - cols:
      - val-1
      - val-n
    header: Banana
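A rough Python analogue of that XPath condition, using the standard library's ElementTree on a made-up, well-formed fragment (ElementTree's limited XPath has no normalize-space, so the trimming is done in Python):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one table whose first header cell is "Content"
# (with stray whitespace), and one table that should be skipped.
html = """<root>
<table><tr><th> Content </th><th>Date</th></tr><tr><td>v1</td></tr></table>
<table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
</root>"""

root = ET.fromstring(html)
rows = []
for table in root.iter("table"):
    # first <th> of the first <tr>, as in the XPath predicate
    first_th = table.find("./tr[1]/th[1]")
    if first_th is not None and (first_th.text or "").strip() == "Content":
        rows.extend(table.findall("./tr"))
print(len(rows))  # rows of the matching table only
```

Only the first table passes the check, so its two rows are collected and the second table is ignored.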
Scrape tables by passing multiple search requests using R
The call needs to be to https:, not http:. I also removed the plyr library and used just base R:
library(rvest)
fn = rep(c('HARVEY','HARVEY'));
ln = rep(c('BIDWELL','ADELSON'));
mydf = data.frame(fn, ln);
get_data = function(df){
  root = 'https://npiregistry.cms.hhs.gov/'
  u = paste(root, 'registry/search-results-table?', 'first_name=', df[1],
            '&last_name=', df[2], sep = "");
  # encode url correctly
  url = URLencode(u);
  #print(url)
  # extract data from the right table
  data = read_html(url);
  newresult <- html_nodes(data, "table")[1] %>% html_table()
  # convert result into a data frame
  newresult <- as.data.frame(newresult)
}
mydata = apply(mydf, 1, function(x) { get_data(x)})
#mydata is a list of data frames, do.call creates a single data.frame
finalanswer<-do.call(rbind, mydata)
#finalanswer needs some clean up.
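The paste/URLencode step is the part most likely to bite; for comparison, a Python sketch that builds the same query string with urllib.parse.urlencode (the endpoint path is taken from the answer above):

```python
from urllib.parse import urlencode

# Build the search URL by encoding the name pair as query parameters
# instead of pasting strings by hand.
root = "https://npiregistry.cms.hhs.gov/registry/search-results-table"

def build_url(first, last):
    return root + "?" + urlencode({"first_name": first, "last_name": last})

urls = [build_url("HARVEY", ln) for ln in ("BIDWELL", "ADELSON")]
print(urls[0])
```

urlencode also percent-escapes any characters that would otherwise break the URL, which is what URLencode() handles in the R version.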
Web scraping multiple tables from a single webpage
You can use pandas to pull the data from those tables easily.
import pandas as pd
df = pd.read_html('https://www.sports-reference.com/cbb/players/jaden-ivey-1.html')[0:5]
print(df)
Output:
[ Season School Conf G GS MP FG ... STL BLK TOV PF PTS Unnamed: 27
SOS
0 2020-21 Purdue Big Ten 23 12 24.2 3.9 ... 0.7 0.7 1.3 1.7 11.1 NaN 11.23
1 2021-22 Purdue Big Ten 36 34 31.4 5.6 ... 0.9 0.6 2.6 1.8 17.3 NaN 8.23
2 Career Purdue NaN 59 46 28.6 4.9 ... 0.8 0.6 2.1 1.7 14.9 NaN 9.73
[3 rows x 29 columns], Season School Conf G GS MP FG FGA ... DRB TRB AST STL BLK TOV PF PTS
0 2020-21 Purdue Big Ten 19 10 23.3 3.5 9.2 ... 2.7 3.6 2.1 0.8 0.7 1.4 1.6
10.3
1 2021-22 Purdue Big Ten 19 17 32.6 5.5 12.8 ... 3.3 4.2 2.9 0.9 0.5 2.5 1.9
17.5
2 Career Purdue NaN 38 27 27.9 4.5 11.0 ... 3.0 3.9 2.5 0.9 0.6 1.9 1.8
13.9
[3 rows x 27 columns], Season School Conf G GS MP FG FGA ... DRB TRB AST STL BLK TOV PF PTS
0 2020-21 Purdue Big Ten 23 12 557 89 223 ... 57 76 43 17 16 31 39 256
1 2021-22 Purdue Big Ten 36 34 1132 203 441 ... 152 176 110 33 20 94 63 624
2 Career Purdue NaN 59 46 1689 292 664 ... 209 252 153 50 36 125 102 880
[3 rows x 27 columns], Season School Conf G GS MP FG FGA ... DRB TRB AST STL BLK TOV PF PTS
0 2020-21 Purdue Big Ten 19 10 442 66 174 ... 51 68 39 15 13 26 31 195
1 2021-22 Purdue Big Ten 19 17 620 104 244 ... 62 79 55 18 10 47 36 333
2 Career Purdue NaN 38 27 1062 170 418 ... 113 147 94 33 23 73 67 528
[3 rows x 27 columns], Season School Conf G GS MP FG ... TRB AST STL BLK TOV PF PTS
0 2020-21 Purdue Big Ten 23 12 557 6.4 ... 5.5 3.1 1.2 1.1 2.2 2.8 18.4
1 2021-22 Purdue Big Ten 36 34 1132 7.2 ... 6.2 3.9 1.2 0.7 3.3 2.2 22.0
2 Career Purdue NaN 59 46 1689 6.9 ... 6.0 3.6 1.2 0.9 3.0 2.4 20.8
[3 rows x 25 columns]]
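Note that pandas.read_html returns an ordinary Python list of DataFrames, so the `[0:5]` above is plain list slicing; a stand-in sketch with strings instead of real tables:

```python
# Pretend read_html found 8 tables on the page (made-up stand-ins);
# slicing keeps the first five, just as in the answer above.
tables = [f"table{i}" for i in range(8)]
first_five = tables[0:5]
print(first_five)
```

Indexing with `[0]` instead would give just the first table as a single DataFrame.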
Web Scraping a Table Over Many Days
You're on a good track here; you need a more apt CSS or XPath selector. Using rvest, you can grab both with the same code if your selector is good enough:
library(rvest)
URL1 = "http://www.scoresandodds.com/grid_20161123.html"
URL2 = "http://www.scoresandodds.com/grid_20161125.html"
html1 <- URL1 %>% read_html()
df1 <- html1 %>% html_node('#nba ~ div table') %>% html_table()
html2 <- URL2 %>% read_html()
df2 <- html2 %>% html_node('#nba ~ div table') %>% html_table()
str(df1)
#> 'data.frame': 65 obs. of 7 variables:
#> $ Team : chr "7:05 PM EST" "701 PHOENIX SUNS" "702 ORLANDO MAGIC" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Open : chr "7:05 PM EST" "206.5" "-4.5" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Line Movements: chr "7:05 PM EST" "207.5 / 208 / 209.5" "-4 -15 / -4.5 / -4.5 -05" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Current : chr "7:05 PM EST" "210" "-4" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Moneyline : chr "7:05 PM EST" "+155" "-175" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Halftime : chr "7:05 PM EST" "109" "-4" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Scores : chr "7:05 PM EST" "92Under 210" "87final" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
str(df2)
#> 'data.frame': 75 obs. of 7 variables:
#> $ Team : chr "1:05 PM EST" "701 SAN ANTONIO SPURS" "702 BOSTON CELTICS" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Open : chr "1:05 PM EST" "-2.5" "203.5" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Line Movements: chr "1:05 PM EST" "-3 / -3.5 -15 / -3.5" "199 / 200 / 201" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Current : chr "1:05 PM EST" "-3.5 -05" "201.5" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Moneyline : chr "1:05 PM EST" "-155" "+135" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Halftime : chr "1:05 PM EST" "-4.5" "106" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Scores : chr "1:05 PM EST" "109Over 201.5" "103final" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
In this case, the CSS selector:
- looks for a node with an ID of nba, then
- looks for a div after that, then
- selects the table node inside of it.
You can write the same thing in XPath, if you like, which would let you use the XML package, if you really like. If you want to up your CSS selector skills, the tutorial linked in ?rvest::html_node
is fun and efficient.
If you want to scrape a lot of similar URLs at once, you can put them in a vector and iterate over it with lapply
, or more conveniently purrr::map_df
. Scrape responsibly; it's kind to put a Sys.sleep
call in the anonymous function so as to behave more like a normal site visitor.
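Since the per-day URLs follow a date pattern, the vector of URLs can be generated rather than typed; a Python sketch of the idea (in R you might use seq.Date with sprintf), with the polite pause noted as a comment:

```python
from datetime import date, timedelta

# Generate one grid URL per day in the range; when actually fetching,
# add a pause between requests (e.g. time.sleep), per the advice above.
start, end = date(2016, 11, 23), date(2016, 11, 25)
urls = []
d = start
while d <= end:
    urls.append(f"http://www.scoresandodds.com/grid_{d:%Y%m%d}.html")
    d += timedelta(days=1)
print(urls)
```

Each generated URL can then be fed through the same read_html/html_node pipeline shown above.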