How to Select from Only One Table with Web::Scraper

rvest: select and scrape html table(s) after a specific (title) string

library(rvest)
library(stringr)

doc <- read_html("https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d")

# extract all the nodes that have the title class (class = "tytul") or are a table;
# the CSS selector "," acts like a boolean OR.
nodes <- doc %>% html_nodes(".tytul,table")

# loop through each node
signal <- FALSE
my_tables <- list()
j <- 0
for (i in seq_along(nodes)) {

  # if the title signal was previously set and this is a table tag
  if (signal & html_name(nodes[i]) == "table") {
    cat("Match..\n")

    # get the table (data frame)
    this_table <- html_table(nodes[i], fill = TRUE, header = TRUE)[[1]]

    # append it to the list
    j <- j + 1
    my_tables[[j]] <- this_table

    # and reset the signal so we search for the next one
    signal <- FALSE
  }

  # if the signal is clear, look for a matching title
  if (!signal) {
    signal <- nodes[i] %>% html_text() %>% str_detect("Tabela.+ICD 9")
  }
}
my_tables[[1]][1:5,]
my_tables[[2]][1:5,]

# > my_tables[[1]][1:5,]
# ICD 9 Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 2.051 Założenie płytki sztucznej do czaszki 168 32,31 7
# 2 1.247 Kraniotomia z usunięciem krwiaka podtwardówkowego 55 10,58 20
# 3 2.022 Odbarczenie złamania czaszki 43 8,27 6
# 4 2.040 Przeszczep kostny do kości czaszki 35 6,73 8
# 5 1.093 Inne aspiracje w zakresie czaszki 33 6,35 5
# > my_tables[[2]][1:5,]
# ICD 9 Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 O35 Sód (Na) 239 45,96 8
# 2 89.00 Porada lekarska, konsultacja, asysta 230 44,23 9
# 3 N45 Potas (K) 217 41,73 8
# 4 87.030 TK głowy bez kontrastu 214 41,15 9
# 5 89.04 Opieka pielęgniarki lub położnej 202 38,85 8
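
If you'd rather skip the flag loop, XPath's following:: axis can jump from each matching title straight to the next table in document order. A minimal sketch, assuming the same page structure and that the "Tabela.+ICD 9" pattern is enough to identify the titles:

library(rvest)
library(stringr)

doc <- read_html("https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d")

# keep only the title nodes matching the same pattern as above
titles <- html_nodes(doc, ".tytul")
titles <- titles[str_detect(html_text(titles), "Tabela.+ICD 9")]

# for each title, grab the first table that follows it in the document
my_tables <- lapply(titles, function(t) {
  t %>%
    html_node(xpath = "following::table[1]") %>%
    html_table(fill = TRUE, header = TRUE)
})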

Web Scraping, extract table of a page

The page makes an XHR request to another resource, and the response to that request is what's used to build the table. You can read that resource directly:

library(rvest)
library(dplyr)

pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")

html_nodes(pg, "table") %>%
html_table() %>%
.[[1]] %>%
as_tibble() %>%
select(1:2)
## # A tibble: 36 × 2
## R.U.T. Entidad
## <chr> <chr>
## 1 99588060-1 ACE SEGUROS DE VIDA S.A.
## 2 76511423-3 ALEMANA SEGUROS S.A.
## 3 96917990-3 BANCHILE SEGUROS DE VIDA S.A.
## 4 96933770-3 BBVA SEGUROS DE VIDA S.A.
## 5 96573600-K BCI SEGUROS VIDA S.A.
## 6 96656410-5 BICE VIDA COMPAÑIA DE SEGUROS S.A.
## 7 96837630-6 BNP PARIBAS CARDIF SEGUROS DE VIDA S.A.
## 8 76418751-2 BTG PACTUAL CHILE S.A. COMPAÑIA DE SEGUROS DE VIDA
## 9 76477116-8 CF SEGUROS DE VIDA S.A.
## 10 99185000-7 CHILENA CONSOLIDADA SEGUROS DE VIDA S.A.
## # ... with 26 more rows

You can use Developer Tools in any modern browser to monitor the Network requests to find that URL.
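
Once you've spotted the request in the Network tab, you can fetch that URL directly and sanity-check the response before parsing it. A hedged sketch with httr (the trailing _ cache-buster parameter is dropped here on the assumption the endpoint doesn't require it):

library(httr)
library(rvest)

url <- "http://www.svs.cl/institucional/mercados/consulta.php?mercado=S&Estado=VI&consulta=CSVID"

res <- GET(url, user_agent("Mozilla/5.0"))  # identify yourself politely
stop_for_status(res)                        # fail loudly on HTTP errors
http_type(res)                              # check it's "text/html" before parsing

content(res, as = "text") %>%
  read_html() %>%
  html_node("table") %>%
  html_table()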

How to scrape all rows from a dynamic table in html using python

You can read it directly with pandas' read_html() function, which parses the page's tables into DataFrames for you.

import pandas as pd

def main(url):
    # pages are numbered 1..3; the table of interest is the second (<table> index 1) on each page
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(df)

main("http://5000best.com/websites/Games/{}/")


CSV edit:

import pandas as pd

def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(f"Saving Page {item}")
        df.to_csv(f"page{item}.csv", index=False)

main("http://5000best.com/websites/Games/{}/")

Code updated for single DataFrame:

import pandas as pd

def main(url):
    goal = []
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        goal.append(df)
    final = pd.concat(goal)
    print(final)

main("http://5000best.com/websites/Games/{}/")

How to find a table with Web::Scraper based on cell values?

You need to use an XPath expression to look at the text content of the nodes.

This should do the trick

use Web::Scraper;

my $table = scraper {
    process '//table[tr[1]/th[1][normalize-space(text())="Content"]]/tr', 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};

It may look complex, but it's OK if you break it down.

It selects every <tr> element that is a child of any <table> element (anywhere beneath the root) whose first <tr>'s first <th> has text content equal to "Content" once normalized (leading and trailing whitespace stripped). The inner scraper then collects each row's <th> as the header and its <td> cells as the columns. The same expression is shown in an rvest sketch after the output below.

Output:

---
rows:
- cols:
- col-1
- col-n
header: Content
- cols:
- 2012
- 2001
header: Date
- cols:
- val-1
- val-n
header: Banana
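
The same XPath is portable to other tools. A hedged rvest sketch, assuming page holds a document parsed with read_html(); note that HTML parsers frequently insert an implicit <tbody>, in which case the paths would need tbody/tr instead of tr:

library(rvest)

# page is assumed to be the result of read_html() on the document in question
rows <- html_nodes(page,
  xpath = '//table[tr[1]/th[1][normalize-space(text())="Content"]]/tr')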

Scrape tables by passing multiple search requests using R

The call needs to be to https: and not http:. I also removed the plyr library and used just base R:

library(rvest)
fn <- c('HARVEY', 'HARVEY')
ln <- c('BIDWELL', 'ADELSON')
mydf <- data.frame(fn, ln)

get_data <- function(df) {
  root <- 'https://npiregistry.cms.hhs.gov/'
  u <- paste(root, 'registry/search-results-table?', 'first_name=', df[1],
             '&last_name=', df[2], sep = "")
  # encode the URL correctly
  url <- URLencode(u)
  # print(url)
  # extract data from the right table
  data <- read_html(url)
  newresult <- html_nodes(data, "table")[1] %>% html_table()
  # convert the result into a data frame
  newresult <- as.data.frame(newresult)
}

mydata <- apply(mydf, 1, get_data)
# mydata is a list of data frames; do.call creates a single data frame
finalanswer <- do.call(rbind, mydata)
# finalanswer needs some cleanup
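
If you have dplyr loaded anyway, bind_rows is a drop-in alternative to the do.call(rbind, ...) step, and it tolerates data frames whose columns don't line up exactly (a hedged equivalent, not a change to the logic above):

library(dplyr)

# same row-binding as do.call(rbind, mydata), but missing columns
# are filled with NA instead of raising an error
finalanswer <- bind_rows(mydata)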

Web scraping multiple tables from a single webpage

You can use pandas to pull those tables' data easily.

import pandas as pd
dfs = pd.read_html('https://www.sports-reference.com/cbb/players/jaden-ivey-1.html')[0:5]
print(dfs)

Output:

[    Season  School     Conf   G  GS    MP   FG  ...  STL  BLK  TOV   PF   PTS  Unnamed: 27    SOS
0  2020-21  Purdue  Big Ten  23  12  24.2  3.9  ...  0.7  0.7  1.3  1.7  11.1          NaN  11.23
1  2021-22  Purdue  Big Ten  36  34  31.4  5.6  ...  0.9  0.6  2.6  1.8  17.3          NaN   8.23
2   Career  Purdue      NaN  59  46  28.6  4.9  ...  0.8  0.6  2.1  1.7  14.9          NaN   9.73

[3 rows x 29 columns],     Season  School     Conf   G  GS    MP   FG   FGA  ...  DRB  TRB  AST  STL  BLK  TOV   PF   PTS
0  2020-21  Purdue  Big Ten  19  10  23.3  3.5   9.2  ...  2.7  3.6  2.1  0.8  0.7  1.4  1.6  10.3
1  2021-22  Purdue  Big Ten  19  17  32.6  5.5  12.8  ...  3.3  4.2  2.9  0.9  0.5  2.5  1.9  17.5
2   Career  Purdue      NaN  38  27  27.9  4.5  11.0  ...  3.0  3.9  2.5  0.9  0.6  1.9  1.8  13.9

[3 rows x 27 columns],     Season  School     Conf   G  GS    MP   FG  FGA  ...  DRB  TRB  AST  STL  BLK  TOV   PF  PTS
0  2020-21  Purdue  Big Ten  23  12   557   89  223  ...   57   76   43   17   16   31   39  256
1  2021-22  Purdue  Big Ten  36  34  1132  203  441  ...  152  176  110   33   20   94   63  624
2   Career  Purdue      NaN  59  46  1689  292  664  ...  209  252  153   50   36  125  102  880

[3 rows x 27 columns],     Season  School     Conf   G  GS    MP   FG  FGA  ...  DRB  TRB  AST  STL  BLK  TOV  PF  PTS
0  2020-21  Purdue  Big Ten  19  10   442   66  174  ...   51   68   39   15   13   26  31  195
1  2021-22  Purdue  Big Ten  19  17   620  104  244  ...   62   79   55   18   10   47  36  333
2   Career  Purdue      NaN  38  27  1062  170  418  ...  113  147   94   33   23   73  67  528

[3 rows x 27 columns],     Season  School     Conf   G  GS    MP   FG  ...  TRB  AST  STL  BLK  TOV   PF   PTS
0  2020-21  Purdue  Big Ten  23  12   557  6.4  ...  5.5  3.1  1.2  1.1  2.2  2.8  18.4
1  2021-22  Purdue  Big Ten  36  34  1132  7.2  ...  6.2  3.9  1.2  0.7  3.3  2.2  22.0
2   Career  Purdue      NaN  59  46  1689  6.9  ...  6.0  3.6  1.2  0.9  3.0  2.4  20.8

[3 rows x 25 columns]]

Web Scraping a Table Over Many Days

You're on a good track here; you need a more apt CSS or XPath selector. Using rvest, you can grab both with the same code if your selector is good enough:

library(rvest)

URL1 = "http://www.scoresandodds.com/grid_20161123.html"
URL2 = "http://www.scoresandodds.com/grid_20161125.html"

html1 <- URL1 %>% read_html()
df1 <- html1 %>% html_node('#nba ~ div table') %>% html_table()

html2 <- URL2 %>% read_html()
df2 <- html2 %>% html_node('#nba ~ div table') %>% html_table()

str(df1)
#> 'data.frame': 65 obs. of 7 variables:
#> $ Team : chr "7:05 PM EST" "701 PHOENIX SUNS" "702 ORLANDO MAGIC" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Open : chr "7:05 PM EST" "206.5" "-4.5" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Line Movements: chr "7:05 PM EST" "207.5 / 208 / 209.5" "-4 -15 / -4.5 / -4.5 -05" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Current : chr "7:05 PM EST" "210" "-4" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Moneyline : chr "7:05 PM EST" "+155" "-175" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Halftime : chr "7:05 PM EST" "109" "-4" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Scores : chr "7:05 PM EST" "92Under 210" "87final" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...

str(df2)
#> 'data.frame': 75 obs. of 7 variables:
#> $ Team : chr "1:05 PM EST" "701 SAN ANTONIO SPURS" "702 BOSTON CELTICS" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Open : chr "1:05 PM EST" "-2.5" "203.5" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Line Movements: chr "1:05 PM EST" "-3 / -3.5 -15 / -3.5" "199 / 200 / 201" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Current : chr "1:05 PM EST" "-3.5 -05" "201.5" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Moneyline : chr "1:05 PM EST" "-155" "+135" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Halftime : chr "1:05 PM EST" "-4.5" "106" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Scores : chr "1:05 PM EST" "109Over 201.5" "103final" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...

In this case, the CSS selector

  • looks for a node with an ID of nba, then
  • looks for a div after that, then
  • selects the table node inside of it.

You can write the same thing in XPath if you like, which would also let you use the XML package instead. If you want to up your CSS selector skills, the tutorial linked in ?rvest::html_node is fun and efficient.
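
For reference, a hedged sketch of the same selector written as XPath (untested against the live page):

df1 <- html1 %>%
  html_node(xpath = "//*[@id='nba']/following-sibling::div//table") %>%
  html_table()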

If you want to scrape a lot of similar URLs at once, you can put them in a vector and iterate over it with lapply, or more conveniently purrr::map_df. Scrape responsibly; it's kind to put a Sys.sleep call in the anonymous function so as to behave more like a normal site visitor.
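
A minimal sketch of that pattern, assuming the intermediate grid_YYYYMMDD pages exist and share the same layout:

library(rvest)
library(purrr)

dates <- c("20161123", "20161124", "20161125")  # 20161124 is assumed to exist
urls  <- sprintf("http://www.scoresandodds.com/grid_%s.html", dates)

all_days <- map_df(urls, function(u) {
  Sys.sleep(2)  # pause between requests to behave like a normal visitor
  u %>%
    read_html() %>%
    html_node('#nba ~ div table') %>%
    html_table()
})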


