rvest: select and scrape html table(s) after a specific (title) string
library(rvest)
library(stringr)
doc <- read_html("https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d")
# extract all the nodes that have the title class (class = "tytul") or are a table
# the CSS selector "," acts like a boolean OR.
nodes <- doc %>% html_nodes(".tytul,table")
# loop through each node
signal <- FALSE
my_tables <- list()
j <- 0
for (i in seq_along(nodes)) {
  # if the title signal was previously set and this is a table tag
  if (signal && html_name(nodes[i]) == "table") {
    cat("Match..\n")
    # get the table (data frame)
    this_table <- html_table(nodes[i], fill = TRUE, header = TRUE)[[1]]
    # append to list
    j <- j + 1
    my_tables[[j]] <- this_table
    # and reset the signal so we search for the next one
    signal <- FALSE
  }
  # if the signal is clear, look for a matching title
  if (!signal) {
    signal <- nodes[i] %>% html_text() %>% str_detect("Tabela.+ICD 9")
  }
}
my_tables[[1]][1:5,]
my_tables[[2]][1:5,]
# > my_tables[[1]][1:5,]
# ICD 9 Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 2.051 Założenie płytki sztucznej do czaszki 168 32,31 7
# 2 1.247 Kraniotomia z usunięciem krwiaka podtwardówkowego 55 10,58 20
# 3 2.022 Odbarczenie złamania czaszki 43 8,27 6
# 4 2.040 Przeszczep kostny do kości czaszki 35 6,73 8
# 5 1.093 Inne aspiracje w zakresie czaszki 33 6,35 5
# > my_tables[[2]][1:5,]
# ICD 9 Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 O35 Sód (Na) 239 45,96 8
# 2 89.00 Porada lekarska, konsultacja, asysta 230 44,23 9
# 3 N45 Potas (K) 217 41,73 8
# 4 87.030 TK głowy bez kontrastu 214 41,15 9
# 5 89.04 Opieka pielęgniarki lub położnej 202 38,85 8
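The signal/flag walk above is language-independent; here is a minimal Python stand-in using only the standard library, with a made-up HTML fragment and a plain substring test in place of the str_detect regex:

```python
from html.parser import HTMLParser

# Hypothetical fragment mimicking the page layout: titled headings
# followed by tables, in document order.
html = """
<p class="tytul">Tabela 1 ICD 9</p>
<table><tr><td>A</td></tr></table>
<p class="tytul">Other heading</p>
<table><tr><td>B</td></tr></table>
"""

class TitleTableScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.signal = False        # armed once a matching title is seen
        self.in_title = False
        self.capture_table = False
        self.tables = []           # text content of captured tables
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "tytul") in attrs:
            self.in_title = True
            self._buf = []
        elif tag == "table" and self.signal:
            self.capture_table = True
            self._buf = []

    def handle_data(self, data):
        if self.in_title or self.capture_table:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_title:
            self.in_title = False
            if "Tabela" in "".join(self._buf):  # stand-in for the regex
                self.signal = True
        elif tag == "table" and self.capture_table:
            self.capture_table = False
            self.signal = False    # reset, so we search for the next title
            self.tables.append("".join(self._buf).strip())

scanner = TitleTableScanner()
scanner.feed(html)
print(scanner.tables)  # only the table after the matching title
```

Only the table immediately following a matching title is kept; the second table is skipped because its heading fails the test.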
Web Scraping, extract table of a page
The page makes an XHR request to another resource, and that response is what actually builds the table.
library(rvest)
library(dplyr)
pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")
html_nodes(pg, "table") %>%
html_table() %>%
.[[1]] %>%
tbl_df() %>%
select(1:2)
## # A tibble: 36 × 2
## R.U.T. Entidad
## <chr> <chr>
## 1 99588060-1 ACE SEGUROS DE VIDA S.A.
## 2 76511423-3 ALEMANA SEGUROS S.A.
## 3 96917990-3 BANCHILE SEGUROS DE VIDA S.A.
## 4 96933770-3 BBVA SEGUROS DE VIDA S.A.
## 5 96573600-K BCI SEGUROS VIDA S.A.
## 6 96656410-5 BICE VIDA COMPAÑIA DE SEGUROS S.A.
## 7 96837630-6 BNP PARIBAS CARDIF SEGUROS DE VIDA S.A.
## 8 76418751-2 BTG PACTUAL CHILE S.A. COMPAÑIA DE SEGUROS DE VIDA
## 9 76477116-8 CF SEGUROS DE VIDA S.A.
## 10 99185000-7 CHILENA CONSOLIDADA SEGUROS DE VIDA S.A.
## # ... with 26 more rows
You can use Developer Tools in any modern browser to monitor the Network requests to find that URL.
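Once you've found such an XHR URL in the Network tab, it can help to pull its query string apart to see which parameters drive the table; a standard-library Python sketch using the URL above:

```python
from urllib.parse import urlparse, parse_qs

# Dissect the XHR URL found via the Network tab into its query
# parameters (parse_qs returns a list of values per key).
url = ("http://www.svs.cl/institucional/mercados/consulta.php"
       "?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")
params = parse_qs(urlparse(url).query)
print(params["mercado"], params["Estado"], params["consulta"])
```

From here you can vary individual parameters to see which ones change the returned table.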
How to scrape all rows from a dynamic table in html using python
You can read it directly using the pandas.read_html() function, which parses each table on the page into a DataFrame for you.
import pandas as pd

def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(df)

main("http://5000best.com/websites/Games/{}/")
Sample of output:
CSV edit:
import pandas as pd

def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(f"Saving Page {item}")
        df.to_csv(f"page{item}.csv", index=False)

main("http://5000best.com/websites/Games/{}/")
Code updated for a single DataFrame:
import pandas as pd

def main(url):
    goal = []
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        goal.append(df)
    final = pd.concat(goal)
    print(final)

main("http://5000best.com/websites/Games/{}/")
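The same accumulate-then-concatenate pattern can be sketched without pandas or the network, using made-up per-page rows in place of the pd.read_html results:

```python
# Stand-in rows for what pd.read_html(url.format(item))[1] would return
# on each page; the data here is made up for illustration.
base = "http://5000best.com/websites/Games/{}/"
fake_pages = {1: [["a", 10]], 2: [["b", 20]], 3: [["c", 30]]}

goal = []
for item in range(1, 4):
    url = base.format(item)        # same URL templating as above
    goal.extend(fake_pages[item])  # analogue of goal.append + pd.concat

print(url)   # last page URL built
print(goal)  # all rows pooled into one structure
```

The `{}` placeholder in the base URL is what `url.format(item)` fills with the page number on each pass.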
How to find a table with Web::Scraper based on cell values?
You need to use an XPath expression to look at the text content of the nodes.
This should do the trick
my $table = scraper {
    process '//table[tr[1]/th[1][normalize-space(text())="Content"]]/tr', 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};
It may look complex, but it's OK if you break it down. It selects all <tr> elements that are children of any <table> element beneath the root for which the first <th> element of the first <tr> element contains text equal to "Content" when normalized (leading and trailing spaces stripped).
Output:
---
rows:
  - cols:
      - col-1
      - col-n
    header: Content
  - cols:
      - 2012
      - 2001
    header: Date
  - cols:
      - val-1
      - val-n
    header: Banana
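A rough Python analogue of that XPath condition, using the standard library's ElementTree on a made-up, well-formed fragment (ElementTree's limited XPath has no normalize-space, so the trimming is done in Python):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one table whose first header cell is "Content"
# (with stray whitespace), and one table that should be skipped.
html = """<root>
<table><tr><th> Content </th><th>Date</th></tr><tr><td>v1</td></tr></table>
<table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
</root>"""

root = ET.fromstring(html)
rows = []
for table in root.iter("table"):
    # first <th> of the first <tr>, as in the XPath predicate
    first_th = table.find("./tr[1]/th[1]")
    if first_th is not None and (first_th.text or "").strip() == "Content":
        rows.extend(table.findall("./tr"))
print(len(rows))  # rows of the matching table only
```

Only the first table passes the check, so its two rows are collected and the second table is ignored.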
Scrape tables by passing multiple search requests using R
The call needs to be to https:, not http:. I also removed the plyr library and used just base R:
library(rvest)
fn = rep(c('HARVEY','HARVEY'));
ln = rep(c('BIDWELL','ADELSON'));
mydf = data.frame(fn, ln);
get_data = function(df){
  root = 'https://npiregistry.cms.hhs.gov/'
  u = paste(root, 'registry/search-results-table?', 'first_name=', df[1],
            '&last_name=', df[2], sep = "");
  # encode url correctly
  url = URLencode(u);
  #print(url)
  # extract data from the right table
  data = read_html(url);
  newresult <- html_nodes(data, "table")[1] %>% html_table()
  # convert result into a data frame
  newresult <- as.data.frame(newresult)
}
mydata = apply(mydf, 1, function(x) { get_data(x)})
#mydata is a list of data frames, do.call creates a single data.frame
finalanswer<-do.call(rbind, mydata)
#finalanswer needs some clean up.
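The paste/URLencode step is the part most likely to bite; for comparison, a Python sketch that builds the same query string with urllib.parse.urlencode (the endpoint path is taken from the answer above):

```python
from urllib.parse import urlencode

# Build the search URL by encoding the name pair as query parameters
# instead of pasting strings by hand.
root = "https://npiregistry.cms.hhs.gov/registry/search-results-table"

def build_url(first, last):
    return root + "?" + urlencode({"first_name": first, "last_name": last})

urls = [build_url("HARVEY", ln) for ln in ("BIDWELL", "ADELSON")]
print(urls[0])
```

urlencode also percent-escapes any characters that would otherwise break the URL, which is what URLencode() handles in the R version.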
Web scraping multiple tables from a single webpage
You can use pandas to pull the data from those tables easily.
import pandas as pd
df = pd.read_html('https://www.sports-reference.com/cbb/players/jaden-ivey-1.html')[0:5]
print(df)
Output:
[ Season School Conf G GS MP FG ... STL BLK TOV PF PTS Unnamed: 27
SOS
0 2020-21 Purdue Big Ten 23 12 24.2 3.9 ... 0.7 0.7 1.3 1.7 11.1 NaN 11.23
1 2021-22 Purdue Big Ten 36 34 31.4 5.6 ... 0.9 0.6 2.6 1.8 17.3 NaN 8.23
2 Career Purdue NaN 59 46 28.6 4.9 ... 0.8 0.6 2.1 1.7 14.9 NaN 9.73
[3 rows x 29 columns], Season School Conf G GS MP FG FGA ... DRB TRB AST STL BLK TOV PF PTS
0 2020-21 Purdue Big Ten 19 10 23.3 3.5 9.2 ... 2.7 3.6 2.1 0.8 0.7 1.4 1.6
10.3
1 2021-22 Purdue Big Ten 19 17 32.6 5.5 12.8 ... 3.3 4.2 2.9 0.9 0.5 2.5 1.9
17.5
2 Career Purdue NaN 38 27 27.9 4.5 11.0 ... 3.0 3.9 2.5 0.9 0.6 1.9 1.8
13.9
[3 rows x 27 columns], Season School Conf G GS MP FG FGA ... DRB TRB AST STL BLK TOV PF PTS
0 2020-21 Purdue Big Ten 23 12 557 89 223 ... 57 76 43 17 16 31 39 256
1 2021-22 Purdue Big Ten 36 34 1132 203 441 ... 152 176 110 33 20 94 63 624
2 Career Purdue NaN 59 46 1689 292 664 ... 209 252 153 50 36 125 102 880
[3 rows x 27 columns], Season School Conf G GS MP FG FGA ... DRB TRB AST STL BLK TOV PF PTS
0 2020-21 Purdue Big Ten 19 10 442 66 174 ... 51 68 39 15 13 26 31 195
1 2021-22 Purdue Big Ten 19 17 620 104 244 ... 62 79 55 18 10 47 36 333
2 Career Purdue NaN 38 27 1062 170 418 ... 113 147 94 33 23 73 67 528
[3 rows x 27 columns], Season School Conf G GS MP FG ... TRB AST STL BLK TOV PF PTS
0 2020-21 Purdue Big Ten 23 12 557 6.4 ... 5.5 3.1 1.2 1.1 2.2 2.8 18.4
1 2021-22 Purdue Big Ten 36 34 1132 7.2 ... 6.2 3.9 1.2 0.7 3.3 2.2 22.0
2 Career Purdue NaN 59 46 1689 6.9 ... 6.0 3.6 1.2 0.9 3.0 2.4 20.8
[3 rows x 25 columns]]
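Note that pandas.read_html returns an ordinary Python list of DataFrames, so the `[0:5]` above is plain list slicing; a stand-in sketch with strings instead of real tables:

```python
# Pretend read_html found 8 tables on the page (made-up stand-ins);
# slicing keeps the first five, just as in the answer above.
tables = [f"table{i}" for i in range(8)]
first_five = tables[0:5]
print(first_five)
```

Indexing with `[0]` instead would give just the first table as a single DataFrame.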
Web Scraping a Table Over Many Days
You're on a good track here; you need a more apt CSS or XPath selector. Using rvest, you can grab both with the same code if your selector is good enough:
library(rvest)
URL1 = "http://www.scoresandodds.com/grid_20161123.html"
URL2 = "http://www.scoresandodds.com/grid_20161125.html"
html1 <- URL1 %>% read_html()
df1 <- html1 %>% html_node('#nba ~ div table') %>% html_table()
html2 <- URL2 %>% read_html()
df2 <- html2 %>% html_node('#nba ~ div table') %>% html_table()
str(df1)
#> 'data.frame': 65 obs. of 7 variables:
#> $ Team : chr "7:05 PM EST" "701 PHOENIX SUNS" "702 ORLANDO MAGIC" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Open : chr "7:05 PM EST" "206.5" "-4.5" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Line Movements: chr "7:05 PM EST" "207.5 / 208 / 209.5" "-4 -15 / -4.5 / -4.5 -05" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Current : chr "7:05 PM EST" "210" "-4" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Moneyline : chr "7:05 PM EST" "+155" "-175" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Halftime : chr "7:05 PM EST" "109" "-4" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
#> $ Scores : chr "7:05 PM EST" "92Under 210" "87final" "PHO-F-T.J. Warren-? | TV: FS-Florida, DTV: 654" ...
str(df2)
#> 'data.frame': 75 obs. of 7 variables:
#> $ Team : chr "1:05 PM EST" "701 SAN ANTONIO SPURS" "702 BOSTON CELTICS" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Open : chr "1:05 PM EST" "-2.5" "203.5" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Line Movements: chr "1:05 PM EST" "-3 / -3.5 -15 / -3.5" "199 / 200 / 201" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Current : chr "1:05 PM EST" "-3.5 -05" "201.5" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Moneyline : chr "1:05 PM EST" "-155" "+135" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Halftime : chr "1:05 PM EST" "-4.5" "106" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
#> $ Scores : chr "1:05 PM EST" "109Over 201.5" "103final" "TV: CSN-New England, FS-Southwest, DTV: 642, 676" ...
In this case, the CSS selector:
- looks for a node with an ID of nba, then
- looks for a div after that, then
- selects the table node inside of it.
You can write the same thing in XPath, if you like, which would let you use the XML package, if you really like. If you want to up your CSS selector skills, the tutorial linked in ?rvest::html_node
is fun and efficient.
If you want to scrape a lot of similar URLs at once, you can put them in a vector and iterate over it with lapply
, or more conveniently purrr::map_df
. Scrape responsibly; it's kind to put a Sys.sleep
call in the anonymous function so as to behave more like a normal site visitor.
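Since the per-day URLs follow a date pattern, the vector of URLs can be generated rather than typed; a Python sketch of the idea (in R you might use seq.Date with sprintf), with the polite pause noted as a comment:

```python
from datetime import date, timedelta

# Generate one grid URL per day in the range; when actually fetching,
# add a pause between requests (e.g. time.sleep), per the advice above.
start, end = date(2016, 11, 23), date(2016, 11, 25)
urls = []
d = start
while d <= end:
    urls.append(f"http://www.scoresandodds.com/grid_{d:%Y%m%d}.html")
    d += timedelta(days=1)
print(urls)
```

Each generated URL can then be fed through the same read_html/html_node pipeline shown above.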