How to Scrape a Table with Rvest and Xpath

How to scrape a table with rvest and xpath?

That website doesn't use an HTML table, so html_table() can't find anything. It actually uses div elements with the classes column and data lastcolumn.

So you can do something like

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation_col <- url %>%
read_html() %>%
html_nodes(xpath='//*[@class="column"]')

valuation_data <- url %>%
read_html() %>%
html_nodes(xpath='//*[@class="data lastcolumn"]')

Or even

url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="section"]')

That should get you most of the way there.
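
For instance, a minimal sketch that pairs the labels with the values. The one-to-one pairing is an assumption on my part, and the page layout may have changed since this was written:

library(rvest)
library(tibble)

page <- read_html("http://www.marketwatch.com/investing/stock/IRS/profile")

labels <- page %>%
  html_nodes(xpath = '//*[@class="column"]') %>%
  html_text(trim = TRUE)
values <- page %>%
  html_nodes(xpath = '//*[@class="data lastcolumn"]') %>%
  html_text(trim = TRUE)

# Assumes one value per label; indexes the labels by the values
# in case the page carries extra header cells
valuation <- tibble(label = labels[seq_along(values)], value = values)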

Please also read their terms of use, particularly section 3.4.

Scrape data table using xpath in R

The data is stored as JSON. Here is a method to download and process that file.

library(httr)
library(jsonlite)

# URL for week 6 data
url <- "https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=6"

# Create a user agent
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"

# Download the information, passing the user agent and Referer headers
resp <- GET(url, verbose(), user_agent(ua),
            add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))
answer <- fromJSON(content(resp, as = "text"), flatten = TRUE)
answer$stats
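
If you want more than one week, a sketch along the same lines. I am assuming the week parameter behaves the same for every value and that each week returns the same columns; get_week is a hypothetical helper, and ua is the user agent defined above:

library(purrr)

get_week <- function(week) {
  url <- sprintf(
    "https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=%d",
    week)
  resp <- GET(url, user_agent(ua),
              add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))
  fromJSON(content(resp, as = "text"), flatten = TRUE)$stats
}

# Weeks 1 through 6 bound into one data frame
all_weeks <- map_dfr(1:6, get_week)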

R - How to scrape value from a table cell using xpath?

It's inside an HTML comment, right? Extract the comment's text and parse it as HTML a second time:

library(rvest)

url <- 'https://www.baseball-reference.com/players/b/brownro02.shtml'

# Pull the comment out of its wrapper div, then re-parse its text
tab <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="all_batting_value"]//comment()') %>%
  html_text() %>%
  read_html() %>%
  html_table() %>%
  as.data.frame()

tab
Year Age Tm Lg G PA Rbat Rbaser Rdp Rfield Rpos RAA WAA Rrep RAR WAR waaWL. X162WL. oWAR dWAR oRAR Salary Pos
1 1999 23 CHC NL 33 70 -4 0 0 -3 0 -8 -0.8 2 -5 -0.5 0.478 0.495 -0.3 -0.3 -3 7/89
2 2000 24 CHC NL 45 98 4 0 0 0 -1 3 0.3 3 6 0.6 0.507 0.502 0.6 -0.2 7 $210,000 7/98
3 2001 25 CHC NL 39 92 2 0 0 0 -1 0 0.0 3 3 0.3 0.500 0.500 0.3 -0.2 3 $230,000 7/D98
4 2002 26 CHC NL 111 231 -11 -1 0 -3 -2 -16 -1.7 7 -9 -1.0 0.485 0.490 -0.7 -0.6 -6 $255,000 78/9D
5 4 Seasons 4 Seasons 4 Seasons 228 491 -9 -1 0 -6 -4 -21 -2.2 15 -5 -0.8 0.491 0.495 -0.1 -1.2 1 $695,000
Awards
1 NA
2 NA
3 NA
4 NA
5 NA
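
Baseball-Reference hides many of its tables inside comments like this. As a sketch, assuming each matching comment wraps a single <table>, you can pull every commented-out table on the page:

library(rvest)

page <- read_html('https://www.baseball-reference.com/players/b/brownro02.shtml')

# Grab every comment node, keep the ones that contain table markup,
# and parse each one as its own document
comments <- page %>% html_nodes(xpath = '//comment()') %>% html_text()
has_table <- grepl("<table", comments, fixed = TRUE)
hidden_tables <- lapply(comments[has_table], function(x) {
  read_html(x) %>% html_node("table") %>% html_table()
})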

Scraping table with rvest and xpath for location mapping

I am new to R, but something like the following should work: define a function that retrieves the row info from a given URL as a data frame, loop over however many pages you want, and merge the returned data frames into one big one. Since the node lists are not always the same length (e.g. not every listing has a telephone number), you need to test whether each element is present while looping over the rows. I use the method from the answer by alistaire (+1 to him).

I am using CSS selectors rather than XPath.

Given the number of possible pages, I would look into using an HTTP session so you get the efficiency of re-using a connection. I use sessions in other languages; from a quick Google it seems R provides this too, for example with html_session (a sketch follows the code below).

I would welcome suggestions for improvement and any edits for correcting indentation. I'm learning as I go.

library(rvest)
library(magrittr)
library(purrr)

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page="

get_listings <- function(url){
  df <- read_html(url) %>%
    html_nodes('.views-row') %>%
    map_df(~list(
      title = html_node(.x, '.service-card__title a') %>% html_text(),
      location = trimws(gsub('\n', ' ', html_text(html_node(.x, '.service-card__address')))) %>%
        {if(length(.) == 0) NA else .},
      telephone = html_node(.x, '.service-card__phone') %>% html_text() %>%
        {if(length(.) == 0) NA else .}
    ))
  return(df)
}

pages_to_loop <- 2

for(i in seq_len(pages_to_loop)){
  new_url <- paste0(url, i)
  if(i == 1){
    df <- get_listings(new_url)
  } else {
    new_df <- get_listings(new_url)
    df <- rbind(df, new_df)
  }
}
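
And a sketch of the session idea mentioned above. Hedged: I have only skimmed the docs; session() and session_jump_to() are the rvest >= 1.0 names, while older versions used html_session() and jump_to():

library(rvest)

# One session object re-uses the underlying connection between requests
sess <- session(paste0(url, 1))
page_1 <- read_html(sess)

sess <- session_jump_to(sess, paste0(url, 2))
page_2 <- read_html(sess)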

How can I scrape a table from a website in R

This is a dynamic page: the table is generated by JavaScript, so rvest alone will not suffice. Nonetheless, you can get the source content from the JSON API.

library(tidyverse)
library(rvest)
library(lubridate)
library(jsonlite)

# Read static html. It won't create the table, but it holds the API key
# we need to retrieve the source JSON.

htm_obj <-
  read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')

# Retrieve the API key. This key is stored in a node with javascript content.
str_apikey <-
  html_node(htm_obj, xpath = '//script[@id="app-root-state"]') %>%
  html_text() %>%
  gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", .)

# Create a URL pointing to the API, with the API key as the first
# key-value pair of the query
url_apijson <- paste0(
  "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
  str_apikey,
  "&units=e&startDate=20110101&endDate=20110101")

# Capture the JSON
json_obj <- fromJSON(txt = url_apijson)

# Wrangle the JSON's contents into the table you need
tbl_daily <-
  json_obj$observations %>%
  as_tibble() %>%
  mutate(valid_time_gmt = as_datetime(valid_time_gmt) %>%
           with_tz("America/New_York")) %>% # the timezone this airport (KDCA) is located in
  select(valid_time_gmt, temp, dewPt, rh, wdir_cardinal, gust,
         pressure, precip_hrly) # the equivalent variables of your html table

Results: A nice table

# A tibble: 34 x 8
valid_time_gmt temp dewPt rh wdir_cardinal gust pressure precip_hrly
<dttm> <int> <int> <int> <chr> <lgl> <dbl> <dbl>
1 2010-12-31 23:52:00 38 NA 79 CALM NA 30.1 NA
2 2011-01-01 00:52:00 35 31 85 CALM NA 30.1 NA
3 2011-01-01 01:52:00 36 31 82 CALM NA 30.1 NA
4 2011-01-01 02:52:00 37 31 79 CALM NA 30.1 NA
5 2011-01-01 03:52:00 36 30 79 CALM NA 30.1 NA
6 2011-01-01 04:52:00 37 30 76 NNE NA 30.1 NA
7 2011-01-01 05:52:00 36 30 79 CALM NA 30.1 NA
8 2011-01-01 06:52:00 34 30 85 CALM NA 30.1 NA
9 2011-01-01 07:52:00 37 31 79 CALM NA 30.1 NA
10 2011-01-01 08:52:00 44 38 79 CALM NA 30.1 NA
# ... with 24 more rows
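
The startDate/endDate pair in the query string controls the window, so pulling a longer range is just a matter of changing the dates. A sketch; I am assuming the API accepts multi-day ranges and have not checked its limits:

# Same query, one week instead of one day
url_week <- paste0(
  "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
  str_apikey,
  "&units=e&startDate=20110101&endDate=20110107")
json_week <- fromJSON(txt = url_week)
json_week$observations %>% as_tibble()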

Scrape a table with rvest in R that has mismatched table headings

Here's one solution. An explanation follows.

library(rvest)
library(tidyverse)

# url as defined in the question
read_html(url) %>%
  html_nodes("table") %>%
  html_table(header = TRUE) %>%
  simplify() %>%
  first() %>%
  setNames(paste0(colnames(.), as.character(.[1, ]))) %>%
  slice(-1)

Output of glimpse():

Observations: 25
Variables: 16
$ Rank <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"…
$ Player <chr> "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL", "Deshaun W…
$ Opp <chr> "@MIA", "NYG", "@NO", "@ARI", "@JAX", "@PHI", "PIT", "WAS", "…
$ PassingYds <chr> "324", "405", "268", "385", "378", "380", "341", "313", "248"…
$ PassingTD <chr> "5", "4", "3", "3", "3", "3", "3", "3", "3", "3", "2", "2", "…
$ PassingInt <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "1", "1", "1", "…
$ RushingYds <chr> "6", "12", "40", "22", "2", "-", "-", "5", "24", "6", "13", "…
$ RushingTD <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingRec <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingYds <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingTD <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ RetTD <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ MiscFumTD <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ Misc2PT <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "1", "-", "…
$ FumLost <chr> "-", "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ FantasyPoints <chr> "33.56", "33.40", "30.72", "27.60", "27.32", "27.20", "25.64"…

Explanation
From ?html_table docs:

html_table currently makes a few assumptions:

  • No cells span multiple rows
  • Headers are in the first row

Part of your problem is solved by setting header = TRUE in html_table().

Another part of the problem is that the header cells span two rows, which html_table() does not expect.

Assuming you don't want to lose the information in either header row, you can:

  1. Use simplify and first to pull out the data frame from the list you get from html_table
  2. Use setNames to merge the two header rows (which are now the data frame columns and the first row)
  3. Remove the first row (now redundant) with slice (a toy illustration of steps 2 and 3 follows)
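
A toy illustration of steps 2 and 3, with made-up two-column data just to show the mechanics:

library(dplyr)

df <- data.frame(Passing = c("Yds", "324"),
                 Rushing = c("Yds", "6"))

# Paste the column names onto the first data row, then drop that row
df %>%
  setNames(paste0(colnames(.), as.character(.[1, ]))) %>%
  slice(-1)
#>   PassingYds RushingYds
#> 1        324          6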

Scraping similarly named tables using rvest

You can select all of the tables at once with a starts-with() predicate on the id attribute in your XPath expression, then pick them out one at a time by wrapping the expression in parentheses and indexing it. The code could be:

library(rvest)
URL <- "https://fbref.com/en/squads/822bd0ba/Liverpool"
WS <- read_html(URL)

# Count the matching tables, then fetch each one by its position
n_tables <- length(html_nodes(WS, xpath = "//table[starts-with(@id,'ks_sched_')]"))

results <- list()
for (i in seq_len(n_tables)) {
  path <- paste0('(//table[starts-with(@id,"ks_sched_")])[', i, ']')
  results[[i]] <- WS %>% html_nodes(xpath = path) %>% html_table() %>% data.frame()
}

We get the number of tables with length, then use a for loop to build a new XPath for each index with paste0 and store the results in a list.
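
Since html_table() also works on a whole node set, the same result fits in a single pipe, as a sketch:

results <- WS %>%
  html_nodes(xpath = "//table[starts-with(@id,'ks_sched_')]") %>%
  html_table() %>%
  lapply(data.frame)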

Output: a list of 7 data frames
