Scraping HTML Tables into R Data Frames Using the XML Package

Scraping HTML tables into R data frames using the XML package

…or a shorter approach:

library(XML)
library(RCurl)
library(rlist)

# fetch the page (ssl.verifypeer = FALSE works around certificate verification issues)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
# parse every <table> on the page into a list of data frames
tables <- readHTMLTable(theurl)
# drop entries that could not be parsed
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# row count of each remaining table
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

The table picked below is the longest one on the page:

tables[[which.max(n.rows)]]
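
If the longest table is not the one you want, it can help to inspect the candidates first. A small sketch using base R on the tables list built above:

# row count and name of every parsed table, to choose one manually
sapply(tables, nrow)
names(tables)
tables[[1]]  # index by position (or name) once you know which table you want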

Scraping a Table into R using the XML package

Not sure about the package you tried, but here's a way to do it using rvest.

library(rvest)

# parse the page and pull every <table> into a list of data frames
raw <- read_html("https://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Wireless+Telephone+Service")
df <- raw %>% html_nodes("table") %>% html_table()
head(df)
[[1]]
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
1 Base-line 95 96 97 98 99 0 1 2 3 04 05 06 07
2 All Others NA NA NA NA NA NA NA NA NA 70 65 68 68
3 TracFone Wireless NA NA NA NA NA NA NA NA NA NM NM NM NM
4 T-Mobile NA NA NA NA NA NA NA NA NA NM 64 69 70
5 Verizon Wireless NA NA NA NA NA NA NA NA NA 68 67 69 71
6 Wireless Telephone Service NA NA NA NA NA NA NA NA NA 65 63 66 68
7 AT&T NA NA NA NA NA NA NA NA NA 63 62 63 68
8 U.S. Cellular NA NA NA NA NA NA NA NA NA NM NM NM NM
9 Sprint (T-Mobile) NA NA NA NA NA NA NA NA NA 59 63 63 61
10 Nextel Communications NA NA NA NA NA NA NA NA NA NM 59 #
11 AT&T Wireless NA NA NA NA NA NA NA NA NA 61 #
12 Sprint NA NA NA NA NA NA NA NA NA 59 63 63 61
X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
1 08 09 10 11 12 13 14 15 16 17 18 19 20 21 PreviousYear%Change
2 71 73 76 77 76 78 78 79 77 79 80 81 77 NA -4.9
3 NM NM NM NM NM NM NM 77 75 77 78 78 76 NA -2.6
4 71 71 73 70 69 68 69 70 74 73 76 76 75 NA -1.3
5 72 74 73 72 70 73 75 71 71 74 74 74 74 NA 0.0
6 68 69 72 71 70 72 72 70 71 73 74 75 74 NA -1.3
7 71 67 69 66 69 70 68 70 71 72 74 74 74 NA 0.0
8 NM NM NM NM NM NM NM NM 72 74 74 74 71 NA -4.1
9 56 63 70 72 71 71 68 65 70 73 70 69 70 NA 1.4
10 NA NA NA NA NA NA N/A
11 NA NA NA NA NA NA N/A
12 56 63 70 72 71 71 68 65 70 73 70 69 NA NA -1.4
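
Note that html_table() returns a list of data frames, so you will usually pull out the element you need. A small sketch; the header fix-up assumes the first row holds the year labels, as in the output above:

acsi <- df[[1]]                    # the first (and here only) table on the page
names(acsi) <- unlist(acsi[1, ])   # promote the year row to column names
acsi <- acsi[-1, ]                 # drop the header row from the data
head(acsi)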

Scraping HTML tables into R data frames

As @Henry Navarro has pointed out, it is not clear which nodes, etc., you need exactly. Finding the right nodes is a time-consuming task, so you need to specify which ones you want; SelectorGadget is useful for this purpose.

The following is a quick example of how you might generate the list of team websites that you will then loop through with rvest to extract information; a sketch of such a loop follows the example output below. I think the main piece of functionality you have been missing so far is html_attr(); see, e.g., this answer. Of course, you will still have to find the nodes on those sites to extract information on the stadium, etc.

library(rvest)

# `file` is the parsed league overview page, e.g. file <- read_html("<league-overview URL>")
file %>% 
  html_nodes("table") %>%
  { .[4] } %>%                     # the fourth table holds the club links
  html_nodes("a") %>%
  html_attr("href") %>%            # pull the href attribute of every link
  { .[grep("/startseite/verein", ., fixed = TRUE)] } %>%  # keep club homepage links only
  unique() %>%
  { paste0("https://www.transfermarkt.co.uk", .) }        # build absolute URLs

# [1] "https://www.transfermarkt.co.uk/fc-chelsea/startseite/verein/631/saison_id/2017"
# [2] "https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2017"
# [3] "https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985/saison_id/2017"
# [4] "https://www.transfermarkt.co.uk/tottenham-hotspur/startseite/verein/148/saison_id/2017"
#...
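
From here, a hedged sketch of the loop over those URLs; the ".stadium-selector" node below is a hypothetical placeholder, to be replaced by whatever selector you identify with SelectorGadget:

library(rvest)

# club_urls is assumed to be the character vector of club URLs produced above
get_team_info <- function(url) {
  page <- read_html(url)
  # ".stadium-selector" is hypothetical; find the real node with SelectorGadget
  stadium <- page %>% html_nodes(".stadium-selector") %>% html_text()
  data.frame(url = url, stadium = stadium[1], stringsAsFactors = FALSE)
}

team_info <- do.call(rbind, lapply(club_urls, get_team_info))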

Scraping an HTML table with images using the XML R package

I was able to find the XPath query to the image name using SelectorGadget:

library(XML)
library(RCurl)

# fetch and parse the page
d <- htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-422570"))
# XPath for the <img> tags inside elements with class "synonyms"
path <- '//*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img'

# extract the src attribute of each matched image
xpathSApply(d, path, xmlAttrs)["src", ]

[1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
[6] "/img/H.png" "/img/H.png"
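
Since the src values are relative, one small follow-up (a sketch building on the code above) is to prefix the site root to get full image URLs:

srcs <- xpathSApply(d, path, xmlAttrs)["src", ]
paste0("http://www.theplantlist.org", srcs)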

Web Scraping a table into R

You need to extract the correct html_nodes and then convert them into a data.frame. The code below is an example of how to go about doing something like this. I find SelectorGadget very useful for finding the right CSS selectors.

library(tidyverse)
library(rvest)

# read the html
html <- read_html('http://www.eliteprospects.com/iframe_player_stats.php?player=364033')

# function to read one column of the stats table
read_col <- function(x){
  col <- html %>%
    # CSS node found with SelectorGadget: the x-th cell of each row
    html_nodes(paste0("td:nth-child(", x, ")")) %>%
    html_text()
  return(col)
}

# apply the function to the columns of interest
col_list <- lapply(c(1:8, 10:15), read_col)

# collapse into a matrix
mat <- do.call(cbind, col_list)

# put the data rows into a data frame
df <- data.frame(mat[2:nrow(mat), ], stringsAsFactors = FALSE)

# assign names from the header row
names(df) <- mat[1, ]

df
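
All columns come back as character; a quick follow-up sketch using base R's type.convert coerces the numeric-looking ones:

# convert columns that look numeric; leave the rest as character
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)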

R - Extracting Tables From Websites Using XML Package

It seems the table on the website is loaded via JavaScript, but the site exposes the underlying XML feed, which you can parse directly. Try:

library(XML)

# parse the XML feed that backs the page
theurl <- "http://www.footballfanalytics.com/xml/esl/esl.xml"
doc <- xmlParse(theurl)
# pull team names and points out of the feed
cbind(team = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
      points = xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue))
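
cbind() returns a character matrix; if you want a ranked data frame instead, a small sketch building on the code above:

standings <- data.frame(
  team   = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
  points = as.numeric(xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue)),
  stringsAsFactors = FALSE
)
standings[order(-standings$points), ]  # highest points first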

Scraping a complex HTML table into a data.frame in R

Maybe like this:

library(XML)
library(rvest)

# html() comes from an older rvest, where it returned an XML-package-compatible document;
# current rvest uses read_html() instead (an xml2 document; see the xml2 sketch below)
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "Wilson, JamesJames Wilson" "Jay, JohnJohn Jay†" "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
# [5] "Rutledge, JohnJohn Rutledge" "Iredell, JamesJames Iredell"

# the names are doubled because each cell contains a hidden sort-key <span>; remove those spans
removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "James Wilson" "John Jay†" "William Cushing" "John Blair, Jr." "John Rutledge" "James Iredell"

