Scraping HTML Tables into R Data Frames Using the XML Package

Scraping HTML tables into R data frames using the XML package

…or a shorter approach:

library(XML)
library(RCurl)
library(rlist)

# fetch the page (ssl.verifypeer = FALSE works around certificate verification issues)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
# parse every <table> on the page into a list of data frames
tables <- readHTMLTable(theurl)
# drop entries that could not be parsed
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# row count of each remaining table
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

The table picked below is the longest one on the page:

tables[[which.max(n.rows)]]
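
If the longest table is not the one you want, it can help to inspect the candidates first. A small sketch using base R on the tables list built above:

# row count and name of every parsed table, to choose one manually
sapply(tables, nrow)
names(tables)
tables[[1]]  # index by position (or name) once you know which table you want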

Scraping a Table into R using the XML package

Not sure about the package you tried, but here's a way to do it using rvest.

library(rvest)

# parse the page and pull every <table> into a list of data frames
raw <- read_html("https://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Wireless+Telephone+Service")
df <- raw %>% html_nodes("table") %>% html_table()
head(df)
[[1]]
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
1 Base-line 95 96 97 98 99 0 1 2 3 04 05 06 07
2 All Others NA NA NA NA NA NA NA NA NA 70 65 68 68
3 TracFone Wireless NA NA NA NA NA NA NA NA NA NM NM NM NM
4 T-Mobile NA NA NA NA NA NA NA NA NA NM 64 69 70
5 Verizon Wireless NA NA NA NA NA NA NA NA NA 68 67 69 71
6 Wireless Telephone Service NA NA NA NA NA NA NA NA NA 65 63 66 68
7 AT&T NA NA NA NA NA NA NA NA NA 63 62 63 68
8 U.S. Cellular NA NA NA NA NA NA NA NA NA NM NM NM NM
9 Sprint (T-Mobile) NA NA NA NA NA NA NA NA NA 59 63 63 61
10 Nextel Communications NA NA NA NA NA NA NA NA NA NM 59 #
11 AT&T Wireless NA NA NA NA NA NA NA NA NA 61 #
12 Sprint NA NA NA NA NA NA NA NA NA 59 63 63 61
X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
1 08 09 10 11 12 13 14 15 16 17 18 19 20 21 PreviousYear%Change
2 71 73 76 77 76 78 78 79 77 79 80 81 77 NA -4.9
3 NM NM NM NM NM NM NM 77 75 77 78 78 76 NA -2.6
4 71 71 73 70 69 68 69 70 74 73 76 76 75 NA -1.3
5 72 74 73 72 70 73 75 71 71 74 74 74 74 NA 0.0
6 68 69 72 71 70 72 72 70 71 73 74 75 74 NA -1.3
7 71 67 69 66 69 70 68 70 71 72 74 74 74 NA 0.0
8 NM NM NM NM NM NM NM NM 72 74 74 74 71 NA -4.1
9 56 63 70 72 71 71 68 65 70 73 70 69 70 NA 1.4
10 NA NA NA NA NA NA N/A
11 NA NA NA NA NA NA N/A
12 56 63 70 72 71 71 68 65 70 73 70 69 NA NA -1.4
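
Note that html_table() returns a list of data frames, so you will usually pull out the element you need. A small sketch; the header fix-up assumes the first row holds the year labels, as in the output above:

acsi <- df[[1]]                    # the first (and here only) table on the page
names(acsi) <- unlist(acsi[1, ])   # promote the year row to column names
acsi <- acsi[-1, ]                 # drop the header row from the data
head(acsi)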

Scraping HTML tables into R data frames

As @Henry Navarro has pointed out, it is not clear which nodes, etc., you need exactly. Finding the right nodes is a time-consuming task, so you need to specify which ones you want; SelectorGadget is useful for this purpose.

The following is a quick example of how you might generate the list of team websites that you will then loop through with rvest to extract information; a sketch of such a loop follows the example output below. I think the main piece of functionality you have been missing so far is html_attr(); see, e.g., this answer. Of course, you will still have to find the nodes on those sites to extract information on the stadium, etc.

library(rvest)

# `file` is the parsed league overview page, e.g. file <- read_html("<league-overview URL>")
file %>% 
  html_nodes("table") %>%
  { .[4] } %>%                     # the fourth table holds the club links
  html_nodes("a") %>%
  html_attr("href") %>%            # pull the href attribute of every link
  { .[grep("/startseite/verein", ., fixed = TRUE)] } %>%  # keep club homepage links only
  unique() %>%
  { paste0("https://www.transfermarkt.co.uk", .) }        # build absolute URLs

# [1] "https://www.transfermarkt.co.uk/fc-chelsea/startseite/verein/631/saison_id/2017"
# [2] "https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2017"
# [3] "https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985/saison_id/2017"
# [4] "https://www.transfermarkt.co.uk/tottenham-hotspur/startseite/verein/148/saison_id/2017"
#...
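
From here, a hedged sketch of the loop over those URLs; the ".stadium-selector" node below is a hypothetical placeholder, to be replaced by whatever selector you identify with SelectorGadget:

library(rvest)

# club_urls is assumed to be the character vector of club URLs produced above
get_team_info <- function(url) {
  page <- read_html(url)
  # ".stadium-selector" is hypothetical; find the real node with SelectorGadget
  stadium <- page %>% html_nodes(".stadium-selector") %>% html_text()
  data.frame(url = url, stadium = stadium[1], stringsAsFactors = FALSE)
}

team_info <- do.call(rbind, lapply(club_urls, get_team_info))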

Scraping an HTML table with images using the XML R package

I was able to find the XPath query to the image name using SelectorGadget:

library(XML)
library(RCurl)

# fetch and parse the page
d <- htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-422570"))
# XPath for the <img> tags inside elements with class "synonyms"
path <- '//*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img'

# extract the src attribute of each matched image
xpathSApply(d, path, xmlAttrs)["src", ]

[1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
[6] "/img/H.png" "/img/H.png"
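
Since the src values are relative, one small follow-up (a sketch building on the code above) is to prefix the site root to get full image URLs:

srcs <- xpathSApply(d, path, xmlAttrs)["src", ]
paste0("http://www.theplantlist.org", srcs)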

Web Scraping a table into R

You need to extract the correct html_nodes and then convert them into a data.frame. The code below is an example of how to go about doing something like this. I find SelectorGadget very useful for finding the right CSS selectors.

library(tidyverse)
library(rvest)

# read the html
html <- read_html('http://www.eliteprospects.com/iframe_player_stats.php?player=364033')

# function to read one column of the stats table
read_col <- function(x){
  col <- html %>%
    # CSS node found with SelectorGadget: the x-th cell of each row
    html_nodes(paste0("td:nth-child(", x, ")")) %>%
    html_text()
  return(col)
}

# apply the function to the columns of interest
col_list <- lapply(c(1:8, 10:15), read_col)

# collapse into a matrix
mat <- do.call(cbind, col_list)

# put the data rows into a data frame
df <- data.frame(mat[2:nrow(mat), ], stringsAsFactors = FALSE)

# assign names from the header row
names(df) <- mat[1, ]

df
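
All columns come back as character; a quick follow-up sketch using base R's type.convert coerces the numeric-looking ones:

# convert columns that look numeric; leave the rest as character
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)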

R - Extracting Tables From Websites Using XML Package

It seems the table on the website is loaded via JavaScript, but the site exposes the underlying XML feed, which you can parse directly. Try:

library(XML)

# parse the XML feed that backs the page
theurl <- "http://www.footballfanalytics.com/xml/esl/esl.xml"
doc <- xmlParse(theurl)
# pull team names and points out of the feed
cbind(team = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
      points = xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue))
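
cbind() returns a character matrix; if you want a ranked data frame instead, a small sketch building on the code above:

standings <- data.frame(
  team   = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
  points = as.numeric(xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue)),
  stringsAsFactors = FALSE
)
standings[order(-standings$points), ]  # highest points first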

Scraping a complex HTML table into a data.frame in R

Maybe like this:

library(XML)
library(rvest)

# html() comes from an older rvest, where it returned an XML-package-compatible document;
# current rvest uses read_html() instead (an xml2 document; see the xml2 sketch below)
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "Wilson, JamesJames Wilson" "Jay, JohnJohn Jay†" "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
# [5] "Rutledge, JohnJohn Rutledge" "Iredell, JamesJames Iredell"

# the names are doubled because each cell contains a hidden sort-key <span>; remove those spans
removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "James Wilson" "John Jay†" "William Cushing" "John Blair, Jr." "John Rutledge" "James Iredell"

