Importing wikipedia tables in R
The function readHTMLTable in the XML package is ideal for this. Try the following:
library(XML)
doc <- readHTMLTable(
doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
doc[[6]]
V1 V2 V3 V4
1 County Population Land Area (sq mi) Population Density (per sq mi)
2 Alger 9,862 918 10.7
3 Baraga 8,735 904 9.7
4 Chippewa 38,413 1561 24.7
5 Delta 38,520 1170 32.9
6 Dickinson 27,427 766 35.8
7 Gogebic 17,370 1102 15.8
8 Houghton 36,016 1012 35.6
9 Iron 13,138 1166 11.3
10 Keweenaw 2,301 541 4.3
11 Luce 7,024 903 7.8
12 Mackinac 11,943 1022 11.7
13 Marquette 64,634 1821 35.5
14 Menominee 25,109 1043 24.3
15 Ontonagon 7,818 1312 6.0
16 Schoolcraft 8,903 1178 7.6
17 TOTAL 317,258 16,420 19.3
readHTMLTable returns a list of data.frames, one for each table element of the HTML page. You can use names to get information about each element:
> names(doc)
[1] "NULL"
[2] "toc"
[3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
[4] "NULL"
[5] "Cities and Villages of the Upper Peninsula"
[6] "Upper Peninsula Land Area and Population Density by County"
[7] "19th Century Population by Census Year of the Upper Peninsula by County"
[8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"
[9] "NULL"
[10] "NULL"
[11] "NULL"
[12] "NULL"
[13] "NULL"
[14] "NULL"
[15] "NULL"
[16] "NULL"
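Note that readHTMLTable returns every cell as a character string, so the comma-formatted counts need cleaning before you can compute with them. A minimal base-R sketch (the sample values are copied from the population column above):

```r
# Population counts come back as character strings with thousands separators
pop <- c("9,862", "8,735", "38,413")

# Strip the commas, then convert to numeric
pop_num <- as.numeric(gsub(",", "", pop))
pop_num
# [1]  9862  8735 38413
```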
Load a table from wikipedia into R
If you don't mind using a different package, you can try the "rvest" package.
library(rvest)
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
Option 1: Grab the tables from the page and use the html_table function to extract the tables you're interested in.
temp <- scotusURL %>%
  read_html() %>%
  html_nodes("table")
html_table(temp[1]) ## Just the "legend" table
html_table(temp[2]) ## The table you're interested in
Option 2: Inspect the table element and copy the XPath to read that table directly (right-click, inspect element, scroll to the relevant "table" tag, right-click on that, and select "Copy XPath").
scotusURL %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  html_table()
Another option I like is loading the data in a Google spreadsheet and reading it using the "googlesheets" package.
In Google Drive, create a new spreadsheet named, for instance "Supreme Court". In the first worksheet, enter:
=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)
This will automatically scrape this table into your Google spreadsheet.
From there, in R you can do:
library(googlesheets)
SC <- gs_title("Supreme Court")
gs_read(SC)
Webscraping Tables From Wikipedia in R
Making use of the rvest package, you could get the table by first selecting the element containing the desired table via html_element("table.wikitable.sortable") and then extracting the table via html_table(), like so:
library(rvest)
url <- "https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas"
html <- read_html(url)
county_table <- html %>%
html_element("table.wikitable.sortable") %>%
html_table()
head(county_table)
#> # A tibble: 6 x 14
#> County `Harry S. Truman… `Harry S. Truman… `Thomas E. Dewey… `Thomas E. Dewe…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 County # % # %
#> 2 Anders… 3,242 62.37% 1,199 23.07%
#> 3 Andrews 816 85.27% 101 10.55%
#> 4 Angeli… 4,377 69.05% 1,000 15.78%
#> 5 Aransas 418 61.02% 235 34.31%
#> 6 Archer 1,599 86.20% 191 10.30%
#> # … with 9 more variables: Strom ThurmondStates’ Rights Democratic <chr>,
#> # Strom ThurmondStates’ Rights Democratic.1 <chr>,
#> # Henry A. WallaceProgressive <chr>, Henry A. WallaceProgressive.1 <chr>,
#> # Various candidatesOther parties <chr>,
#> # Various candidatesOther parties.1 <chr>, Margin <chr>, Margin.1 <chr>,
#> # Total votes cast[11] <chr>
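As the <chr> column types above show, the counts and percentages come back as character columns. A minimal base-R sketch of converting them to numeric (sample values taken from the output above):

```r
# Vote counts carry thousands separators; percentages carry a trailing "%"
votes <- c("3,242", "816", "4,377")
pct   <- c("62.37%", "85.27%", "69.05%")

as.numeric(gsub(",", "", votes))
# [1] 3242  816 4377

as.numeric(sub("%", "", pct)) / 100
# [1] 0.6237 0.8527 0.6905
```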
How to remove reference from imported wikipedia table in R?
You can use gsub() to remove the bracketed reference markers (e.g. "[11]"):
library(dplyr)
dane %>%
mutate(across(.fns = ~ gsub("\\[.*?\\]", "", .)))
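The pattern "\\[.*?\\]" matches any bracketed reference marker non-greedily, so multiple markers in one cell are each removed. A self-contained illustration on made-up sample strings:

```r
# Sample cells as they might come out of a scraped Wikipedia table
x <- c("317,258[11]", "Dewey[note 1]", "no reference")

# Non-greedy match removes each bracketed marker without eating the rest of the cell
gsub("\\[.*?\\]", "", x)
# [1] "317,258"      "Dewey"        "no reference"
```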
extract a specific table from wikipedia in R
Since you have a specific table you want to scrape, you can identify it in the html_nodes() call by using the XPath of the webpage element:
library(dplyr)
library(rvest)
the_url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup"
the_url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>%
html_table(fill=TRUE)
Scraping Wikipedia HTML table with images, text, and blank cells with R
You can try the following:
require(rvest)
url <- "http://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City"
doc <- read_html(url)
col_names <- doc %>% html_nodes("#mw-content-text > table > tr:nth-child(1) > th") %>% html_text()
tbody <- doc %>% html_nodes("#mw-content-text > table > tr:not(:first-child)")
extract_tr <- function(tr){
scope <- tr %>% html_children()
c(scope[1:2] %>% html_text(),
scope[3:length(scope)] %>% html_node("img") %>% html_attr("alt"))
}
res <- tbody %>% sapply(extract_tr)
res <- as.data.frame(t(res), stringsAsFactors = FALSE)
colnames(res) <- col_names
Now you have the raw table. Parsing the columns to integer and cleaning up the column names is left to you.
How can I extract a particular element of Wikipedia table in R using rvest?
Using XPath, you could first get the infobox by its class name infobox and then all links via their tag name a.
library("rvest")
url <- "https://en.wikipedia.org/wiki/New_York_City"
infobox <- url %>%
read_html() %>%
html_nodes(xpath='//table[contains(@class, "infobox")]//a')
print(infobox)
Output
{xml_nodeset (81)}
[1] <a href="/wiki/City_(New_York)" class="mw-redirect" title="City (New York)">City</a>
[2] <a href="/wiki/File:NYC_Montage_2014_4_-_Jleon.jpg" class="image" title="Clockwise, from top: Midtow ...
[3] <a href="/wiki/Midtown_Manhattan" title="Midtown Manhattan">Midtown Manhattan</a>
[4] <a href="/wiki/Times_Square" title="Times Square">Times Square</a>
[5] <a href="/wiki/Unisphere" title="Unisphere">Unisphere</a>
...
Scraping html tables into R data frames using the XML package
…or a shorter try:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
# the picked table is the longest one on the page
tables[[which.max(n.rows)]]
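The which.max(n.rows) trick generalizes to any list of data frames: count the rows of each and index the list at the position of the maximum. A self-contained toy example (the data frames are made up):

```r
# Two toy "tables"; the second has more rows
tables <- list(short = data.frame(x = 1:2), long = data.frame(x = 1:5))

# Row count of each table in the list
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
n.rows
#> short  long
#>     2     5

tables[[which.max(n.rows)]]  # returns the 5-row data frame
```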