Importing Wikipedia Tables in R

Importing Wikipedia tables in R

The function readHTMLTable in package XML is ideal for this.

Try the following:

library(XML)
doc <- readHTMLTable(
  doc = "http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                V3                             V4
1       County Population Land Area (sq mi) Population Density (per sq mi)
2        Alger      9,862               918                           10.7
3       Baraga      8,735               904                            9.7
4     Chippewa     38,413              1561                           24.7
5        Delta     38,520              1170                           32.9
6    Dickinson     27,427               766                           35.8
7      Gogebic     17,370              1102                           15.8
8     Houghton     36,016              1012                           35.6
9         Iron     13,138              1166                           11.3
10    Keweenaw      2,301               541                            4.3
11        Luce      7,024               903                            7.8
12    Mackinac     11,943              1022                           11.7
13   Marquette     64,634              1821                           35.5
14   Menominee     25,109              1043                           24.3
15   Ontonagon      7,818              1312                            6.0
16 Schoolcraft      8,903              1178                            7.6
17       TOTAL    317,258            16,420                           19.3

readHTMLTable returns a list of data.frames, one for each table element on the HTML page. You can use names() to see what each element contains:

> names(doc)
[1] "NULL"
[2] "toc"
[3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
[4] "NULL"
[5] "Cities and Villages of the Upper Peninsula"
[6] "Upper Peninsula Land Area and Population Density by County"
[7] "19th Century Population by Census Year of the Upper Peninsula by County"
[8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"
[9] "NULL"
[10] "NULL"
[11] "NULL"
[12] "NULL"
[13] "NULL"
[14] "NULL"
[15] "NULL"
[16] "NULL"

Load a table from Wikipedia into R

If you don't mind using a different package, you can try the "rvest" package.

library(rvest)    
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
  • Option 1: Grab the tables from the page and use the html_table function to extract the tables you're interested in.

    temp <- scotusURL %>%
      read_html() %>%
      html_nodes("table")

    html_table(temp[1]) ## Just the "legend" table
    html_table(temp[2]) ## The table you're interested in
  • Option 2: Inspect the table element and copy the XPath to read that table directly (right-click, inspect element, scroll to the relevant "table" tag, right click on that, and select "Copy XPath").

    scotusURL %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
      html_table()

Another option I like is loading the data into a Google spreadsheet and reading it with the "googlesheets" package.

In Google Drive, create a new spreadsheet named, for instance, "Supreme Court". In the first worksheet, enter:

=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

This will automatically scrape this table into your Google spreadsheet.

From there, in R you can do:

library(googlesheets)
SC <- gs_title("Supreme Court")
gs_read(SC)
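
The first call to gs_title() will prompt you to authorize googlesheets with your Google account; you can run the OAuth step explicitly with gs_auth(), and pass ws = to gs_read() if the table is not on the first worksheet:

gs_auth()            # explicit OAuth handshake (opens a browser)
gs_read(SC, ws = 1)  # ws picks the worksheet to read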

Webscraping Tables From Wikipedia in R

Making use of the rvest package, you can get the table by first selecting the element containing it via html_element("table.wikitable.sortable") and then extracting it via html_table(), like so:

library(rvest)

url <- "https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas"

html <- read_html(url)

county_table <- html %>%
  html_element("table.wikitable.sortable") %>%
  html_table()

head(county_table)
#> # A tibble: 6 x 14
#> County `Harry S. Truman… `Harry S. Truman… `Thomas E. Dewey… `Thomas E. Dewe…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 County # % # %
#> 2 Anders… 3,242 62.37% 1,199 23.07%
#> 3 Andrews 816 85.27% 101 10.55%
#> 4 Angeli… 4,377 69.05% 1,000 15.78%
#> 5 Aransas 418 61.02% 235 34.31%
#> 6 Archer 1,599 86.20% 191 10.30%
#> # … with 9 more variables: Strom ThurmondStates’ Rights Democratic <chr>,
#> # Strom ThurmondStates’ Rights Democratic.1 <chr>,
#> # Henry A. WallaceProgressive <chr>, Henry A. WallaceProgressive.1 <chr>,
#> # Various candidatesOther parties <chr>,
#> # Various candidatesOther parties.1 <chr>, Margin <chr>, Margin.1 <chr>,
#> # Total votes cast[11] <chr>
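
The first row of the scraped tibble repeats the header ("County # %"), and the counts and percentages come through as character strings. A possible cleanup sketch, using readr::parse_number() to drop the commas and percent signs:

library(dplyr)
library(readr)

county_clean <- county_table[-1, ] %>%          # drop the repeated header row
  mutate(across(-County, ~ parse_number(.x)))   # "3,242" -> 3242, "62.37%" -> 62.37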

How to remove references from an imported Wikipedia table in R?

You can use gsub() to remove the bracketed reference markers (such as [11]) from every column.

library(dplyr)

dane %>%
  mutate(across(.fns = ~ gsub("\\[.*?\\]", "", .)))  # strip anything in square brackets
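
As a quick illustration with a made-up stand-in for dane:

library(dplyr)

dane <- data.frame(city = c("New York[1]", "Chicago[note 2]"),
                   pop  = c("8,804,190[11]", "2,746,388"))

dane %>%
  mutate(across(.fns = ~ gsub("\\[.*?\\]", "", .)))
#>       city       pop
#> 1 New York 8,804,190
#> 2  Chicago 2,746,388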

Extract a specific table from Wikipedia in R

Since you have a specific table you want to scrape, you can identify it in the html_nodes() call by using the XPath of the page element:

library(dplyr)
library(rvest)

the_url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup"

the_url %>%
  read_html() %>%
  html_nodes(xpath = '/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>%
  html_table(fill = TRUE)
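
Since html_nodes() returns a node set, html_table(fill = TRUE) hands back a list of length one here. If you'd rather get the data frame directly, html_node() (singular) keeps only the first match:

the_url %>%
  read_html() %>%
  html_node(xpath = '/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>%
  html_table(fill = TRUE)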

Scraping Wikipedia HTML table with images, text, and blank cells with R

You can try the following:

library(rvest)
url <- "http://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City"
doc <- read_html(url)

# Header cells of the table
col_names <- doc %>%
  html_nodes("#mw-content-text > table > tr:nth-child(1) > th") %>%
  html_text()

# All body rows (everything except the header row)
tbody <- doc %>%
  html_nodes("#mw-content-text > table > tr:not(:first-child)")

extract_tr <- function(tr){
  scope <- tr %>% html_children()
  # Text from the first two cells; for the remaining cells, the alt text
  # of the embedded star image (blank cells yield NA)
  c(scope[1:2] %>% html_text(),
    scope[3:length(scope)] %>% html_node("img") %>% html_attr("alt"))
}

res <- tbody %>% sapply(extract_tr)
res <- as.data.frame(t(res), stringsAsFactors = FALSE)
colnames(res) <- col_names

Now you have the raw table. I'll leave parsing the columns to integers and tidying the column names to you.
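
One possible way to finish, assuming the star images' alt text encodes the count as digits (that format is an assumption, so inspect res first):

# Assumption: alt text like "2 stars"; blank cells stay NA
star_cols <- 3:ncol(res)
res[star_cols] <- lapply(res[star_cols],
                         function(x) as.integer(gsub("\\D", "", x)))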

How can I extract a particular element of Wikipedia table in R using rvest?

Using XPath, you can first select the infobox by its class name (infobox) and then grab all links inside it via their tag name (a).

library("rvest")

url <- "https://en.wikipedia.org/wiki/New_York_City"
infobox <- url %>%
  read_html() %>%
  html_nodes(xpath = '//table[contains(@class, "infobox")]//a')

print(infobox)

Output

{xml_nodeset (81)}
[1] <a href="/wiki/City_(New_York)" class="mw-redirect" title="City (New York)">City</a>
[2] <a href="/wiki/File:NYC_Montage_2014_4_-_Jleon.jpg" class="image" title="Clockwise, from top: Midtow ...
[3] <a href="/wiki/Midtown_Manhattan" title="Midtown Manhattan">Midtown Manhattan</a>
[4] <a href="/wiki/Times_Square" title="Times Square">Times Square</a>
[5] <a href="/wiki/Unisphere" title="Unisphere">Unisphere</a>
...
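
From that node set you can pull out individual pieces with html_text() and html_attr(); for example, collecting the link text and targets into a data frame:

links <- data.frame(text = html_text(infobox),
                    href = html_attr(infobox, "href"),
                    stringsAsFactors = FALSE)
head(links)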

Scraping html tables into R data frames using the XML package

Or, for a shorter approach:

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",
                 .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)  # drop empty matches
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))         # row count per table

The table picked is the longest one on the page:

tables[[which.max(n.rows)]]
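
To see which table the longest-table heuristic actually picked, check its name:

names(tables)[which.max(n.rows)]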

