Importing wikipedia tables in R
The function readHTMLTable in the XML package is ideal for this. Try the following:
library(XML)
doc <- readHTMLTable(
doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
doc[[6]]
V1 V2 V3 V4
1 County Population Land Area (sq mi) Population Density (per sq mi)
2 Alger 9,862 918 10.7
3 Baraga 8,735 904 9.7
4 Chippewa 38,413 1561 24.7
5 Delta 38,520 1170 32.9
6 Dickinson 27,427 766 35.8
7 Gogebic 17,370 1102 15.8
8 Houghton 36,016 1012 35.6
9 Iron 13,138 1166 11.3
10 Keweenaw 2,301 541 4.3
11 Luce 7,024 903 7.8
12 Mackinac 11,943 1022 11.7
13 Marquette 64,634 1821 35.5
14 Menominee 25,109 1043 24.3
15 Ontonagon 7,818 1312 6.0
16 Schoolcraft 8,903 1178 7.6
17 TOTAL 317,258 16,420 19.3
readHTMLTable returns a list of data.frames, one for each table element of the HTML page. You can use names to get information about each element:
> names(doc)
[1] "NULL"
[2] "toc"
[3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
[4] "NULL"
[5] "Cities and Villages of the Upper Peninsula"
[6] "Upper Peninsula Land Area and Population Density by County"
[7] "19th Century Population by Census Year of the Upper Peninsula by County"
[8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"
[9] "NULL"
[10] "NULL"
[11] "NULL"
[12] "NULL"
[13] "NULL"
[14] "NULL"
[15] "NULL"
[16] "NULL"
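Note that readHTMLTable returns every cell as a character string, so the comma-formatted counts need cleaning before you can compute with them. A minimal base-R sketch (the sample values are copied from the population column above):

```r
# Population counts come back as character strings with thousands separators
pop <- c("9,862", "8,735", "38,413")

# Strip the commas, then convert to numeric
pop_num <- as.numeric(gsub(",", "", pop))
pop_num
# [1]  9862  8735 38413
```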
Load a table from wikipedia into R
If you don't mind using a different package, you can try the "rvest" package.
library(rvest)
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
Option 1: Grab the tables from the page and use the html_table function to extract the tables you're interested in.
temp <- scotusURL %>%
  read_html() %>%
  html_nodes("table")
html_table(temp[1]) ## Just the "legend" table
html_table(temp[2]) ## The table you're interested in
Option 2: Inspect the table element and copy the XPath to read that table directly (right-click, inspect element, scroll to the relevant "table" tag, right-click on that, and select "Copy XPath").
scotusURL %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  html_table()
Another option I like is loading the data in a Google spreadsheet and reading it using the "googlesheets" package.
In Google Drive, create a new spreadsheet named, for instance "Supreme Court". In the first worksheet, enter:
=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)
This will automatically scrape this table into your Google spreadsheet.
From there, in R you can do:
library(googlesheets)
SC <- gs_title("Supreme Court")
gs_read(SC)
Webscraping Tables From Wikipedia in R
Making use of the rvest package, you could get the table by first selecting the element containing the desired table via html_element("table.wikitable.sortable") and then extracting the table via html_table(), like so:
library(rvest)
url <- "https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas"
html <- read_html(url)
county_table <- html %>%
html_element("table.wikitable.sortable") %>%
html_table()
head(county_table)
#> # A tibble: 6 x 14
#> County `Harry S. Truman… `Harry S. Truman… `Thomas E. Dewey… `Thomas E. Dewe…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 County # % # %
#> 2 Anders… 3,242 62.37% 1,199 23.07%
#> 3 Andrews 816 85.27% 101 10.55%
#> 4 Angeli… 4,377 69.05% 1,000 15.78%
#> 5 Aransas 418 61.02% 235 34.31%
#> 6 Archer 1,599 86.20% 191 10.30%
#> # … with 9 more variables: Strom ThurmondStates’ Rights Democratic <chr>,
#> # Strom ThurmondStates’ Rights Democratic.1 <chr>,
#> # Henry A. WallaceProgressive <chr>, Henry A. WallaceProgressive.1 <chr>,
#> # Various candidatesOther parties <chr>,
#> # Various candidatesOther parties.1 <chr>, Margin <chr>, Margin.1 <chr>,
#> # Total votes cast[11] <chr>
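As the <chr> column types above show, the counts and percentages come back as character columns. A minimal base-R sketch of converting them to numeric (sample values taken from the output above):

```r
# Vote counts carry thousands separators; percentages carry a trailing "%"
votes <- c("3,242", "816", "4,377")
pct   <- c("62.37%", "85.27%", "69.05%")

as.numeric(gsub(",", "", votes))
# [1] 3242  816 4377

as.numeric(sub("%", "", pct)) / 100
# [1] 0.6237 0.8527 0.6905
```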
How to remove reference from imported wikipedia table in R?
You can use gsub() to remove the bracketed reference markers (e.g. "[11]"):
library(dplyr)
dane %>%
mutate(across(.fns = ~ gsub("\\[.*?\\]", "", .)))
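The pattern "\\[.*?\\]" matches any bracketed reference marker non-greedily, so multiple markers in one cell are each removed. A self-contained illustration on made-up sample strings:

```r
# Sample cells as they might come out of a scraped Wikipedia table
x <- c("317,258[11]", "Dewey[note 1]", "no reference")

# Non-greedy match removes each bracketed marker without eating the rest of the cell
gsub("\\[.*?\\]", "", x)
# [1] "317,258"      "Dewey"        "no reference"
```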
extract a specific table from wikipedia in R
Since you have a specific table you want to scrape, you can identify it in the html_nodes() call by using the XPath of the webpage element:
library(dplyr)
library(rvest)
the_url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup"
the_url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>%
html_table(fill=TRUE)
Scraping Wikipedia HTML table with images, text, and blank cells with R
You can try the following:
require(rvest)
url <- "http://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City"
doc <- read_html(url)
col_names <- doc %>% html_nodes("#mw-content-text > table > tr:nth-child(1) > th") %>% html_text()
tbody <- doc %>% html_nodes("#mw-content-text > table > tr:not(:first-child)")
extract_tr <- function(tr){
scope <- tr %>% html_children()
c(scope[1:2] %>% html_text(),
scope[3:length(scope)] %>% html_node("img") %>% html_attr("alt"))
}
res <- tbody %>% sapply(extract_tr)
res <- as.data.frame(t(res), stringsAsFactors = FALSE)
colnames(res) <- col_names
Now you have the raw table. Parsing the columns to integer and cleaning up the column names is left to you.
How can I extract a particular element of Wikipedia table in R using rvest?
Using XPath, you could first get the infobox by its class name infobox and then all links via their tag name a.
library("rvest")
url <- "https://en.wikipedia.org/wiki/New_York_City"
infobox <- url %>%
read_html() %>%
html_nodes(xpath='//table[contains(@class, "infobox")]//a')
print(infobox)
Output
{xml_nodeset (81)}
[1] <a href="/wiki/City_(New_York)" class="mw-redirect" title="City (New York)">City</a>
[2] <a href="/wiki/File:NYC_Montage_2014_4_-_Jleon.jpg" class="image" title="Clockwise, from top: Midtow ...
[3] <a href="/wiki/Midtown_Manhattan" title="Midtown Manhattan">Midtown Manhattan</a>
[4] <a href="/wiki/Times_Square" title="Times Square">Times Square</a>
[5] <a href="/wiki/Unisphere" title="Unisphere">Unisphere</a>
...
Scraping html tables into R data frames using the XML package
…or a shorter try:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
# the picked table is the longest one on the page
tables[[which.max(n.rows)]]
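The which.max(n.rows) trick generalizes to any list of data frames: count the rows of each and index the list at the position of the maximum. A self-contained toy example (the data frames are made up):

```r
# Two toy "tables"; the second has more rows
tables <- list(short = data.frame(x = 1:2), long = data.frame(x = 1:5))

# Row count of each table in the list
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
n.rows
#> short  long
#>     2     5

tables[[which.max(n.rows)]]  # returns the 5-row data frame
```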