R: Extracting "Clean" UTF-8 Text from a Web Page Scraped with RCurl

I seem to have found an answer and nobody else has yet posted one, so here goes.

Earlier @kohske commented that the code worked for him once the Encoding() call was removed. That got me thinking that he probably has a Japanese locale, which in turn suggested that there was a locale issue on my machine that affects R even though Perl avoids the problem. I recalibrated my search and found this question on sourcing a UTF-8 file, in which the original poster had run into a similar problem. The answer involved switching the locale. I experimented and found that switching my locale to Japanese seems to solve the problem, as this screenshot shows:

Output from updated R code

Updated R code follows.

require(RCurl)
require(XML)

# URLs to scrape (Yahoo! Finance Japan stock pages)
links <- list()
links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

# Remember the current locale, then switch LC_CTYPE to Japanese so that
# R on Windows handles the UTF-8 content correctly
print(Sys.getlocale(category = "LC_CTYPE"))
original_ctype <- Sys.getlocale(category = "LC_CTYPE")
Sys.setlocale("LC_CTYPE", "japanese")

# Fetch the pages as UTF-8 text
txt <- getURL(links, .encoding = "UTF-8")

# Write the raw HTML to disk as UTF-8, then restore the original locale
write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
Sys.setlocale("LC_CTYPE", original_ctype)

So we have to programmatically mess around with the locale. Frankly, I'm a bit embarrassed that we apparently need such a kludge for R on Windows in the year 2012. As I note above, Perl on the same version of Windows and in the same locale gets round the issue somehow, without requiring me to change my system settings.

The output of the updated R code above is HTML, of course. For those interested, the following code succeeds fairly well in stripping out the HTML and saving raw text, although the result needs quite a lot of tidying up.

require(RCurl)
require(XML)

links <- list()
links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

# Switch LC_CTYPE to Japanese as before, remembering the original locale
print(Sys.getlocale(category = "LC_CTYPE"))
original_ctype <- Sys.getlocale(category = "LC_CTYPE")
Sys.setlocale("LC_CTYPE", "japanese")

txt <- getURL(links, .encoding = "UTF-8")

# Parse the HTML and keep only text nodes that are not inside
# script, style or noscript elements
myhtml <- htmlTreeParse(txt, useInternal = TRUE)
cleantxt <- xpathApply(myhtml, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)

write.table(cleantxt, "c:/geturl_r.txt", col.names = FALSE, quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
Sys.setlocale("LC_CTYPE", original_ctype)

Is there a simple way in R to extract only the text elements of an HTML page?

I had to do this once upon a time myself.

One way of doing it is to make use of XPath expressions. You will need these packages installed from the repository at http://www.omegahat.org/

library(RCurl)
library(RTidyHTML)
library(XML)

We use RCurl to connect to the website of interest. It has lots of options which, it's fair to say, allow you to access websites that the default functions in base R would have difficulty with. It is an R-interface to the libcurl library.

We use RTidyHTML to clean up malformed HTML web pages so that they are easier to parse. It is an R-interface to the libtidy library.

We use XML to parse the HTML code with our XPath expressions. It is an R-interface to the libxml2 library.

Anyways, here's what you do (minimal code, but options are available; see the help pages of the corresponding functions):

u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
doc.raw <- getURL(u)                      # fetch the raw HTML
doc <- tidyHTML(doc.raw)                  # fix malformed HTML so it parses cleanly
html <- htmlTreeParse(doc, useInternal = TRUE)
# keep only text nodes outside script, style and noscript elements
txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
cat(unlist(txt))

There may be some problems with this approach, but I can't remember what they are off the top of my head. I don't think my XPath expression works with all web pages: sometimes it might not filter out script code, or it may just not work with some pages at all. Best to experiment!

P.S. Another way, which I think works almost perfectly for scraping all the text from HTML, is the following (basically getting Internet Explorer to do the conversion for you):

library(RDCOMClient)
u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
ie <- COMCreate("InternetExplorer.Application")  # start an Internet Explorer COM instance
ie$Navigate(u)                                   # load the page
txt <- list()
txt[[u]] <- ie[["document"]][["body"]][["innerText"]]  # let IE render, then grab the visible text
ie$Quit()
print(txt)

HOWEVER, I've never liked doing this. Not only is it slow, but if you vectorise it over a vector of URLs and Internet Explorer crashes on a bad page, R might hang or crash itself (I don't think ?try helps that much in this case). It's also prone to allowing pop-ups. I don't know, it's been a while since I've done this, but I thought I should point it out.
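
If you do loop over several URLs anyway, something like the following sketch keeps one bad page from killing the whole run. The scrape_with_ie helper and the busy-wait on the Busy property are my own additions (assumptions about the usual InternetExplorer.Application COM interface), not part of the original approach:

library(RDCOMClient)

urls <- c("http://stackoverflow.com/questions/tagged?tagnames=r",
          "http://stackoverflow.com/questions/tagged?tagnames=xml")

# Hypothetical helper: scrape one URL with IE, always closing the instance
scrape_with_ie <- function(u) {
  ie <- COMCreate("InternetExplorer.Application")
  on.exit(ie$Quit())                  # quit IE even if something goes wrong
  ie$Navigate(u)
  while (ie[["Busy"]]) Sys.sleep(1)   # crude wait until the page has loaded
  ie[["document"]][["body"]][["innerText"]]
}

# tryCatch returns NA for any page that makes IE (or the COM call) fall over
txt <- lapply(urls, function(u) tryCatch(scrape_with_ie(u),
                                         error = function(e) NA_character_))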

Use RCurl to bypass disclaimer page then do the web scraping

As I mention in my comment, the solution to your problem will depend entirely on the implementation of the "disclaimer page." It looks like the previous solution used cURL options defined in more detail here. Basically, it instructs cURL to provide a fake cookies file (named "nosuchfile") and then follow the header redirect given by the site you were trying to access. Apparently that site was set up in such a way that if a visitor claimed not to have the proper cookies, it would immediately redirect the visitor past the disclaimer page.
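
In RCurl terms, that trick might look something like this sketch; the target URL is a placeholder, and the option names are the standard libcurl ones (cookiefile, followlocation) passed straight through RCurl:

library(RCurl)

# A sketch, assuming the disclaimer works purely via cookies plus a redirect;
# "http://example.com/page-behind-disclaimer" stands in for the real target URL.
page <- getURL("http://example.com/page-behind-disclaimer",
               cookiefile     = "nosuchfile",   # point cURL at a cookies file that does not exist
               followlocation = TRUE)           # follow the redirect past the disclaimer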

You didn't happen to create a file named "nosuchfile" in your working directory, did you? If not, it sounds like the target site changed the way its disclaimer page operates. If that's the case, there's really no help we can provide unless we can see the actual page you're trying to access.

In the example you reference in your question, they're using JavaScript to move past the disclaimer, which could be tricky to get past.

For the example you mention, however...

  1. Open it in Chrome (or Firefox with Firebug)
  2. Right click on some blank space in the page and select "Inspect Element"
  3. Click the Network tab
  4. If there's content there, click the "Clear" button at the bottom to clear out the captured requests.
  5. Accept the license agreement
  6. Watch all of the traffic that comes across the network. In my case, the top result was the interesting one. If you click it, you can preview it to verify that it is, indeed, an HTML document. If you click on the "Headers" tab under that item, it will show you the "Request URL". In my case, that was: http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?app=eINVCFundPriceDividend&pri_fund_code=U42360&data_selection=0&keyword=U42360&start_day=30&start_month=03&start_year=2012&end_day=18&end_month=04&end_year=2012&data_selection2=0

You can access that URL directly without having to accept any license agreement, either by hand or from cURL.
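
For example, a minimal RCurl sketch of that direct request, using the Request URL captured above, could look like this:

library(RCurl)

# The "Request URL" captured from the Network tab in the steps above
u <- paste0("http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif",
            "?app=eINVCFundPriceDividend&pri_fund_code=U42360&data_selection=0",
            "&keyword=U42360&start_day=30&start_month=03&start_year=2012",
            "&end_day=18&end_month=04&end_year=2012&data_selection2=0")

page <- getURL(u)   # no disclaimer click-through needed for this endpoint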

Note that if you've already accepted the agreement, this site stores a cookie stating such which will need to be deleted in order to get back to the license agreement page. You can do this by clicking the "Resources" tab, then going to "Cookies" and deleting each one, then refreshing the URL you posted above.
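
The R-side equivalent, for what it's worth, is simply to use a fresh curl handle with an empty cookie file, so that no previously accepted agreement is sent along. A small sketch, with the URL as a placeholder:

library(RCurl)

h <- getCurlHandle(cookiefile = "")              # fresh handle, no stored cookies
page <- getURL("http://example.com/some-page",   # placeholder URL
               curl = h)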

How to extract all visible text from a webpage using R

As the page has a good deal of JavaScript, a combination of RSelenium, rvest, and htm2txt is helpful. The function htm2txt::htm2txt() will take care of (i.e. parse out or remove) lots of JavaScript formatting snippets that would be difficult to exclude using plain rvest.

library(RSelenium)
library(rvest)
library(htm2txt)
library(tidyverse)

# start a Selenium-driven Firefox session and get the client object
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = TRUE)
remDr <- rD[["client"]]

remDr$navigate("https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred")

# grab the rendered page source, parse it, and convert the <body> to plain text
captured_text <-
  remDr$getPageSource()[[1]] %>%
  read_html(encoding = "UTF-8") %>%
  html_node(xpath = "//body") %>%
  as.character() %>%
  htm2txt::htm2txt()

captured_text
[1] "Skip to content\n\nMenuMenu\n\nThe following navigation element is controlled via arrow keys followed by tab\n\nMy Account\nMy Account\n\nPersonal Accounts\n\n• Account Summary\n\n• View Statement\n\n• Manage Account\n\n• Make a Payment\n\n• Manage Pre-Authorized Payment\n\n• Add Someone to Your Account\n\nBusiness Accounts\n\n• Business Account Summary\n\n• American Express @Work\n\n• Merchant Services\n\nOnline Services\n\n• Register for Online Services\n\n• Activate Your Card\n\n• American Express App\n\n• Manage Account Alerts\n\n• Sign Up for Email Offers\n\n• Online-Only Statements\n\nHelp & Support\n\n• Forgot User ID or Password?\n\n• Support 24/7\n\n• Welcome Centre\n\n• Ways to Pay\n\n• Security Centre\n\nCanadaChange Country\n\nEnglish\n\n• Français\n\nCards\nCards\n\nPersonal Cards\n\n• View All Cards\n\n• Cash Back Credit Cards\n\n• Flexible Rewards Cards\n\n• No Annual Fee Cards\n\n• Co-Branded Cards\n\n• Travel Cards\n\nFeatured Cards\n\n• The American Express Aeroplan Reserve Card\n\n• The Cobalt Card\n\n• The SimplyCash Preferred Card\n\n• The Choice Card..."

How to isolate a single element from a scraped web page in R

These questions are very helpful when dealing with web scraping and XML in R:

  1. Scraping html tables into R data frames using the XML package
  2. How to transform XML data into a data.frame?

With regards to your particular example, while I'm not sure what you want the output to look like, this gets the "goals scored" as a character vector:

library(XML)

theURL <- "http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
fifa.doc <- htmlParse(theURL)
# grab the text of every div with class "cont", then keep the "Goals scored" entry
fifa <- xpathSApply(fifa.doc, "//*/div[@class='cont']", xmlValue)
goals.scored <- grep("Goals scored", fifa, value = TRUE)

The xpathSApply function gets all the values that match the given criteria and returns them as a vector. Note how I'm looking for a div with class='cont'; class values are frequently a good way to navigate an HTML document because they are reliable markers.

You can clean this up however you want:

> gsub("Goals scored", "", strsplit(goals.scored, ", ")[[1]])
[1] "Philipp LAHM (GER) 6'" "Paulo WANCHOPE (CRC) 12'" "Miroslav KLOSE (GER) 17'" "Miroslav KLOSE (GER) 61'" "Paulo WANCHOPE (CRC) 73'"
[6] "Torsten FRINGS (GER) 87'"

How to source() .R file saved using UTF-8 encoding?

We talked about this a lot in the comments to my previous post, but I don't want this to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see the screenshot in the comments) and with input from a file, as this screenshot shows:

Sample Image

The file "myfile.r" contains:

russian <- function() print ("Американские с...");

The console contains:

source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."

Note that the file input fails before the locale is set, and the error points to the same character as the original poster's error (the one after "R). I cannot test this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: you just replace the locale with "chinese" (the naming is a bit inconsistent, so consult the documentation).
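
For convenience, here is a minimal sketch of that pattern wrapped in a function; source_with_locale is a hypothetical helper name, not something from the original answer:

# Hypothetical helper: source a UTF-8 file under a temporary LC_CTYPE locale
source_with_locale <- function(file, locale, encoding = "UTF-8") {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old))   # always restore the original locale
  Sys.setlocale("LC_CTYPE", locale)
  source(file, encoding = encoding)
}

# e.g. source_with_locale("myfile.r", "ru")        # as in the transcript above
#      source_with_locale("myfile.r", "chinese")   # Windows locale name for Chinese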

How to build a web scraper in R using readLines and grep?

If you really insist on using grep and readLines, you will face the problem of cleaning up the scraped page, but of course it can be done. E.g.:

Load the page:

html <- readLines('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')

And with the help of str_extract from the stringr package and a simple regular expression, you are done:

library(stringr)
body <- str_extract(paste(html, collapse='\n'), '<div id="article-body-blocks">.*</div>')

Well, body looks ugly; you will still have to clean out the <p> tags and scripts. This can be done with gsub and friends (nice regular expressions). For example:

gsub('<script(.*?)script>|<span(.*?)>|<div(.*?)>|</div>|</p>|<p(.*?)>|<a(.*?)>|\n|\t', '', body)

As @Andrie suggested, you should rather use some packages built for this purpose. Small demo:

library(XML)
library(RCurl)
webpage <- getURL('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# parse the HTML and pull out the paragraphs inside the article-body div
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding = 'UTF-8')
body <- xpathSApply(pagetree, "//div[@id='article-body-blocks']/p", xmlValue)

Where body results in a clean text:

> str(body)
chr [1:33] "The deputy prime minister, Nick Clegg, has said the government's regional growth fund will provide a \"snowball effect that cre"| __truncated__ ...

Update: The above as a one-liner (thanks to @Martin Morgan for the suggestion):

xpathSApply(htmlTreeParse('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs', useInternalNodes = TRUE, encoding='UTF-8'), "//div[@id='article-body-blocks']/p", xmlValue)

Extracting html tables from website

Looks like they're building the page using JavaScript by accessing http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ and parsing out some string. Maybe you could grab that data and parse it out instead of scraping the page itself.

Looks like you'll have to build a request with the proper referrer headers using cURL, though. As you can see, you can't just hit that ajaxGetQuote page with a bare request.

You can probably read the appropriate headers to put in by using the Web Inspector in Chrome or Safari, or by using Firebug in Firefox.
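
A rough RCurl sketch of such a request could look like the following; the header values are placeholder assumptions, so copy the real ones from the inspector:

library(RCurl)

u <- "http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ"

raw <- getURL(u,
              httpheader = c(
                Referer      = "http://www.nse-india.com/",          # assumed value
                `User-Agent` = "Mozilla/5.0 (compatible; RCurl)"     # assumed value
              ))

raw should then contain the string that the page's JavaScript normally parses out.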

UTF-8 file output in R

The problem is due to some R-on-Windows-specific behaviour (R uses the default system encoding, or some system write functions; I do not know the specifics, but the behaviour is well known).

To write UTF-8 encoded text on Windows, one has to use the useBytes = TRUE option in writeLines (and declare the encoding when reading the file back with readLines):

txt <- "在"
writeLines(txt, "test.txt", useBytes = TRUE)   # write the raw UTF-8 bytes, bypassing re-encoding

readLines("test.txt", encoding = "UTF-8")      # read back, declaring the encoding
[1] "在"

Kevin Ushey's really well written article at http://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/ goes into much more detail.


