Nokogiri, Open-Uri, and Unicode Characters

Nokogiri, open-uri, and Unicode Characters

When you say "looks like this," are you viewing this value in IRB? IRB is going to escape characters outside the ASCII range with C-style escapes of the byte sequences that represent the characters.

If you print them with puts, you'll get them back as you expect, presuming your shell/console is using the same encoding as the string in question (apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a file handle should also result in UTF-8 sequences.
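To see the difference for yourself, here is a minimal sketch (the string and byte values are illustrative):

# p (and IRB) show the escaped byte sequences of a binary string,
# while puts writes the raw bytes straight to the console.
bytes = "caf\xC3\xA9".force_encoding('ASCII-8BIT')
p bytes                            # => "caf\xC3\xA9"
puts bytes.force_encoding('UTF-8') # => café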

If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.

For 1.9, see: http://blog.grayproductions.net/articles/ruby_19s_string
For 1.8, you probably need to look at Iconv.
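As a quick sketch of both approaches, assuming the source data is Latin-1 (substitute whatever encoding you actually have; latin1_string is a stand-in variable):

# Ruby 1.9+: String#encode(target, source)
utf8 = latin1_string.encode('UTF-8', 'ISO-8859-1')

# Ruby 1.8: Iconv.conv(target, source, string)
require 'iconv'
utf8 = Iconv.conv('UTF-8', 'ISO-8859-1', latin1_string)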

Also, if you need to interact with COM components in Windows, you'll need to tell Ruby to use the correct encoding with something like the following:

require 'win32ole'

WIN32OLE.codepage = WIN32OLE::CP_UTF8

If you're interacting with MySQL, you'll need to set the collation on the table to one that supports the encoding you're working with. In general, it's best to set the collation to UTF-8, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.
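For example, a sketch using the mysql2 gem (the connection details and table name here are made up; the ALTER TABLE statement is standard MySQL):

require 'mysql2'

client = Mysql2::Client.new(host: 'localhost', username: 'user', database: 'mydb')
# Convert an existing table to a UTF-8 character set and collation
client.query("ALTER TABLE articles CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci")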

Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave the explanation to someone else.
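One such feature worth knowing, and one that shows up in answers further down this page: the parsers accept an explicit encoding as a third argument (html_string here stands in for whatever markup you've read):

doc = Nokogiri::HTML(html_string, nil, 'UTF-8')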

Rails won't read a link with nokogiri and open-uri

Try this:

def scrape
  @url = watched_link_params[:url]

  page = Nokogiri::HTML(open(@url))
  puts page
end

You will need to pass in the entire URL, including the protocol designator; that is to say, you need to use http://www.google.com instead of www.google.com:

>> params = ActionController::Parameters.new(default: {url: 'http://www.google.com'})
>> watched_link_params = params.require(:default).permit(:url)
>> @url = watched_link_params[:url]
"http://www.google.com"
>> page = Nokogiri::HTML(open(@url))
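If you can't guarantee that users will include the scheme, here's a minimal defensive sketch (variable names borrowed from the question) that prepends one before parsing:

url = watched_link_params[:url]
url = "http://#{url}" unless url =~ %r{\Ahttps?://} # add a default scheme if missing
page = Nokogiri::HTML(open(url))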

Parsing document with special characters, using Nokogiri

EDIT: I did a bit more work looking at the page and at how you're trying to process it, and I think this works better. I also changed how you process the page, because the original wasn't as clear as I like, for maintainability and readability.

require 'addressable/uri'
require 'nokogiri'
require 'open-uri'

def get_chapter(base_url, params = {})
  uri = Addressable::URI.parse(base_url)
  uri.query_values = params

  doc = Nokogiri::XML(open(uri.to_s))
  doc.encoding = 'UTF-8'

  div = doc.at_css('.result-text-style-normal')
  div.css('.footnotes').remove
  div.css('h4').remove

  doc
end

page = get_chapter('http://www.biblegateway.com/passage/', :search => 'Mateo1-2', :version => 'NVI')
puts page.content

Rather than building a URL like you were, I prefer seeing it passed in as chunks, with the base URL and parameters split. I build the URI using the Addressable gem, which is my go-to for munging URLs. Ruby's built-in URI is having some growing pains right now related to the encoding of parameters.
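As a quick illustration of what Addressable handles for you (the URL here is made up), note how a non-ASCII query value is percent-encoded automatically:

uri = Addressable::URI.parse('http://example.com/search')
uri.query_values = { 'q' => 'café' }
uri.to_s # => "http://example.com/search?q=caf%C3%A9"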

The document at the far end of the URL you gave says it is XHTML, so it should meet the XHTML specs. You can parse XHTML using Nokogiri::HTML(), but I think you get better results using Nokogiri::XML(), which is more strict.

To give Nokogiri an additional nudge in the right direction for parsing the content, I add:

doc.encoding = 'UTF-8'

I prefer finding the desired div, assigning it to a temporary variable, and working from that point, rather than chaining off the parse step as you did. It's a bit more idiomatic and readable this way, because we're dealing with chunks of the document.
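For contrast, a small sketch of the two styles, reusing the selectors from above:

# Chained to the parse step (harder to scan):
Nokogiri::XML(open(uri.to_s)).at_css('.result-text-style-normal').css('.footnotes').remove

# With a temporary variable (each step is visible):
div = doc.at_css('.result-text-style-normal')
div.css('.footnotes').remove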

Running the code outputs what appears to be nice, clean content. There is some embedded JavaScript, but that is unavoidable because JavaScript is treated as text inside <script> tags. That isn't an issue if you are presenting the HTML for a browser to render.

Ruby - nokogiri, open-uri - Fail to parse page

First of all, why are you parsing it as XML?
The following should be correct, considering your page is an HTML website:

page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")

Furthermore, if you want to pull out all the links (a tags), this is how:

page.css('a').each do |element|
  puts element
end
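And if you only want the href values rather than the whole elements, a small variation:

page.css('a').each do |element|
  puts element['href']
end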

Nokogiri chokes on document with Unicode char (I think) in title attribute

Are you sure that the character after Sã is a valid UTF-8 character?

Added: There are illegal UTF-8 character sequences in your data. To decode UTF-8 manually, try an online UTF-8 decoder: you can enter the incoming hex and it will tell you what each individual byte means.

For background, see a good overview of UTF-8 and a UTF-8 code chart.

Re: removing the null character: your code looks OK; try it out! But in addition, I'd investigate the source of the null in your incoming datastream.
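In case it helps, stripping nulls from a plain Ruby string is a one-liner (a sketch, not necessarily how your code does it):

clean = str.delete("\x00")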

Also, the binary UTF-8 in your original post is, in fact, the replacement character, not your original datastream. Here is what is in your post:

53 C3 A3 EF BF BD 6E 67

Here is the decoding:

U+0053 LATIN CAPITAL LETTER S
U+00E3 LATIN SMALL LETTER A WITH TILDE (ã)
U+FFFD REPLACEMENT CHARACTER (�)  # the char used when the original is not understood
U+006E LATIN SMALL LETTER N
U+0067 LATIN SMALL LETTER G
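If you want to reproduce that byte-level inspection in a modern Ruby itself, a minimal sketch using the bytes above:

str = "S\xC3\xA3\xEF\xBF\xBDng".force_encoding('UTF-8')
puts str.bytes.map { |b| format('%02X', b) }.join(' ')
# => 53 C3 A3 EF BF BD 6E 67
str.each_char { |c| puts format('U+%04X %s', c.ord, c.inspect) }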

Ruby 2 Upgrade Breaks Nokogiri and/or open-uri Encoding?

OK, here's an answer, and maybe the answer. Ruby 2.0 changed how it sends headers in HTTP requests and handles gzipping/deflating, but at some point the maintainers changed their minds and reverted to how 1.9 worked. In the interim, some Rails gem maintainers monkey-patched Net::HTTP to make their gems work on both 1.9 and 2.0. Those monkey patches still linger in older versions of gems and cause issues like the one I saw upgrading from 1.9 to 2.1.

A summary of the issue and solution here:

http://avi.io/blog/2013/12/17/do-not-upgrade-your-rails-project-to-ruby-2-before-you-read-this/

We use the gem right_aws, and the details of that issue with Ruby versions are here:

https://github.com/sferik/twitter/issues/473

The solution was to undo the monkey patch using this as a gem reference in our Gemfile:

gem 'right_http_connection', git: 'git://github.com/rightscale/right_http_connection.git', ref: '3359524d81'

Background reading and more info:

https://github.com/rightscale/right_aws/issues/167

HTML is read before fully loaded using open-uri and nokogiri

What you describe is not possible. The result of open will only be passed to Nokogiri::HTML after the open method has returned the full value.

I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In this case, you may use Watir to fetch the page using a real browser:

require 'nokogiri'
require 'watir'

browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'

doc = Nokogiri::HTML.parse(browser.html)

This might open a browser window though.
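If the visible window is a problem, recent versions of Watir can drive a headless browser (this assumes Chrome and a reasonably current Watir are installed):

browser = Watir::Browser.new :chrome, headless: true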

Data scraping with Nokogiri and Pismo

See Phrogz's answer here: Nokogiri, open-uri, and Unicode Characters, which I think correctly describes what is happening for you. In summary, for some reason there is an issue passing the IO object created by open-uri directly to Nokogiri. Instead, read the document in as a string and give that to Nokogiri, i.e.:

require 'nokogiri'
require 'open-uri'

open("https://www.youtube.com/watch?v=QXAwnMxlE2Q") {|f|
p f.content_type # "text/html"
p f.charset # "UTF-8"
p f.content_encoding # []
}

doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q"))
puts doc.title.to_s # => NTV interview foreigners in Japan æ¥ãã¬å¤äººè¡é ­ã¤ã³ã¿ãã¥ã¼ English Subtitles è±èªå­å¹ - YouTube


doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q").read)
puts doc.title.to_s # => NTV interview foreigners in Japan 日テレ外人街頭インタビュー English Subtitles 英語字幕 - YouTube

If you know the content is always going to be UTF-8, you could of course do this:

doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q"), nil, "UTF-8")

