Specify default charset using open-uri but use server-provided charset if given
OpenURI::Meta#charset accepts a block whose value is returned only if the server did not specify a charset. Using that, we can set the encoding of the StringIO returned by open to either the encoding it already had (redundantly) or to our default:
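require 'open-uri'

# Kernel#open no longer handles URLs as of Ruby 3.0; URI.open (Ruby 2.5+) does.
URI.open('http://localhost:3333').tap do |io|
  charset = io.charset { 'utf-8' }
  io.set_encoding(charset)
end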
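To see both branches without depending on a real site, here is a minimal sketch. It assumes Ruby 2.5+ (for URI.open) and that a loopback socket is available. serve_once and fetch_with_default are hypothetical helpers invented for this illustration; they are not part of open-uri.

```ruby
require 'open-uri'
require 'socket'

# Hypothetical helper: answers exactly one HTTP request on a random local
# port with the given Content-Type, then shuts down. Returns the port.
def serve_once(content_type, body)
  server = TCPServer.new('127.0.0.1', 0)
  port = server.addr[1]
  Thread.new do
    client = server.accept
    # Skip the request line and headers (they end with an empty line).
    while (line = client.gets) && line != "\r\n"; end
    client.write("HTTP/1.1 200 OK\r\n" \
                 "Content-Type: #{content_type}\r\n" \
                 "Content-Length: #{body.bytesize}\r\n" \
                 "Connection: close\r\n\r\n")
    client.write(body)
    client.close
    server.close
  end
  port
end

# Hypothetical helper: the same idea as the snippet above, wrapped in a method.
# proxy: false keeps any http_proxy environment setting out of the way.
def fetch_with_default(url, default)
  URI.open(url, proxy: false) do |io|
    io.set_encoding(io.charset { default })
    io.read
  end
end

# Server sends no charset: the block's value ('utf-8') is used.
port = serve_once('text/html', "caf\u00E9".b)
no_charset = fetch_with_default("http://127.0.0.1:#{port}/", 'utf-8')

# Server sends a charset: the block is never called.
port = serve_once('text/html; charset=ISO-8859-1', "caf\xE9".b)
with_charset = fetch_with_default("http://127.0.0.1:#{port}/", 'utf-8')

puts no_charset.encoding    # UTF-8
puts with_charset.encoding  # ISO-8859-1
```

Without the block, open-uri would fall back to ISO-8859-1 for text/* responses served over HTTP, which is why the first request needs the explicit default.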
Encoding::UndefinedConversionError when using open-uri
In the introduction to the open-uri module, the docs say this:
It is possible to open an http, https or ftp URL as though it were a file
And if you know anything about reading files, you know that you need the encoding of the file you are trying to read. You need the encoding so that you can tell Ruby how to read the file (i.e. how many bytes each character occupies).
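As a small illustration of "how many bytes each character occupies", the sketch below encodes one character three ways. The choice of ä and of these three encodings is only an example.

```ruby
# ä, written as an escape so the source file's own encoding does not matter.
char = "\u00E4"

utf8    = char.encode('UTF-8')
latin1  = char.encode('ISO-8859-1')
utf16be = char.encode('UTF-16BE')

p utf8.bytes     # [195, 164]
p latin1.bytes   # [228]
p utf16be.bytes  # [0, 228]

# All three hold the same single character; only the byte layout differs.
p [utf8, latin1, utf16be].map(&:length)  # [1, 1, 1]
```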
In the first code example in the docs, there is this:
open("http://www.ruby-lang.org/en") {|f|
  f.each_line {|line| p line}
  p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type     # "text/html"
  p f.charset          # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
}
So if you don't know the encoding of the "file" you are trying to read, you can get it with f.charset. If that encoding is different from your default external encoding, you will most likely get an error. Your default external encoding is the encoding Ruby uses to read from external sources. You can check what your default external encoding is set to like this:
The default external Encoding is pulled from your environment... Have a look:
$ echo $LC_CTYPE
en_US.UTF-8
or
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings
On Mac OS X, I actually have to do the following to see the default external encoding:
$ echo $LANG
You can set your default external encoding with the method Encoding.default_external=(), so you might want to try something like this:
require 'open-uri'

# On Ruby 3.0+ use URI.open; plain open no longer accepts URLs.
URI.open('some_url_here') do |f|
  Encoding.default_external = f.charset
  html = f.read
end
Setting an IO object to binmode, like you have done, tells Ruby that the encoding of the file is BINARY (or Ruby's confusing synonym ASCII-8BIT), which means you are telling Ruby that each character in the file takes up one byte. In your case, you are telling Ruby to read the character U+00A0, whose UTF-8 representation takes up two bytes (0xC2 0xA0), as two characters instead of one. So you have eliminated your error, but you have produced two junk characters instead of the original character.
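The effect can be reproduced without any network access. This is a minimal sketch that uses String#b to stand in for data read in binmode. The variable names are illustrative only.

```ruby
nbsp   = "\u00A0"  # NO-BREAK SPACE as a UTF-8 string
binary = nbsp.b    # a copy with the same bytes, tagged ASCII-8BIT (BINARY)

p nbsp.bytes       # [194, 160]
p nbsp.length      # 1
p binary.bytes     # [194, 160]
p binary.length    # 2

# The bytes were never damaged; relabelling them as UTF-8 restores the character.
restored = binary.dup.force_encoding('UTF-8')
p restored == nbsp # true
```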
Ruby converting string encoding from ISO-8859-1 to UTF-8 not working
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
EDIT: For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE skips certificate verification; avoid it outside of testing.
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# The body holds ISO-8859-1 bytes: label them as such, then transcode to UTF-8.
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
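The last line can be tried offline. The sketch below assumes the response body arrives as ISO-8859-1 bytes without a useful encoding tag. The variable raw is a made-up stand-in for response.body.

```ruby
# Stand-in for response.body: ISO-8859-1 bytes tagged as binary.
raw = "Norrlandsv\xE4gen".b

# Read as UTF-8, these bytes are invalid: 0xE4 opens a multi-byte sequence
# that the following "g" does not continue.
p raw.dup.force_encoding('UTF-8').valid_encoding?  # false

# Label the bytes correctly first, then transcode.
body = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
p body == "Norrlandsv\u00E4gen"  # true
p body.bytesize                  # 15
```

force_encoding only changes the label, so it has to come before encode, which is the step that actually rewrites the bytes.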
Ruby 1.9 iso-8859-8-i encoding
When Ruby or the OS has assigned the wrong encoding to your input, conversions will not work. That's because Ruby starts from the wrong assumption and tries to preserve the wrong characters when converting.
However, if you know from some other source what the correct encoding is, you can use the force_encoding method to tell Ruby how to interpret the bytes it has loaded into a String. Note this alters the object in place.
E.g.
contents = final.body
contents.force_encoding( 'ISO-8859-8' )
puts contents
At this point (provided it works), you now can make conversions (to e.g. UTF-8), because Ruby has been correctly told what characters it is dealing with.
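A minimal sketch of that sequence, assuming the data really is ISO-8859-8. The four bytes below are a made-up stand-in for final.body and spell the Hebrew word "shalom".

```ruby
# Stand-in for final.body: ISO-8859-8 bytes that Ruby only knows as binary.
raw = "\xF9\xEC\xE5\xED".b

# Converting straight from the wrong starting point fails.
begin
  raw.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  puts "without force_encoding: #{e.class}"
end

# Tell Ruby what the bytes are, then convert.
hebrew = raw.dup.force_encoding('ISO-8859-8')
utf8   = hebrew.encode('UTF-8')

p utf8 == "\u05E9\u05DC\u05D5\u05DD"  # true
p utf8.length                         # 4
```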
I could not find 'ISO-8859-8-I' on my version of Ruby. I am not sure yet how close 'ISO-8859-8' is to what you need (some Googling suggests that it may be OK for you, if the ...-I encoding is not available).
Ruby 2: Detect encoding from binary ASCII-8BIT data
I had a quick google and found the Charlock Holmes gem by Brian Lopez. It looks like it does the detection process you're after.
https://github.com/brianmario/charlock_holmes
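If installing a gem is not an option, a much cruder check is possible with the standard library alone. The sketch below is not what Charlock Holmes does (that gem relies on ICU's detector). It only tests which candidate encodings the bytes are valid in. guess_encoding and its candidate list are assumptions made up for this example.

```ruby
# Hypothetical helper: returns the name of the first candidate encoding in
# which the bytes form a valid string, or nil if none fits.
# Order matters: ISO-8859-1 accepts any byte sequence, so it must come last.
def guess_encoding(bytes, candidates = %w[US-ASCII UTF-8 ISO-8859-1])
  candidates.find do |name|
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

p guess_encoding("plain".b)      # "US-ASCII"
p guess_encoding("caf\u00E9".b)  # "UTF-8"
p guess_encoding("caf\xE9".b)    # "ISO-8859-1"
```

Validity is a weak signal: it cannot tell ISO-8859-1 from other single-byte encodings, which is exactly the gap a statistical detector is meant to fill.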
display iso-8859-1 encoded data gives strange characters
Found myself an answer by trying different things from the documentation:
require 'csv'
filename = File.expand_path('~/myfile.csv')
File.open(filename, "r:ISO-8859-1") do |file|
  CSV.parse(file.read.encode("UTF-8"), col_sep: "\t") do |row|
    # ↳ returns a copy transcoded to UTF-8.
    puts row
  end
end
As you can see, all I have done is encode the string to a UTF-8 string before the CSV parser gets it.
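A possible variation, sketched here under the assumption that the file really is ISO-8859-1, is to let File.open transcode while reading by giving both an external and an internal encoding. The temporary file and its contents are invented for the example, and plain split stands in for the CSV parser to keep the sketch dependency-free. The resulting UTF-8 string could equally be passed to CSV.parse.

```ruby
require 'tempfile'

rows = nil
Tempfile.create('latin1') do |tmp|
  # Write tab-separated ISO-8859-1 bytes (0xF8 is ø in that encoding).
  tmp.binmode
  tmp.write("name\tcity\nJ\xF8rgen\tK\xF8benhavn\n".b)
  tmp.flush

  # "r:ISO-8859-1:UTF-8" means: the file is ISO-8859-1, give me UTF-8 strings.
  File.open(tmp.path, 'r:ISO-8859-1:UTF-8') do |file|
    text = file.read
    rows = text.lines.map { |line| line.chomp.split("\t") }
  end
end

p rows.last == ["J\u00F8rgen", "K\u00F8benhavn"]  # true
puts rows.last.map(&:encoding).uniq               # UTF-8
```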
Edit:
Trying this solution on macruby-head, I get the following error message from encode():
Encoding::InvalidByteSequenceError: "\xD8" on UTF-8
Even though I specify the encoding when opening the file, MacRuby uses UTF-8.
This seems to be a known MacRuby limitation: Encoding is always UTF-8