open-uri Returning ASCII-8BIT from Webpage Encoded in ISO-8859

Specify a default charset using open-uri, but use the server-provided charset if given

OpenURI::Meta#charset accepts a block; the block's return value is used as the charset only if the server did not specify one.

Using that information, we can set the encoding of the StringIO returned by open to either the same encoding it had (redundantly) or to our default:

require 'open-uri'

open('http://localhost:3333').tap do |io|
  charset = io.charset { 'utf-8' }  # block value is used only when the server sent no charset
  io.set_encoding(charset)
end
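The same pattern in block form, returning a string tagged with the right encoding (a sketch; the localhost URL is a stand-in for any server):

require 'open-uri'

html = open('http://localhost:3333') do |io|
  io.set_encoding(io.charset { 'utf-8' })  # fall back to UTF-8 only if no charset was sent
  io.read
end
html.encoding  # => the server's charset, or Encoding::UTF-8 if none was given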

Encoding::UndefinedConversionError when using open-uri

In the introduction to the open-uri module, the docs say this:

It is possible to open an http, https or ftp URL as though it were a file

And if you know anything about reading files, then you know you need the encoding of the file you are trying to read. You need the encoding so that you can tell Ruby how to read the file (i.e. how many bytes, or how much space, each character occupies).

In the first code example in the docs, there is this:

  open("http://www.ruby-lang.org/en") {|f|
f.each_line {|line| p line}
p f.base_uri # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
p f.content_type # "text/html"
p f.charset # "iso-8859-1"
p f.content_encoding # []
p f.last_modified # Thu Dec 05 02:45:02 UTC 2002
}

So if you don't know the encoding of the "file" you are trying to read, you can get it with f.charset. If that encoding is different from your default external encoding, you will most likely get an error. Your default external encoding is the encoding Ruby uses to read from external sources, and it is pulled from your environment. Have a look:

$ echo $LC_CTYPE
en_US.UTF-8

or

$ ruby -e 'puts Encoding.default_external.name'
UTF-8

http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings

On Mac OS X, I actually have to do the following to see the default external encoding:

$ echo $LANG
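To see the mismatch this answer is about directly, here is a sketch (reusing the docs' example URL) that compares the server-declared charset against your default external encoding:

require 'open-uri'

open('http://www.ruby-lang.org/en') do |f|
  f.charset                       # => "iso-8859-1" (declared by the server)
  Encoding.default_external.name  # => "UTF-8" (a mismatch waiting to raise)
end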

You can set your default external encoding with the method Encoding.default_external=, so you might want to try something like this:

open('some_url_here') do |f|
  Encoding.default_external = f.charset  # adopt the server-declared charset
  html = f.read
end

Setting an IO object to binmode, like you have done, tells Ruby that the encoding of the file is BINARY (or Ruby's confusing synonym for it, ASCII-8BIT), which means you are telling Ruby that each character in the file takes up one byte. In your case, you are telling Ruby to read the character U+00A0, whose UTF-8 representation takes up two bytes, 0xC2 0xA0, as two characters instead of one. You have eliminated your error, but you have produced two junk characters instead of the original character.
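A minimal irb sketch of that failure mode, assuming a UTF-8 default encoding:

bytes = "\xC2\xA0".force_encoding('ASCII-8BIT')  # U+00A0 as raw binary
bytes.length                   # => 2 (two one-byte junk characters)
bytes.encode('UTF-8')          # raises Encoding::UndefinedConversionError
bytes.force_encoding('UTF-8')  # => "\u00A0", the original single character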

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# => 1
string.bytes
# => [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# => 2
string.bytes
# => [195, 164]

Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:

string = string.encode('utf-8')
# => "Ã¤"
string.length
# => 2
string.bytes
# => [195, 131, 194, 164]

What you are missing is that you originally don't have an ISO-8859-1 string, as you would from your web service: you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work fine.

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"

EDIT: For your specific problem, this should work:

require 'net/https'

uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')

Ruby 1.9 iso-8859-8-i encoding

When Ruby or the OS has incorrectly assigned an encoding to your input, conversions will not work. That's because Ruby starts with the wrong assumption and tries to preserve the wrong characters when converting.

However, if you know from some other source what the correct encoding is, you can use the force_encoding method to tell Ruby how to interpret the bytes it has loaded into a String. Note that this alters the object in place.

E.g.

contents = final.body
contents.force_encoding('ISO-8859-8')
puts contents

At this point (provided it works), you can now make conversions (to e.g. UTF-8), because Ruby has been correctly told what characters it is dealing with.
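For example, continuing the snippet above (assuming the bytes really are ISO-8859-8):

contents = contents.encode('UTF-8')  # a valid conversion now that the source encoding is right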

I could not find 'ISO-8859-8-I' in my version of Ruby. I am not sure how close 'ISO-8859-8' is to what you need (some Googling suggests it may be OK for you, if the ...-I variant is not available).
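You can check which encodings your Ruby build knows about by querying the Encoding class; for example:

Encoding.name_list.grep(/8859-8/i)
# => ["ISO-8859-8", "ISO8859-8", ...] (the exact list varies by Ruby build)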

Ruby 2: Detect encoding from binary ASCII-8BIT data

I had a quick Google and found the Charlock Holmes gem by Brian Lopez. It looks like it does the detection you're after.

https://github.com/brianmario/charlock_holmes
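A sketch of its use, based on the gem's README (the filename is a hypothetical stand-in for your binary data):

require 'charlock_holmes'

contents = File.read('mystery_bytes.xml', mode: 'rb')  # hypothetical input file
detection = CharlockHolmes::EncodingDetector.detect(contents)
# detection is a Hash like {:type => :text, :encoding => "ISO-8859-8", :confidence => 70}
contents.force_encoding(detection[:encoding]) if detection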

display iso-8859-1 encoded data gives strange characters

I found an answer myself by trying different things from the documentation:

require 'csv'

filename = File.expand_path('~/myfile.csv')
File.open(filename, "r:ISO-8859-1") do |file|
  CSV.parse(file.read.encode("UTF-8"), col_sep: "\t") do |row|
    # encode("UTF-8") returns a copy transcoded to UTF-8
    puts row
  end
end

As you can see, all I have done is encode the string to UTF-8 before the CSV parser gets it.
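An equivalent approach, sketched under the same assumptions (a tab-separated file in ISO-8859-1), is to let the IO layer transcode for you with an external:internal mode string, so the string is already UTF-8 by the time CSV sees it:

require 'csv'

filename = File.expand_path('~/myfile.csv')
File.open(filename, "r:ISO-8859-1:UTF-8") do |file|  # read ISO-8859-1 bytes, hand back UTF-8
  CSV.parse(file.read, col_sep: "\t") do |row|
    puts row
  end
end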


Edit:
Trying this solution on macruby-head, I get the following error message from encode():

Encoding::InvalidByteSequenceError: "\xD8" on UTF-8

Even though I specify the encoding when opening the file, MacRuby uses UTF-8. This seems to be a known MacRuby limitation: Encoding is always UTF-8.


