Converting UTF8 to ANSI with Ruby

Converting UTF8 to ANSI with Ruby


ascii_str = yourUTF8text.unpack("U*").map{|c|c.chr}.join

This assumes that your text really does fit in the ASCII character set. Integer#chr only accepts code points 0 to 255 and raises RangeError for anything larger.
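On Ruby 1.9 and later the same conversion can also be written with String#encode, which lets you decide what happens to characters that have no single-byte equivalent. This is a minimal sketch, assuming Windows-1252 is the "ANSI" code page you want; the method name utf8_to_ansi is made up for illustration.

```ruby
# Hypothetical helper: convert a UTF-8 string to Windows-1252 ("ANSI").
# Characters with no Windows-1252 equivalent are replaced by "?"
# instead of raising Encoding::UndefinedConversionError.
def utf8_to_ansi(text, target = 'Windows-1252')
  text.encode(target, :invalid => :replace, :undef => :replace, :replace => '?')
end

ansi = utf8_to_ansi("caf\u00E9")   # "café" in UTF-8
ansi.bytes.to_a                    # => [99, 97, 102, 233]
```

Dropping the replacement options makes the call raise on unconvertible characters, which may be preferable if silent data loss is a concern.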

Converting ANSI to UTF8 with Ruby

If your data is in the ASCII range 0 to 0x7F, it's already valid UTF-8, so you don't need to do anything.

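Or, if there are characters above 0x7F, you could use Iconv, naming the encoding the data is actually in (ISO-8859-1 is assumed here, since 'ascii' cannot represent bytes above 0x7F):

require 'iconv'
text = Iconv.conv('UTF-8', 'ISO-8859-1', text)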
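Iconv was removed from the standard library in Ruby 2.0, so on 1.9 and later String#encode is the more portable route. The sketch below assumes the bytes are Windows-1252; the helper name ansi_to_utf8 is hypothetical.

```ruby
# Hypothetical helper: treat raw bytes as Windows-1252 and transcode them
# to UTF-8. The source encoding is an assumption; pass the real one if known.
def ansi_to_utf8(bytes, source = 'Windows-1252')
  # dup so the caller's string keeps its original encoding tag
  bytes.dup.force_encoding(source).encode('UTF-8')
end

utf8 = ansi_to_utf8("caf\xE9")
utf8.bytes.to_a   # => [99, 97, 102, 195, 169]
```

force_encoding only relabels the bytes; encode is the step that actually rewrites them.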

How to read an ANSI text file and convert strings to UTF-8 in Ruby 1.9?

The character (é, byte 0xE9 in the samples below) is part of the ISO-8859-1 and Windows-1252 character sets, among others. Windows-1252 is probably the most popular character set on Windows, and is your most likely source.

RUBY_VERSION # => "1.9.2"

That's the Ruby version I used to run the following tests. Note that in the following samples the # encoding lines aren't ordinary comments; they're directives telling Ruby which character set to assume for string literals in the source file:

# encoding: Windows-1252

RUBY_VERSION # => "1.9.2"

asdf = "\xe9"
asdf.encoding # => #<Encoding:Windows-1252>
asdf.encode('UTF-8') # => "é"
asdf.encode('UTF-8').encoding # => #<Encoding:UTF-8>

This shows the character in ISO-8859-1:

# encoding: ISO-8859-1

RUBY_VERSION # => "1.9.2"

asdf = "\xe9"
asdf.encoding # => #<Encoding:ISO-8859-1>
asdf.encode('UTF-8') # => "é"
asdf.encode('UTF-8').encoding # => #<Encoding:UTF-8>
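The samples above work on string literals. For the file case in the question, one option is to let Ruby transcode while reading by giving File.open both an external and an internal encoding. This is a sketch under the assumption that the file was saved as Windows-1252; read_ansi_file is a hypothetical name.

```ruby
# Hypothetical helper: read a file saved as Windows-1252 and return UTF-8.
# "r:Windows-1252:UTF-8" means: read mode, external (on-disk) encoding,
# then the internal encoding the returned strings should have.
def read_ansi_file(path, source = 'Windows-1252')
  File.open(path, "r:#{source}:UTF-8") { |f| f.read }
end
```

Calling read_ansi_file('input.txt') (a placeholder path) would return a UTF-8 string, so no further encode call is needed afterwards.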

James Gray did a series of articles a couple years ago about dealing with this stuff. It's good reading.

Now, back to figuring out what character set a character could be in: when you only have one character, it is difficult to determine which set it belongs to, because it could be in several sets at once. If you have more characters >= "\x80", you can run through the character sets iconv supports and try converting them. That's messy, but I had to do that in Perl for some screen scraping about five years ago. An alternative is to use the Python chardet code.
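The "try each character set" idea can be sketched with Ruby's built-in encodings rather than iconv. The candidate list and the name plausible_encodings are assumptions for illustration; this only narrows the options and does not detect the encoding, since single-byte sets accept almost any byte.

```ruby
# Hypothetical helper: return the candidate encodings under which the raw
# bytes are valid and convertible to UTF-8, mapped to the converted text.
def plausible_encodings(bytes, candidates = %w[UTF-8 Windows-1252 ISO-8859-1])
  results = {}
  candidates.each do |name|
    tagged = bytes.dup.force_encoding(name)
    next unless tagged.valid_encoding?   # e.g. a lone 0xE9 is not valid UTF-8
    begin
      results[name] = tagged.encode('UTF-8')
    rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
      next   # these bytes have no meaning in this character set
    end
  end
  results
end

plausible_encodings("\xE9").keys   # => ["Windows-1252", "ISO-8859-1"]
```

Both single-byte candidates survive for the lone byte, which is exactly the ambiguity described above; a statistical detector such as chardet is needed to choose between them.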

James Gray's articles have a link to an article recommending rchardet.

The above routines mention Mozilla's Charset Detectors, which will give you more info on dealing with this.

Batch convert to UTF8 using Ruby


Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.

UTF-8 was designed to be a superset of ASCII, which means that every ASCII character is encoded as the same single byte in UTF-8. For this reason it's not possible to distinguish between ASCII and UTF-8 unless you have "special" (non-ASCII) characters. These special characters are represented using multiple bytes in UTF-8.

It's quite possible that your conversion is actually working, but you can double-check by trying your program with special characters.
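One way to make that double-check concrete is to look at the raw bytes of the output. The sketch below is an illustration only; classify_bytes is a hypothetical name, and the :other result merely means "not valid UTF-8", not a specific code page.

```ruby
# Hypothetical helper: classify raw bytes as :ascii, :utf8 or :other.
def classify_bytes(bytes)
  as_utf8 = bytes.dup.force_encoding('UTF-8')
  return :ascii if as_utf8.ascii_only?      # identical in ASCII and UTF-8
  return :utf8  if as_utf8.valid_encoding?  # contains valid multi-byte sequences
  :other                                    # probably a single-byte "ANSI" code page
end

classify_bytes("plain text")   # => :ascii
classify_bytes("\xC3\xA4")     # => :utf8  (ä as two bytes)
classify_bytes("\xE4")         # => :other (ä as one Latin-1 byte)
```

A file that comes back as :ascii will be reported as ANSI by many editors even after a successful conversion, because there is nothing in it to tell the two apart.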

Also, one of the best utilities for converting between encodings is iconv, which also has Ruby bindings (the Iconv standard library, removed in Ruby 2.0 and now available as the iconv gem).
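For the batch case, a loop over the files is enough. This sketch swaps in String#encode for the iconv bindings and assumes the inputs are Windows-1252; the helper name batch_to_utf8 and its parameters are hypothetical.

```ruby
require 'fileutils'

# Hypothetical helper: convert every file matching `pattern` from `source`
# encoding to UTF-8, writing the results into `out_dir`.
# The source encoding is an assumption and must match how the files were saved.
def batch_to_utf8(pattern, out_dir, source = 'Windows-1252')
  FileUtils.mkdir_p(out_dir)
  Dir.glob(pattern).each do |path|
    raw  = File.open(path, 'rb') { |f| f.read }         # raw bytes, no transcoding
    utf8 = raw.force_encoding(source).encode('UTF-8')   # relabel, then convert
    File.open(File.join(out_dir, File.basename(path)), 'wb') { |f| f.write(utf8) }
  end
end
```

Writing to a separate directory keeps the originals intact, which matters because running the conversion twice on the same file would double-encode it.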

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]

Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:

string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]

What you are missing is that you never had an ISO-8859-1 string to begin with, as you would from your web service; you had UTF-8 bytes mislabeled as ISO-8859-1, which is gibberish. Fortunately, this only affects your console tests; if you read the website's response using the proper input encoding, it should all work.
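If a string has already been double-encoded this way, the steps can be run in reverse. This is a sketch, assuming exactly one round of misreading as ISO-8859-1 (not Windows-1252); repair_double_encoding is a hypothetical name.

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes misread as ISO-8859-1".
# encode writes each character back as its single Latin-1 byte, and
# force_encoding relabels those bytes as the UTF-8 they originally were.
def repair_double_encoding(garbled)
  garbled.encode('ISO-8859-1').force_encoding('UTF-8')
end

repaired = repair_double_encoding("\u00C3\u00A4")   # "Ã¤"
repaired.bytes.to_a                                  # => [195, 164], i.e. "ä"
```

The encode step raises Encoding::UndefinedConversionError if the string contains characters outside ISO-8859-1, which is a useful sign that the input was not garbled in this particular way.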

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"

EDIT: For your specific problem, this should work:

require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE disables certificate checking; acceptable for a quick test only
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# Relabel the raw bytes as ISO-8859-1, then transcode them to UTF-8
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
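Rather than hard-coding ISO-8859-1, the charset could be taken from the Content-Type header when the server sends one. The following is a sketch only; body_to_utf8 is a hypothetical helper, and the ISO-8859-1 fallback is an assumption carried over from the answer above.

```ruby
# Hypothetical helper: convert a response body to UTF-8 using the charset
# named in a Content-Type header value, falling back to ISO-8859-1.
def body_to_utf8(body, content_type, fallback = 'ISO-8859-1')
  name = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1] || fallback
  source = begin
    Encoding.find(name)
  rescue ArgumentError
    Encoding.find(fallback)   # unknown charset label
  end
  body.dup.force_encoding(source).encode('UTF-8')
end

body_to_utf8("Norrlandsv\xE4gen", 'text/xml; charset=iso-8859-1')
# => "Norrlandsvägen"
```

With the response object above this would be called as body_to_utf8(response.body, response['Content-Type']).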

