Converting UTF8 to ANSI with Ruby
ascii_str = yourUTF8text.unpack("U*").map{|c|c.chr}.join
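This assumes that your text really does fit in the ASCII character set. Code points from 0x80 to 0xFF come out as single raw bytes instead, and anything above 0xFF raises a RangeError.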
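On Ruby 1.9 and later, String#encode is another way to do this, and it lets you decide what happens to characters that have no ASCII equivalent. A minimal sketch, where the helper name to_ascii and the '?' replacement are my own assumptions:

```ruby
# Hypothetical helper: transcode UTF-8 text to US-ASCII, substituting a
# replacement string for every character that ASCII cannot represent.
def to_ascii(utf8_text, replacement = '?')
  utf8_text.encode('US-ASCII', invalid: :replace, undef: :replace, replace: replacement)
end

plain = to_ascii("caf\u00E9 menu")   # "\u00E9" is the character "é"
puts plain                           # prints "caf? menu"
puts plain.encoding                  # prints "US-ASCII"
```

Without the :replace options, encode raises Encoding::UndefinedConversionError on the first character that does not fit, which may be what you want if silent data loss is a concern.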
Converting ANSI to UTF8 with Ruby
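If every byte of your data is in the ASCII range 0 to 0x7F, it's already valid UTF-8, so you don't need to do anything.
Or, if there are characters above 0x7F, you could use Iconv. Name the encoding the data is really in as the source (Windows-1252 below is only an example), because bytes above 0x7F are not valid ASCII:
require 'iconv'
text = Iconv.conv('UTF-8', 'WINDOWS-1252', text)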
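Iconv was dropped from Ruby's standard library in 2.0, so on current Rubies the built-in String methods are the more likely route. A rough sketch, assuming the bytes are Windows-1252 (the helper name ansi_to_utf8 and the default encoding are illustrative, not from the question):

```ruby
# Hypothetical helper: label raw bytes with their real encoding, then
# transcode them to UTF-8. The source encoding is an assumption; substitute
# whatever your data actually uses.
def ansi_to_utf8(bytes, source_encoding = 'Windows-1252')
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

raw  = "caf\xE9".b              # the four bytes 63 61 66 E9
utf8 = ansi_to_utf8(raw)
puts utf8.bytes.inspect         # prints [99, 97, 102, 195, 169]
puts utf8.valid_encoding?       # prints true
```

force_encoding only changes the label on the bytes, while encode rewrites the bytes themselves, so the order of the two calls matters.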
How to read an ANSI text file and convert strings to UTF-8 in Ruby 1.9?
The character (é, byte 0xE9 in the samples below) is part of the ISO-8859-1 and Win-1252 character sets, among others. The second is probably the most popular character set for Windows, and is your most likely source.
RUBY_VERSION # => "1.9.2"
That's my Ruby version running the following tests. Note that in the following samples the # encoding lines aren't ordinary comments; they're directives telling Ruby which character set to use for the source file, and so for string literals containing raw bytes:
# encoding: Windows-1252
RUBY_VERSION # => "1.9.2"
asdf = "\xe9"
asdf.encoding # => #<Encoding:Windows-1252>
asdf.encode('UTF-8') # => "é"
asdf.encode('UTF-8').encoding # => #<Encoding:UTF-8>
This shows the character in ISO-8859-1:
# encoding: ISO-8859-1
RUBY_VERSION # => "1.9.2"
asdf = "\xe9"
asdf.encoding # => #<Encoding:ISO-8859-1>
asdf.encode('UTF-8') # => "é"
asdf.encode('UTF-8').encoding # => #<Encoding:UTF-8>
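The samples above work on string literals. For the file case in the question, one option is to let Ruby transcode while reading by giving both an external and an internal encoding. A sketch under the assumption that the file is Windows-1252; the helper name read_ansi_as_utf8 and the temporary demo file are made up for illustration:

```ruby
require 'tempfile'

# Hypothetical helper: read a file stored as Windows-1252 (an assumption)
# and get its contents back as UTF-8 text.
def read_ansi_as_utf8(path, source_encoding = 'Windows-1252')
  # "external:internal" means: the bytes on disk are in the first encoding,
  # hand them to the program converted to the second.
  File.read(path, encoding: "#{source_encoding}:UTF-8")
end

# Demo with a temporary file holding the single byte 0xE9.
Tempfile.create('ansi-demo') do |file|
  file.binmode
  file.write("\xE9".b)
  file.flush
  text = read_ansi_as_utf8(file.path)
  puts text.encoding        # prints "UTF-8"
  puts text.bytes.inspect   # prints [195, 169]
end
```

The same "external:internal" pair can be passed in the mode string, as in File.open(path, 'r:Windows-1252:UTF-8').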
James Gray did a series of articles a couple years ago about dealing with this stuff. It's good reading.
Now, back to trying to figure out what character set a character could be in: when you only have one character, it is difficult to determine which set it belongs to, because it could be in several sets at once. If you have more characters >= "\x80", then you can run through the character sets iconv supports and try converting them. That's messy, but I had to do that in Perl for some screen scraping about five years ago. An alternative is to use the Python chardet code.
James Gray's articles have a link to an article recommending rchardet.
The above routines mention Mozilla's Charset Detectors, which will give you more info on dealing with this.
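The "run through the character sets and try converting" idea can be sketched in plain Ruby without iconv. This is not a real detector like chardet: the candidate list, its order, and the helper name guess_and_convert are all assumptions, and single-byte encodings accept nearly any byte sequence, so the first match is only a guess.

```ruby
# Assumed candidate list; strict multi-byte encodings go first because
# single-byte ones accept almost anything.
CANDIDATE_ENCODINGS = %w[UTF-8 Windows-1252 ISO-8859-1].freeze

# Hypothetical helper: return [encoding_name, utf8_text] for the first
# candidate under which the bytes are valid and convert cleanly, or nil.
def guess_and_convert(bytes, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |name|
    attempt = bytes.dup.force_encoding(name)
    next unless attempt.valid_encoding?
    begin
      return [name, attempt.encode('UTF-8')]
    rescue EncodingError
      next  # this candidate cannot represent some byte; try the next one
    end
  end
  nil
end

puts guess_and_convert("\xC3\xA9".b).first   # prints "UTF-8"
puts guess_and_convert("\xE9".b).first       # prints "Windows-1252"
```

A lone 0xE9 byte is not valid UTF-8, which is why the second call falls through to the next candidate.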
Batch convert to UTF8 using Ruby
Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that every ASCII character is encoded identically in UTF-8. For this reason it's not possible to distinguish between ASCII and UTF-8 unless you have "special" (non-ASCII) characters. These special characters are represented using multiple bytes in UTF-8.
It's quite possible that your conversion is actually working, but you can double-check by trying your program with special characters.
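A quick way to see this from irb, as a small sketch (the sample strings are arbitrary):

```ruby
ascii_text   = "plain text"
special_text = "Norrlandsv\u00E4gen"   # "\u00E4" is the character "ä"

puts ascii_text.ascii_only?     # prints true
puts special_text.ascii_only?   # prints false

# ASCII-only text has exactly the same bytes in ISO-8859-1 and in UTF-8,
# so an editor cannot tell the two files apart.
puts ascii_text.encode('ISO-8859-1').bytes == ascii_text.bytes   # prints true

# A character such as "ä" takes one byte in ISO-8859-1 and two in UTF-8.
puts special_text.encode('ISO-8859-1').bytesize   # prints 14
puts special_text.bytesize                        # prints 15
```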
Also, one of the best utilities for converting between encodings is iconv, which also has Ruby bindings.
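If you would rather stay in pure Ruby for the batch job, something along these lines might do. It is only a sketch: it assumes every file that is not already valid UTF-8 is Windows-1252, it overwrites files in place, and the helper name convert_files_to_utf8 is made up.

```ruby
# Hypothetical helper: rewrite every file matching a glob pattern as UTF-8.
# Returns the list of paths that were actually converted.
def convert_files_to_utf8(pattern, source_encoding = 'Windows-1252')
  Dir.glob(pattern).select do |path|
    raw = File.binread(path)
    # Skip files whose bytes already form valid UTF-8 (this includes plain ASCII).
    next false if raw.dup.force_encoding('UTF-8').valid_encoding?
    # Label the bytes with the assumed source encoding, transcode, write back.
    File.binwrite(path, raw.force_encoding(source_encoding).encode('UTF-8'))
    true
  end
end
```

A call would look like convert_files_to_utf8('some_dir/*.txt'), with the pattern being whatever matches your files. Running it on copies first is a sensible precaution, since the originals are overwritten.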
Ruby converting string encoding from ISO-8859-1 to UTF-8 not working
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
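If you are already stuck with a string that went through the wrong steps above, running them backwards usually recovers it. A sketch, assuming exactly one round of mis-decoding (the helper name repair_double_encoding is my own):

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes mis-read as ISO-8859-1
# and then transcoded to UTF-8".
def repair_double_encoding(garbled)
  # encode turns each character back into its single ISO-8859-1 byte;
  # force_encoding then relabels those bytes as the UTF-8 they always were.
  garbled.encode('ISO-8859-1').force_encoding('UTF-8')
end

garbled = "\u00C3\u00A4"     # the two characters "Ã¤"
fixed   = repair_double_encoding(garbled)
puts fixed.length            # prints 1
puts fixed.bytes.inspect     # prints [195, 164]
```

Checking fixed.valid_encoding? afterwards is a cheap guard against applying the repair to text that was never garbled this way.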
EDIT: For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE switches off certificate checking; acceptable for a quick test only.
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# The body arrives as raw bytes: label them as ISO-8859-1, then transcode to UTF-8.
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')