Simple Conversion of String to UTF-8 in Ruby 1.8

James Edward Gray II has a detailed collection of posts dealing with encoding and character set issues in Ruby 1.8. The post entitled "Encoding Conversion with iconv" covers this conversion in depth.

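Summary: the iconv library does all the work of converting encodings. It ships with the Ruby 1.8 standard library; on Rubies where it is missing (it was removed from the standard library in Ruby 2.0), install the gem with: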

gem install iconv

Next, you need to know what encoding your string is currently in, because Ruby 1.8 treats a String as an array of bytes with no intrinsic encoding. For example, say your string is in Latin-1 and you want to convert it to UTF-8:

require 'iconv'

string_in_utf8_encoding = Iconv.conv("UTF-8", "LATIN1", string_in_latin1_encoding)

The order of arguments is:

  1. Target encoding
  2. Source encoding
  3. String to convert
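
If the same code also has to run on Ruby 1.9 or later, where strings carry an encoding, the Iconv call can be wrapped in a small helper. This is only a sketch: the helper name to_utf8 is made up for illustration, and the String#encode branch is an assumption about newer Rubies rather than something the post above covers.

```ruby
# Hypothetical helper (not from the original post): convert +str+ from the
# +from+ encoding to UTF-8 on either Ruby 1.8 or Ruby 1.9+.
def to_utf8(str, from)
  if str.respond_to?(:encode)
    # Ruby 1.9+: label the bytes with their real encoding, then transcode.
    str.dup.force_encoding(from).encode("UTF-8")
  else
    # Ruby 1.8: strings are raw bytes. Iconv takes target, source, string.
    require 'iconv'
    Iconv.conv("UTF-8", from, str)
  end
end

latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*")  # "café" as ISO-8859-1 bytes
utf8   = to_utf8(latin1, "ISO-8859-1")
p utf8.unpack("C*")  # the single byte 0xE9 becomes the two bytes 0xC3 0xA9
```

Building the input with pack("C*") keeps the example independent of the source file's own encoding.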

How to convert hex into UTF-8 in Ruby (1.8.7)?

Have you looked at String#unpack? http://ruby-doc.org/core/classes/String.src/M001112.html

For example (note that unpack returns an array):

"\x68\x65\x6c\x6c\x6f".unpack("Z*") --> ["hello"]
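
If the hex arrives as a plain string of digits such as "68656c6c6f" rather than as \x escapes, the "H*" directive of Array#pack does the conversion. A minimal sketch, assuming the digits spell valid UTF-8 bytes; the helper name hex_to_utf8 is made up for illustration:

```ruby
# Hypothetical helper: turn a string of hex digits into the bytes they spell.
def hex_to_utf8(hex)
  bytes = [hex].pack("H*")  # "H*" reads two hex digits per byte, high nibble first
  # On Ruby 1.9+ the packed result is tagged as binary, so label it as UTF-8.
  # On Ruby 1.8 there is no encoding tag and the bytes are returned as they are.
  bytes.respond_to?(:force_encoding) ? bytes.force_encoding("UTF-8") : bytes
end

greeting = hex_to_utf8("68656c6c6f")  # the bytes of "hello"
half     = hex_to_utf8("c2bd")        # the UTF-8 bytes of U+00BD
puts greeting
p half.unpack("U*")
```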

Handling string encoding with the same code in Ruby 1.8 and 1.9

I'm with Mike Lewis on using respond_to?, but don't call it on the variable res everywhere throughout your code.

I took a look at your code in gateway.rb and it looks like everywhere you are using res, it gets set by a call to make_api_request so you could add this before your return statement in that method:

doc = doc.force_encoding("UTF-8") if doc.respond_to?(:force_encoding) 

Even if the strings come from a few other places too, as long as it isn't literally every string you encounter, I'm sure you can refactor the code so that the problem is solved in one place instead of everywhere you run into it.
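
One way to keep the check in a single place is a tiny helper that the request method calls just before returning. This is a sketch only: the helper name tag_as_utf8 is hypothetical, and it assumes the API really does return UTF-8 bytes.

```ruby
# Hypothetical helper: label a string as UTF-8 where the Ruby version allows it.
def tag_as_utf8(str)
  # String#force_encoding exists only on Ruby 1.9+. On Ruby 1.8 strings have
  # no encoding tag, so the string is returned unchanged.
  str.respond_to?(:force_encoding) ? str.force_encoding("UTF-8") : str
end

body = [0xC2, 0xBD].pack("C*")  # stand-in for a response body: UTF-8 bytes of "½"
doc  = tag_as_utf8(body)
p doc.unpack("U*")
```

Note that force_encoding only relabels the bytes. It does not convert them, so it is the right tool only when the data is already UTF-8.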

Are you having a problem with other places?

How do pack and unpack guess the character encoding when converting to and from UTF-8?

This actually has nothing to do with how \xBD is represented in ISO-8859-x. The critical part is the pack into UTF-8.

The pack receives [189]. Code point 189 (U+00BD) is defined in Unicode as ½, and UTF-8 is just the byte encoding of that code point. Don't think of this as the Unicode spec writers "preferring" ISO-8859-1 over ISO-8859-9. They had to decide which code point represents ½, and they assigned it 189 (the first 256 Unicode code points match ISO-8859-1).

Since you're trying to learn more about pack/unpack, let me explain more:

When you unpack with the C directive, Ruby treats the string as raw bytes (ASCII-8BIT) and extracts each byte as an unsigned 8-bit integer. In this case \xBD becomes 0xBD, a.k.a. 189. This is a really basic conversion.

When you pack with the U directive, Ruby treats each integer in the array as a Unicode code point and writes out the UTF-8 byte sequence for it.
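
The round trip described above can be checked byte by byte. A minimal sketch, building the input with pack("C") so that it does not depend on the source file's encoding:

```ruby
byte_string = [0xBD].pack("C")          # one raw byte, 0xBD

# C: read each byte as an unsigned 8-bit integer.
code_points = byte_string.unpack("C*")  # => [189]

# U: treat each integer as a Unicode code point and emit its UTF-8 bytes.
utf8 = code_points.pack("U*")

p code_points          # => [189]
p utf8.unpack("C*")    # => [194, 189], i.e. 0xC2 0xBD, the UTF-8 form of U+00BD
```

No encoding is guessed at any point: C works on bytes and U works on code points, so the result is the same whichever ISO-8859-x table the original byte came from.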

pack and unpack behave very specifically depending on the directives you give them. I suggest reading up on them at ruby-doc.org. Some of the directives still don't make sense to me, so don't be discouraged.

Converting UTF8 to ANSI with Ruby

ascii_str = yourUTF8text.unpack("U*").map{|c|c.chr}.join

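This assumes that every character in your text has a code point below 256, i.e. that the text fits in a single-byte character set such as ASCII or ISO-8859-1. Integer#chr raises a RangeError for anything larger.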
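
When the input may contain characters outside that range, one option is to substitute a placeholder instead of letting chr raise. A sketch under that assumption; the helper name utf8_to_latin1 and the "?" placeholder are made up for illustration:

```ruby
# Hypothetical helper: map UTF-8 text to single bytes, replacing anything
# that does not fit in one byte (code point 256 or above).
def utf8_to_latin1(utf8, replacement = "?")
  utf8.unpack("U*").map { |cp| cp < 256 ? cp.chr : replacement }.join
end

cafe = [0x63, 0x61, 0x66, 0xE9].pack("U*")  # "café" in UTF-8
euro = [0x20AC].pack("U*")                  # "€", which has no single-byte form here

p utf8_to_latin1(cafe).unpack("C*")  # é comes out as the single byte 0xE9
p utf8_to_latin1(euro).unpack("C*")  # the placeholder "?" (byte 63)
```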
