Transliteration with Iconv in Ruby

It seems the Iconv solution is too tricky for me. I solved the problem using the stringex gem instead.

Transliteration in ruby

Ruby has an Iconv library in its stdlib which converts encodings in much the same way as the usual iconv command.

Are Iconv.convert return values in wrong order?

Running it through Unicode decomposition (as people kind of mentioned in the forum thread you linked to) seems to do it on my OS X:

iex> :iconv.convert "utf-8", "ascii//translit", String.normalize("árboles más grandes", :nfd)
"arboles mas grandes"

Decomposition means it will be normalized so that e.g. "á" is represented as two Unicode codepoints ("a" and a combining accent) as opposed to a composed form where it's a single Unicode codepoint. So I guess iconv's ASCII transliteration removes standalone accents/diacritics, but converts composed characters to things like 'a.
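
The same decomposition idea can be sketched in Ruby without iconv. This is only an illustration, assuming Ruby 2.2 or later (for String#unicode_normalize); the helper name strip_diacritics is made up for this example. It decomposes the text to NFD and then drops the combining marks itself rather than relying on iconv to do so.

```ruby
# Hypothetical helper: decompose to NFD, then remove combining marks
# (Unicode category Mn), leaving only the base letters.
def strip_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/u, "")
end

composed = "\u00E1" # "á" as a single code point
decomposed = composed.unicode_normalize(:nfd)

p composed.codepoints   # => [225]
p decomposed.codepoints # => [97, 769], i.e. "a" plus a combining acute accent

# "\u00E1" is "á", so this is "árboles más grandes"
puts strip_diacritics("\u00E1rboles m\u00E1s grandes") # => arboles mas grandes
```

Characters that have no decomposition (for example "ø" or "ß") pass through unchanged with this approach, so it is not a full transliteration.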

Iconv and Kconv on Ruby (1.9.2)

As https://stackoverflow.com/users/23649/jtbandes says, it looks like Kconv is similar to Iconv but specialized for Kanji ("the logographic Chinese characters that are used in the modern Japanese writing system along with hiragana" http://en.wikipedia.org/wiki/Kanji). Unless you are working on something specifically Japanese, I'm guessing you don't need Kconv.

If you're using Ruby 1.9, you can use the built-in encoding support most of the time instead of Iconv. I tried for hours to understand what I was doing until I read this:

http://www.joelonsoftware.com/articles/Unicode.html

Then you can start to use stuff like

String#encode          # Ruby 1.9
String#encode!         # Ruby 1.9
String#force_encoding  # Ruby 1.9

with confidence. If you have more complex needs, do read http://blog.grayproductions.net/categories/character_encodings
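
As a rough sketch of how these methods differ (the helper names below are made up for illustration): encode transcodes, so the characters stay the same while the bytes change, whereas force_encoding only relabels the bytes that are already there.

```ruby
# Transcode: same characters, different bytes.
def utf8_to_latin1(text)
  text.encode("ISO-8859-1")
end

# Relabel raw bytes that are known to be Latin-1, then transcode to UTF-8.
def latin1_bytes_to_utf8(bytes)
  bytes.pack("C*").force_encoding("ISO-8859-1").encode("UTF-8")
end

# Replace characters the target encoding cannot represent.
def to_ascii_lossy(text)
  text.encode("US-ASCII", undef: :replace, replace: "?")
end

cafe = "caf\u00E9" # "café" in UTF-8

p utf8_to_latin1(cafe).bytes                       # => [99, 97, 102, 233]
p latin1_bytes_to_utf8([99, 97, 102, 233]) == cafe # => true
p to_ascii_lossy(cafe)                             # => "caf?"
```

Note that the last call replaces "é" with "?" rather than turning it into "e".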

UPDATED Thanks to JohnZ in the comments

Iconv is still useful in Ruby 1.9 because it can transliterate characters (something that String#encode et al. can't do). Here's an example of how to extend String with a function that transliterates to UTF-8:

require 'iconv'
class ::String
  # Return a new String that has been transliterated into UTF-8
  # Should work in Ruby 1.8 and Ruby 1.9 thanks to http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
  def as_utf8(from_encoding = 'UTF-8')
    # A space is appended before converting and stripped again afterwards
    # (the workaround from the post linked above).
    ::Iconv.conv('UTF-8//TRANSLIT', from_encoding, self + ' ')[0..-2]
  end
end

"foo".as_utf8 #=> "foo"
"foo".as_utf8('ISO-8859-1') #=> "foo"

Thanks JohnZ!
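
Iconv was removed from the standard library in Ruby 2.0 (it lives on as the iconv gem). Where it is not available, a rough approximation of ASCII transliteration can be sketched with the fallback option of String#encode. String#encode has no transliteration table of its own, so the mapping below is a hypothetical, deliberately tiny table written for this example, and the helper name to_ascii_approx is likewise made up.

```ruby
# Hypothetical, deliberately incomplete mapping table; extend as needed.
ASCII_APPROXIMATIONS = {
  "\u00E1" => "a",  # á
  "\u00E9" => "e",  # é
  "\u00F1" => "n",  # ñ
  "\u00DF" => "ss"  # ß
}.freeze

# Convert to US-ASCII, substituting a hand-picked replacement for each
# character the target encoding lacks, or "?" when the table has none.
def to_ascii_approx(text)
  text.encode("US-ASCII",
              fallback: ->(char) { ASCII_APPROXIMATIONS.fetch(char, "?") })
end

puts to_ascii_approx("\u00E1rboles m\u00E1s grandes") # => arboles mas grandes
puts to_ascii_approx("stra\u00DFe")                   # => strasse
```

Unlike iconv's //TRANSLIT, this only knows the characters listed in the table, so it suits cases where the set of expected characters is small and known in advance.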

iconv utf-8 to ascii transliteration in mod_php/apache2

I figured out that the locale wasn't set up correctly, and my attempts to set it failed because the locales available on the system were actually named differently than in the manpage examples (according to their encoding!).
A simple locale -a revealed that ;O)

setlocale(LC_ALL, "en_US.utf8");

This actually did the job, and the function now works perfectly.

It's also clear now why it worked from the console: there the locale was imported from the current user's shell settings ;)
It actually just needs some locale to be set up. It doesn't really matter which one, as we convert to ASCII, where everybody is equal, only some are more equal than others :)

Be careful to set a locale that is actually installed on your system, and check the return value of setlocale, because nothing will change if the locale is not installed or its name is misspelled.


