Putting French (Accented) Characters in a Ruby File

How do I prevent French accent character truncation with Ruby 1.9, Rails 3.2, and MySQL?

Sorry for all the trouble. I'll just put down the answer. It turned out that the database was correctly set up for utf8, but a user was inputting strings encoded in ISO-Latin-1, and I wasn't checking the encoding of user input because I assumed all input would be utf8 compatible. French accented characters encoded in ISO-Latin-1 are invalid byte sequences in utf8. The database seems to handle this by raising a warning and truncating the string at the illegal character, keeping everything before it.
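
A minimal sketch of the missing input check, assuming Ruby 2.0 or later (for String#b) and assuming that any input which is not valid UTF-8 was sent as ISO-8859-1. The helper name to_utf8 is hypothetical; it is not part of Rails or of the original fix.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of user input.
# Assumption: input that is not valid UTF-8 is ISO-8859-1 (Latin-1).
def to_utf8(input)
  candidate = input.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Reinterpret the same bytes as Latin-1 and transcode them to UTF-8.
  input.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "Québec" as a Latin-1 client would send it: a single 0xE9 byte for "é".
latin1_bytes = "Qu\xE9bec".b
puts to_utf8(latin1_bytes)
```

Running the check before saving means the database never sees the invalid byte, so nothing is truncated. Input that is neither UTF-8 nor Latin-1 would still be converted incorrectly, so the assumption should be confirmed against the real clients.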

How do I replace accented Latin characters in Ruby?

Rails already has a built-in for normalizing: normalize your string to form KD, then remove the remaining non-ASCII characters (i.e. the accent marks), like this:

>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"
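
The same idea can be sketched without ActiveSupport, assuming Ruby 2.2 or later, where String#unicode_normalize is built in. The helper name strip_accents is hypothetical.

```ruby
# Hypothetical helper, assuming Ruby 2.2+ (String#unicode_normalize).
def strip_accents(text)
  # NFKD splits each accented letter into a base letter plus combining marks;
  # the gsub then drops every non-ASCII character, i.e. the marks.
  text.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, "")
end

puts strip_accents("àáâãäå")
```

Characters that have no decomposition, such as "ø" or "ß", are removed entirely by this approach rather than replaced with a base letter.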

R import of stata file has problems with French accented characters

Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file, and more importantly a pre-Stata 14 file, so it uses a custom encoding (Stata >=14 uses UTF-8).

So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.

First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.

The pes19_occ_text is encoded in UTF-8, as you can check with:

ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="latin1")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = TRUE)

output: "Producteur tÃ©lÃ©"

This "Ã©" is characteristic of UTF-8 data (here "é") read as latin1.
However, if you try to import with encoding="UTF-8", read_dta will fail: there might be other non-UTF-8 characters in the file that read_dta can't read as UTF-8. We have to do something after the import.

Here, read_dta is doing something nasty: it imports "Producteur tÃ©lÃ©" as if it were latin1 data and converts it to UTF-8, so the resulting string really contains the UTF-8 characters "Ã" and "©".

To fix this, you first have to convert back to latin1. The string will still be "Producteur tÃ©lÃ©", but encoded in latin1.

Then, instead of converting, you simply have to force the encoding to UTF-8, without changing the data.

Here is the code:

ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)

output: "Producteur télé"

You can do the same on other variables with diacritics.


The use of iconv here may be more understandable if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "tÃ©lÃ©" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between, we have four bytes, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "Ã©" when read as latin1).

Actually, "c3 83" is "Ã" and "c2 a9" is "©".

Therefore, we have first to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "tÃ©lÃ©", but this time in latin1. That is, the string has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (and this is not a conversion).

See also the help pages of Encoding and iconv.
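
For readers more at home in Ruby, the same byte-level reasoning can be reproduced as a rough sketch. This is not the R fix itself: encode plays the role of iconv (a real conversion) and force_encoding plays the role of Encoding<- (a relabelling only). The variable names are illustrative.

```ruby
# "télé" written with \u escapes so the literal is UTF-8 regardless of source encoding.
original = "t\u00E9l\u00E9"

# Simulate the bad import: UTF-8 bytes mislabelled as Latin-1, then converted to UTF-8.
mojibake = original.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts mojibake.bytes.map { |b| format("%02x", b) }.join(" ")

# Step 1: convert back to Latin-1, so "Ã" and "©" take one byte each again.
# Step 2: relabel those bytes as UTF-8 without converting anything.
repaired = mojibake.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
puts repaired.bytes.map { |b| format("%02x", b) }.join(" ")
puts repaired == original
```

The first line printed is the ten-byte sequence discussed above, and the second is the six-byte sequence of the correctly encoded word.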


Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.

The first idea that comes to mind is a bad use of the saveold command, that allows one to save data in a Stata file for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >=11.

Maybe a third party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata format, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.


Answer to the comment: how to loop over character variables to do the same? There is a smarter way with dplyr, but here is a simple loop with base R.

ces19web <- read_dta("CES-E-2019-online_F1.dta")

for (n in names(ces19web)) {
  v <- ces19web[[n]]
  if (is.character(v)) {
    v <- iconv(v, from = "UTF-8", to = "latin1")
    Encoding(v) <- "UTF-8"
  }
  ces19web[[n]] <- v
}

Ruby method to remove accents from UTF-8 international characters

I generally use I18n to handle this:

1.9.3p392 :001 > require "i18n"
=> true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
=> "He les mecs!"

How does the magic comment ( # Encoding: utf-8 ) in Ruby work?

Instructions to the Ruby interpreter at the top of the source file are called magic comments. Before processing your source code, the interpreter reads this line and sets the proper encoding. I believe this is quite common for interpreted languages; at least Python uses the same approach.

You can specify encoding in a number of different ways (some of them are recognized by editors):

# encoding: UTF-8
# coding: UTF-8
# -*- coding: UTF-8 -*-
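
A small sketch of what the comment affects, assuming the file itself is saved as UTF-8. The variable name greeting is illustrative.

```ruby
# encoding: UTF-8
# The magic comment must be on the first line (or the second, after a shebang).

greeting = "Hé les mecs!"

puts __ENCODING__        # the source encoding the interpreter used for this file
puts greeting.encoding   # string literals take the source encoding
puts greeting.bytes.size > greeting.length # "é" occupies more than one byte in UTF-8
```

Under Ruby 1.9 the default source encoding is US-ASCII, so without the comment a literal containing "é" is rejected; from Ruby 2.0 onward the default is UTF-8 and the comment is only needed for other encodings.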

You can read some interesting stuff about source encoding in this article.

The only thing I'm aware of that has a similar construction is the shebang, but it relates to Unix shells in general and is not Ruby-specific.

magic_comments defined in ruby/ruby

How to allow right to left languages in Ruby on Rails 3

If you are using Notepad++, first set the encoding to "Encode in UTF-8" and then start coding. If you have already created and saved the file, just changing the encoding type will not do. You will have to keep a copy of the existing code, delete the existing file, open Notepad++, set the encoding first (Encode in UTF-8), and then write or copy the code into it. This way UTF-8 encoding is ensured and you won't have to put "# encoding: UTF-8" at the top of your file.
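
Whether a saved file really contains UTF-8 can be checked from Ruby itself. The following is a rough sketch, assuming Ruby 2.1 or later (for Tempfile.create with a block); the helper name utf8_file? is hypothetical, and the temporary files only stand in for a real source or view file.

```ruby
require "tempfile"

# Hypothetical helper: true when the raw bytes of the file form valid UTF-8.
# Note that a pure ASCII file also counts as valid UTF-8.
def utf8_file?(path)
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end

# Demonstration with two throwaway files holding the word "café".
Tempfile.create("latin1_sample") do |file|
  file.binmode
  file.write("caf\xE9".b) # Latin-1: "é" is the single byte 0xE9
  file.flush
  puts utf8_file?(file.path)
end

Tempfile.create("utf8_sample") do |file|
  file.binmode
  file.write("caf\u00E9") # UTF-8: "é" is the two bytes 0xC3 0xA9
  file.flush
  puts utf8_file?(file.path)
end
```

The check only looks at bytes, so it works the same regardless of which editor produced the file.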


