Ruby - Utf-8 File Encoding

Set UTF-8 as default for Ruby 1.9.3

To change the source encoding (i.e. the encoding your source code is actually written in), you currently have to use the magic comment:

# encoding: utf-8

It is not enough to either set the internal encoding (the encoding of the internal string representation after conversion) or the external encoding (the assumed encoding of read files). You actually have to set the magic encoding comment on top of files to set the source encoding.

In ChiliProject we have a rake task which sets the correct encoding header in all files automatically before a release.

As for encoding defaults:

  • Ruby 1.8 and below didn't know the concept of string encodings at all. Strings were more or less byte arrays.
  • Ruby 1.9: the default source encoding (and so the encoding of string literals) is US-ASCII.
  • Ruby 2.0 and above: the default source encoding is UTF-8.

Thus, if you use Ruby 2.0, you could skip the encoding comment and correctly assume UTF-8 encoding everywhere by default.
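A rough way to check this on your own interpreter is to load two tiny scripts and compare what __ENCODING__ reports. This is only a sketch: the helper name source_encoding_of and the global $reported_encoding are made up for the illustration, and it assumes Ruby 2.1 or later (for Tempfile.create).

```ruby
require "tempfile"

# Hypothetical helper: writes a two-line script whose first line is +header+,
# loads it, and returns the source encoding that script saw.
def source_encoding_of(header)
  Tempfile.create(["magic", ".rb"]) do |f|
    f.write("#{header}\n$reported_encoding = __ENCODING__\n")
    f.flush
    load f.path
    $reported_encoding
  end
end

with_comment    = source_encoding_of("# encoding: iso-8859-1")
without_comment = source_encoding_of("# just an ordinary comment")

puts with_comment     # ISO-8859-1, taken from the magic comment
puts without_comment  # UTF-8 on Ruby 2.0 and above
```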

Ruby - UTF-8 file encoding

No, there are not "exactly 3 ways" to specify the 'magic comment' -- there are an infinite number of them. Any comment on the first line that contains coding: will work, according to JEG2:

... the preferred way to set your source Encoding ... it's called a magic comment. If the first line of your code is a comment that includes the word coding, followed by a colon and space, and then an Encoding name, the source Encoding for that file is changed to the indicated Encoding.

So, any of these should work:

# coding: UTF-8
# encoding: UTF-8
# zencoding: UTF-8
# vocoding: UTF-8
# fun coding: UTF-8
# decoding: UTF-8
# 863280148705622662 coding: UTF-8 0072364213
# It was the night before Christmas and all through the house, not a creature was coding: UTF-8, not even with a mouse.

Ruby: Is there a way to specify your encoding in File.write?

AFAIK you can't do it at the time of performing the write, but you can do it at the time of creating the File object. Here is an example using UTF-8 encoding:

File.open(FILE_LOCATION, "w:UTF-8") do |f|
  f.write(....)
end

Another possibility would be to use the external_encoding option:

File.open(FILE_LOCATION, "w", external_encoding: Encoding::UTF_8)

Of course this assumes that the data being written is a String. If you have (packed) binary data, you would use "wb" for opening the file, and syswrite instead of write to write the data to the file.

UPDATE As engineersmnky points out in a comment, the arguments for the encoding can also be passed as parameter to the write method itself, for instance

IO::write(FILE_LOCATION, data_to_write, external_encoding: Encoding::UTF_8)
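To see what the external encoding actually does on write, here is a small sketch that writes a Latin-1 string through both variants and inspects the bytes that end up on disk. The file names and the sample string are made up for the illustration.

```ruby
require "tmpdir"

# "café" stored as four ISO-8859-1 bytes (the é is the single byte 0xE9).
latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)

bytes_via_open = bytes_via_write = nil

Dir.mktmpdir do |dir|
  path_a = File.join(dir, "open.txt")
  path_b = File.join(dir, "write.txt")

  # Encoding given in the mode string when the File object is created.
  File.open(path_a, "w:UTF-8") { |f| f.write(latin1) }

  # Encoding given as an option to the write call itself.
  File.write(path_b, latin1, external_encoding: Encoding::UTF_8)

  bytes_via_open  = File.binread(path_a).bytes
  bytes_via_write = File.binread(path_b).bytes
end

p bytes_via_open   # [99, 97, 102, 195, 169]: é was transcoded to two UTF-8 bytes
p bytes_via_write  # same bytes
```

In both cases Ruby transcodes the string from its own encoding to the external encoding while writing, so the string must carry the correct encoding label beforehand.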

How can I use Net::Http to download a file with UTF-8 characters in it?

How can I: 1) Check the encoding of a remote file like that.

You can check the Content-Type header of the response, which, if present, may look something like this:

Content-Type: text/plain; charset=utf-8

As you can see, the encoding is specified there. If there's no Content-Type header, or if the charset is not specified, or if the charset is specified incorrectly, then you can't know the encoding of the text. There are gems that can try to guess the encoding (with increasing accuracy), e.g. rchardet and charlock_holmes, but for complete accuracy you have to know the encoding before reading the text.
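As a sketch of how the declared charset could be picked out in Ruby: Net::HTTP responses expose the Content-Type parameters through type_params. The helper name declared_encoding is hypothetical, and the response object is built by hand here only so the example runs without network access; in real code it would come from Net::HTTP.get_response or similar.

```ruby
require "net/http"

# Hypothetical helper: returns the Encoding named in the Content-Type header,
# or nil when the charset is missing or Ruby does not know it.
def declared_encoding(response)
  name = response.type_params["charset"]
  return nil if name.nil?
  Encoding.find(name.delete('"'))
rescue ArgumentError
  nil
end

# Stand-in for a real response, for illustration only.
response = Net::HTTPResponse.new("1.1", "200", "OK")
response["Content-Type"] = "text/plain; charset=utf-8"

no_charset = Net::HTTPResponse.new("1.1", "200", "OK")
no_charset["Content-Type"] = "text/plain"

p declared_encoding(response)    # #<Encoding:UTF-8>
p declared_encoding(no_charset)  # nil
```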

This code somehow thinks all files that are downloaded are encoded in ASCII 8-bit.

In Ruby, ASCII-8BIT is equivalent to binary, which means the Net::HTTP library just gives you a string containing a series of single bytes, and it's up to you to decide how to interpret those bytes.

If you want to interpret those bytes as UTF-8, then you do that with String#force_encoding():

text = text.force_encoding("UTF-8")

You might want to do that if, for instance, you want to do some regex matching on the string, and you want to match full characters (which might be multi-byte) rather than just single bytes.

Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8

Using String#encode('UTF-8') to convert ASCII-8BIT to UTF-8 doesn't work for bytes whose values are greater than 127:

(0..255).each do |ascii_code|
  str = ascii_code.chr("ASCII-8BIT")
  #puts str.encoding #=>ASCII-8BIT

  begin
    str.encode("UTF-8")
  rescue Encoding::UndefinedConversionError
    puts "Can't encode char with ascii code #{ascii_code} to UTF-8."
  end
end

--output:--
Can't encode char with ascii code 128 to UTF-8.
Can't encode char with ascii code 129 to UTF-8.
Can't encode char with ascii code 130 to UTF-8.
...
...
Can't encode char with ascii code 253 to UTF-8.
Can't encode char with ascii code 254 to UTF-8.
Can't encode char with ascii code 255 to UTF-8.

Ruby converts the ASCII-8BIT string one byte at a time, and ASCII-8BIT assigns no character to bytes above 127, so there is nothing to convert them to. So, while 128 may be a legal byte in UTF-8 when part of a multi-byte character sequence, 128 is not a legal UTF-8 character as a single byte.
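The difference between relabelling and converting can be sketched as follows. The sample strings are made up, and treating "\x95" as Windows-1252 (where it is a bullet character) is only an assumption about where such a byte typically comes from; you still have to know the real source encoding.

```ruby
# Bytes as Net::HTTP would hand them over: tagged ASCII-8BIT (binary).
raw = "caf\xC3\xA9".b
puts raw.encoding   # ASCII-8BIT
puts raw.length     # 5 (counted as single bytes)

# force_encoding only changes the label; the bytes stay the same.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts utf8.length           # 4 (é is now one two-byte character)
puts utf8.valid_encoding?  # true

# A lone 0x95 is not valid UTF-8, so relabelling does not help here.
legacy     = "\x95 item".b
relabelled = legacy.dup.force_encoding(Encoding::UTF_8)
puts relabelled.valid_encoding?  # false

# Assumption: the bytes are really Windows-1252. Label them as such first,
# then convert to UTF-8 with encode.
converted = legacy.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
puts converted  # "• item"
```

Checking valid_encoding? after force_encoding is a cheap way to notice that the assumed encoding was wrong before the string is used any further.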

As for writing the strings to a file, instead of this:

f = open(filename)

...if you want to output UTF-8 to the file, you would write:

f = open(filename, "w:UTF-8")

By default, Ruby labels data read from a file with the value of Encoding.default_external; when writing without an explicit encoding, strings are written out byte for byte, without transcoding. The default_external encoding is pulled from your system's environment, or you can set it explicitly.

Ruby: File.read Error encoding:UTF-8

It seems you are using an older Ruby version. Try this instead:

File.read(inputfile, :encoding => "UTF-8").gsub(/<group.*?type=\"public\".*?\/>/, "")

Ruby Encoding While File Writing

You need to open the file in binary to get the right encoding.

file = File.new(path, 'wb')

Check the encoding like this:

puts file.external_encoding

It should be 'ASCII-8BIT'. Do the same with your decrypted file content; it should have the same encoding. Otherwise you need to convert it like this:

Document.find(123).fetch_file.force_encoding('ASCII-8BIT')

You could also use File.binread(file) and File.binwrite(file, content):

http://ruby-doc.org/core-2.3.0/IO.html#method-c-binread

http://ruby-doc.org/core-2.3.0/IO.html#method-c-binwrite
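A minimal round trip with these two methods might look like the sketch below. The payload bytes and the file name are made up for the illustration.

```ruby
require "tmpdir"

# Arbitrary binary payload; pack("C*") returns an ASCII-8BIT string.
payload = [0x00, 0x95, 0xFF, 0x0A].pack("C*")

read_back = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "blob.bin")
  File.binwrite(path, payload)     # writes the bytes untouched
  read_back = File.binread(path)   # reads them back as ASCII-8BIT
end

puts read_back.encoding    # ASCII-8BIT
p read_back.bytes          # [0, 149, 255, 10]
```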

Ruby: how to add # encoding: UTF-8 automatically?

Try the magic_encoding gem; it can insert the UTF-8 magic comment into all Ruby files in your app.

[EDIT]
Having switched to SublimeText, I now use the auto-encoding-for-ruby plugin.
