Convert Non-ASCII Chars from ASCII-8BIT to UTF-8

How to convert encoding from ASCII-8BIT to another, without passing through UTF-8 in ruby?

Given a string in binary (ASCII-8BIT) encoding:

str = "sar\xE0".b #=> "sar\xE0"
str.encoding #=> #<Encoding:ASCII-8BIT>

You can tell Ruby that this string is actually in ISO-8859-1 via force_encoding:

str.force_encoding('ISO-8859-1') #=> "sar\xE0"
str.encoding #=> #<Encoding:ISO-8859-1>

Note that you still see \xE0 because Ruby does not attempt to convert the character.

Printing the string on a UTF-8 terminal gives:

puts str
sar�

The replacement character � is shown because a lone 0xE0 byte is not a valid UTF-8 sequence.

Printing the same string on an ISO-8859-1 terminal, however, gives:

puts str
sarà

To work with the string in Ruby, you usually want to convert it to UTF-8 via encode!:

str.encode!('UTF-8') #=> "sarà"
str.encoding #=> #<Encoding:UTF-8>

Or do it in a single step by passing both the destination and the source encoding to encode!:

str = "sar\xE0".b                  #=> "sar\xE0"
str.encode!('UTF-8', 'ISO-8859-1') #=> "sarà"
str.encoding #=> #<Encoding:UTF-8>
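To make the difference between the two calls concrete, here is a minimal sketch that compares the bytes at each step. The variable names are only illustrative, and the UTF-16LE line is an added example (not from the answer above) showing that the target passed to encode does not have to be UTF-8.

```ruby
raw = "sar\xE0".b # binary (ASCII-8BIT) input

# force_encoding only relabels the string; the bytes stay the same.
relabelled = raw.dup.force_encoding('ISO-8859-1')

# encode transcodes: the single byte 0xE0 becomes the two bytes 0xC3 0xA0.
converted = relabelled.encode('UTF-8')

# A target other than UTF-8, with the source encoding given explicitly.
utf16 = raw.encode('UTF-16LE', 'ISO-8859-1')

p raw.bytes        #=> [115, 97, 114, 224]
p relabelled.bytes #=> [115, 97, 114, 224]
p converted.bytes  #=> [115, 97, 114, 195, 160]
p utf16.bytes      #=> [115, 0, 97, 0, 114, 0, 224, 0]
```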

getting encoding error cannot convert ascii-8bit to utf-8bit

I changed the file mode from 'w' to 'wb', so the uploaded bytes are written in binary mode without any transcoding:

if params[:user][:image].present?
  uploaded_io = params[:user][:image]
  name = "image_" << @user.username << uploaded_io.original_filename
  File.open(Rails.root.join('public', 'images', 'profile', name), 'wb') do |file|
    file.write(uploaded_io.read)
  end
end
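The same idea can be tried outside Rails. The sketch below assumes a made-up payload and a temporary file (neither comes from the question); it writes binary data with 'wb' and checks that the bytes read back are identical.

```ruby
require 'tmpdir'

# Hypothetical stand-in for uploaded image data: arbitrary non-ASCII bytes.
payload = "\x89PNG\r\n\x1A\n\xE0\xFF".b

round_trip = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'upload.bin')

  # 'wb' opens the file in binary mode, so the bytes are written as they are.
  File.open(path, 'wb') { |file| file.write(payload) }

  # binread returns the raw bytes as an ASCII-8BIT string.
  round_trip = File.binread(path)
end

puts round_trip == payload
puts round_trip.encoding
```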

Force encode from US-ASCII to UTF-8 (iconv)

ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.

It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
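One rough way to check whether data is really ASCII is sketched below. The helper name describe_bytes is hypothetical, and the check can only tell "pure ASCII" and "valid UTF-8" apart from everything else; it cannot identify which legacy encoding the remaining data uses.

```ruby
# Hypothetical helper: classify raw data as :ascii, :utf8 or :unknown.
def describe_bytes(data)
  bytes = data.b # work on a binary copy of the input

  # Only bytes below 0x80: the data is ASCII, and therefore also UTF-8.
  return :ascii if bytes.ascii_only?

  # Relabel as UTF-8 and ask Ruby whether the byte sequence is well formed.
  return :utf8 if bytes.force_encoding(Encoding::UTF_8).valid_encoding?

  # Anything else needs its real encoding determined some other way.
  :unknown
end

p describe_bytes('plain text')     #=> :ascii
p describe_bytes("sar\xC3\xA0".b)  #=> :utf8
p describe_bytes("sar\xE0".b)      #=> :unknown
```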

Encoding::UndefinedConversionError \xC2 from ASCII-8BIT to UTF-8 with redcarpet

In the end I solved this by adding force_encoding("UTF-8") to the html, like this:

      f.write html.force_encoding("UTF-8")

That fixed it.
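A small sketch of why this works, using a made-up html string rather than real redcarpet output: the bytes are already UTF-8 and only carry the binary label, so converting them fails while relabelling them succeeds. Checking valid_encoding? afterwards is an added precaution, since force_encoding does not verify anything.

```ruby
# Hypothetical html: valid UTF-8 bytes (0xC2 0xA9 is the copyright sign),
# but labelled as binary (ASCII-8BIT).
html = "<p>\xC2\xA9 example</p>".b

# Transcoding from binary fails, because ASCII-8BIT bytes above 0x7F
# have no defined character to convert.
error = nil
begin
  html.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  error = e
end
puts error.class

# Relabelling keeps the bytes and just declares them to be UTF-8.
fixed = html.dup.force_encoding('UTF-8')
puts fixed.valid_encoding?
```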

Non escaped non ASCII character in non ASCII-8BIT script

I do not know anything about brakeman. But since your file is encoded in UTF-8, this is how the byte stream of your regular expression looks when it is read as ASCII/ANSI with code page Windows-1252:

/\ã€Œ(?>[^\ã€Œ\ã€\\]+|\\{2}|\\.)*\ã€/

In hexadecimal, these bytes are:

2F 5C E3 80 8C 28 3F 3E 5B 5E 5C E3 80 8C 5C E3 80 8D 5C 5C 5D 2B 7C 5C 5C 7B 32 7D 7C 5C 5C 2E 29 2A 5C E3 80 8D 2F

As you can see, there are many "characters" (bytes) with a code value greater than 127 decimal (hexadecimal 7F) that have no backslash before them if the byte stream is not first converted from UTF-8 to Unicode (usually UTF-16 Little Endian).

It is always possible to write Perl regular expressions without any character with a code value greater than 127, even if the expression should find characters in the full Unicode range.

In the scripts forum of the text editor UltraEdit there is a topic, "Creating a Perl regular expression string with ANSI/Unicode characters", which explains how such an expression can be created. It also links to an UltraEdit script, written mainly in JavaScript, that converts a regular expression containing ANSI or Unicode characters into one that uses their hexadecimal representations and therefore only ASCII characters.

Running this UltraEdit script within UltraEdit on your regular expression, after removing the unnecessary backslashes before the Unicode characters, puts this Perl regular expression string into the clipboard:

/\x{300c}(?>[^\x{300c}\x{300d}\\]+|\\{2}|\\.)*\x{300d}/

For a Ruby script, \u must be used instead of \x, resulting in the expression:

/\u{300c}(?>[^\u{300c}\u{300d}\\]+|\\{2}|\\.)*\u{300d}/

This regular expression should match the same text as yours without producing any warning from brakeman, as it now consists only of ASCII characters with a code value smaller than 128 decimal.
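As a quick check, the ASCII-only expression can be tried against a few sample strings. The sample texts below are invented for illustration and written with \u escapes as well, so the whole snippet stays ASCII.

```ruby
pattern = /\u{300c}(?>[^\u{300c}\u{300d}\\]+|\\{2}|\\.)*\u{300d}/

# Invented sample: text wrapped in the corner brackets U+300C and U+300D.
quoted = "say \u{300c}hello\u{300d} now"

# Invented sample: a backslash-escaped closing bracket inside the quotation.
escaped = "\u{300c}a\\\u{300d}b\u{300d}"

quoted_match  = pattern.match(quoted)
escaped_match = pattern.match(escaped)
no_match      = pattern.match('no brackets here')

puts quoted_match[0] == "\u{300c}hello\u{300d}"
puts escaped_match[0] == escaped
puts no_match.nil?
```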

ruby `encode': \xC3 from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

It seems you should use another encoding for the object. Set the proper code page on the variable @tree, for instance ISO-8859-1 instead of ASCII-8BIT, by calling @tree.force_encoding('ISO-8859-1'), because ASCII-8BIT is meant only for binary data.

To find the current external encoding for ruby, issue:

Encoding.default_external

If sudo solves the problem, the problem was in the default code page (encoding). To resolve it, set the proper default encoding in one of these ways:

  1. In Ruby, change the encoding to UTF-8 or another proper one as follows:

    Encoding.default_external = Encoding::UTF_8
  2. In bash, grep the current valid setup:

    $ sudo env|grep UTF-8
    LC_ALL=ru_RU.UTF-8
    LANG=ru_RU.UTF-8

    Then set them in .bashrc in a similar way, using your own locale rather than necessarily ru_RU, for example:

    export LC_ALL=ru_RU.UTF-8
    export LANG=ru_RU.UTF-8
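A different approach from the two options above, offered only as a sketch: instead of changing the global default, name the encodings explicitly when reading, so the result no longer depends on the locale of the current user. The file name and its ISO-8859-1 content are assumptions made up for the example.

```ruby
require 'tmpdir'

# Shows what Ruby currently assumes for external data (depends on the locale).
puts Encoding.default_external

text = nil
Dir.mktmpdir do |dir|
  # Hypothetical input file containing ISO-8859-1 bytes.
  path = File.join(dir, 'tree.txt')
  File.binwrite(path, "sar\xE0\n".b)

  # 'ISO-8859-1:UTF-8' means: the file is ISO-8859-1, hand me UTF-8 strings.
  text = File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

puts text.encoding
puts text == "sar\u00E0\n"
```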

