Equivalent of Iconv.conv("UTF-8//IGNORE", ...) in Ruby 1.9.X?

I thought this was it:

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

will replace all unknowns with '?'.

To ignore all unknowns, :replace => '':

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

Edit:

I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:

string.encode("UTF-8", ...).force_encoding('UTF-8')

The script seems to be running OK now. But I'm pretty sure I'd gotten errors with this earlier.

Edit 2:

Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.

How to change deprecated iconv to String#encode for invalid UTF8 correction

The question that Martijn linked to has what seem to be the two best ways to do that, but Martijn made an understandable but incorrect change when copying the second approach to his answer here. Doing .encode('UTF-8', <options>).encode('UTF-8') doesn't work. As indicated in the original answer in the other question, the key is to encode to a different encoding, then back to UTF-8. If your original string is already flagged as UTF-8 in ruby's internals then ruby will ignore any call to encode it as UTF-8.

In the following examples I'm going to use "a#{0xFF.chr}b".force_encoding('UTF-8') to produce a string that ruby believes is UTF-8 but which contains invalid UTF-8 bytes.

1.9.3p194 :019 > "a#{0xFF.chr}b".force_encoding('UTF-8')
=> "a\xFFb"
1.9.3p194 :020 > "#{0xFF.chr}".force_encoding('UTF-8').encoding
=> #<Encoding:UTF-8>

Note how encoding to UTF-8 does nothing:

1.9.3p194 :016 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
=> "a\xFFb"

But encoding to something else (UTF-16) and then back to UTF-8 cleans up the string:

1.9.3p194 :017 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
=> "ab"
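The round trip above can be wrapped in a small helper. This is only a sketch: the name strip_invalid_utf8 is made up for illustration, and it assumes the input is already tagged as UTF-8, as in the examples above.

```ruby
# Hypothetical helper (the name is not from the answer above): drops bytes
# that are not valid UTF-8 by transcoding to UTF-16 and back.
# Assumes +str+ is already tagged as UTF-8.
def strip_invalid_utf8(str)
  # Nothing to clean if every byte sequence is already valid.
  return str if str.valid_encoding?

  str.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
end

dirty = "a#{0xFF.chr}b".force_encoding('UTF-8')
clean = strip_invalid_utf8(dirty)

puts dirty.valid_encoding?  # false
puts clean.inspect          # "ab"
puts clean.valid_encoding?  # true
```

On Ruby 2.1 and later, String#scrub is a built-in way to do the same cleanup; it postdates the Ruby version shown in this answer.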

UTF-8 conversion not working with String#encode but Iconv

In your call to String#encode you don't specify a source encoding. Ruby is using the string's current encoding as the source, which appears to be UTF-8, and according to the docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

In other words the call has no effect, and leaves the bytes in the string as they are, encoded as ISO-8859-1. The next call to gsub then tries to interpret these bytes as UTF-8, and since they are invalid (they are unchanged from ISO-8859-1) you get the error you see.

String#encode has a form that accepts the source encoding as the second parameter, so you can explicitly specify it, similarly to what you are doing with Iconv. Try this:

git_log = git_log.encode(Encoding::UTF_8,
                         Encoding::ISO_8859_1,
                         :invalid => :replace,
                         :undef => :replace,
                         :replace => '')

You could also use the ! form in this case, which has the same effect:

git_log.encode!(Encoding::UTF_8,
                Encoding::ISO_8859_1,
                :invalid => :replace,
                :undef => :replace,
                :replace => '')
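To check that the conversion did what you expect, you can compare valid_encoding? before and after. A minimal sketch, assuming a made-up sample string stands in for the real git_log (ISO-8859-1 bytes that Ruby has tagged as UTF-8):

```ruby
# Stand-in for the real git_log (an assumption for illustration):
# "Norrlandsv\u00E4gen" as ISO-8859-1 bytes, wrongly tagged as UTF-8.
git_log = "Norrlandsv\xE4gen".force_encoding(Encoding::UTF_8)
puts git_log.valid_encoding?   # false: \xE4 followed by "g" is not valid UTF-8

# Naming the source encoding makes Ruby transcode the bytes.
fixed = git_log.encode(Encoding::UTF_8,
                       Encoding::ISO_8859_1,
                       :invalid => :replace,
                       :undef   => :replace,
                       :replace => '')

puts fixed.valid_encoding?     # true
puts fixed.bytes.length        # 15: the single byte \xE4 became two bytes
```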

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]

Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:

string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]

What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"

EDIT For your specific problem, this should work:

require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE skips certificate verification
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# Tag the raw bytes as ISO-8859-1, then transcode them to UTF-8
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
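The last line can be tried without a network call. The following is a sketch only: body_to_utf8 is a hypothetical helper, the sample bytes are made up, and the ISO-8859-1 default is simply the encoding assumed in this question.

```ruby
# Hypothetical helper (not part of the answer above): returns a UTF-8 copy of
# a raw response body. +charset+ is whatever encoding the server really used;
# defaulting to ISO-8859-1 is an assumption carried over from this question.
def body_to_utf8(raw_body, charset = 'ISO-8859-1')
  # dup so the caller's string keeps its original encoding tag
  raw_body.dup.force_encoding(charset).encode('UTF-8')
end

# Simulated body: a binary (ASCII-8BIT) string holding ISO-8859-1 bytes.
raw = "Norrlandsv\xE4gen".b
body = body_to_utf8(raw)

puts body.encoding          # UTF-8
puts body.valid_encoding?   # true
puts raw.encoding           # unchanged: still the binary encoding
```

Working on a copy keeps the raw bytes available in case the guessed charset turns out to be wrong.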

ruby `encode': \xC3 from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

It seems the object is tagged with the wrong encoding. ASCII-8BIT is meant for binary data, so Ruby has no way to map its non-ASCII bytes to UTF-8 characters. Tag the variable @tree with the encoding the data is really in, for instance ISO-8859-1 instead of ASCII-8BIT, by using @tree.force_encoding('ISO-8859-1').
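A minimal sketch of the failure and the relabeling fix. The variable tree and its contents are made up to stand in for @tree, and the sample assumes the bytes are really UTF-8; with genuinely ISO-8859-1 data you would pass 'ISO-8859-1' instead.

```ruby
# Stand-in for @tree (an assumption for illustration): the UTF-8 bytes for
# "caf" plus e-acute, tagged as ASCII-8BIT the way binary reads are.
tree = "caf\xC3\xA9".b

begin
  tree.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  # ASCII-8BIT has no character mapping for bytes above 0x7F, so this raises.
  puts "conversion failed: #{e.message}"
end

# Relabel the bytes with the encoding they were really written in.
# UTF-8 is assumed for this sample.
tree.force_encoding('UTF-8')
puts tree.valid_encoding?   # true
puts tree.length            # 4 characters (stored in 5 bytes)
```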

To find the current external encoding for ruby, issue:

Encoding.default_external

If sudo solves the problem, the problem was the default encoding (codepage), so to resolve it you have to set the proper default encoding, by either of the following:

  1. In Ruby, set the default external encoding to UTF-8 or another appropriate one:

    Encoding.default_external = Encoding::UTF_8

  2. In bash, check the settings that work under sudo:

    $ sudo env|grep UTF-8
    LC_ALL=ru_RU.UTF-8
    LANG=ru_RU.UTF-8

    Then set the same variables in .bashrc, substituting your own locale for ru_RU where it differs, for example:

    export LC_ALL=ru_RU.UTF-8
    export LANG=ru_RU.UTF-8

when we import csv data, how eliminate invalid byte sequence in UTF-8

Ruby 1.9's CSV has a new parser that works with m17n. The parser works with the Encoding of the IO object or String it reads. The methods ::foreach, ::open, ::read, and ::readlines accept an optional :encoding option, with which you can specify the Encoding.

For example:

CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')

would transcode all strings from Windows-1251 to UTF-8.

You can also use the more standard encoding name 'ISO-8859-1':

CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})
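To see the :encoding option at work without a real data file, here is a sketch that writes a small Windows-1251 file and reads it back. The helper name read_csv_as_utf8 and the sample row are made up for illustration.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: reads a CSV file stored in +source_encoding+ and
# returns rows whose fields are UTF-8 strings.
def read_csv_as_utf8(path, source_encoding)
  CSV.read(path, :encoding => "#{source_encoding}:utf-8")
end

# Sample data (made up for this sketch): two Cyrillic words, converted to
# Windows-1251 bytes before being written to a temporary file.
sample = "\u0438\u043C\u044F,\u0433\u043E\u0440\u043E\u0434\n".encode('Windows-1251')

rows = Tempfile.create(['sample', '.csv']) do |file|
  file.binmode          # write the bytes exactly as they are
  file.write(sample)
  file.flush
  read_csv_as_utf8(file.path, 'windows-1251')
end

puts rows.first.map { |field| field.encoding }.uniq.inspect  # [#<Encoding:UTF-8>]
```

The part before the colon names the encoding of the file on disk, and the part after it names the encoding you want the parsed strings to have.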

Ruby 2.0 iconv replacement

Iconv was deprecated in 1.9.3 and removed from the standard library in 2.0.
You can still install it.

Reference material if you are unsure:
https://rvm.io/packages/iconv/

However, the suggestion is that you don't, and instead use:

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
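The two options cover different failures: :invalid handles bytes that are malformed in the source encoding, while :undef handles characters that are valid but have no equivalent in the target encoding. The sketch below uses made-up input and targets ISO-8859-1 rather than UTF-8, purely so that both cases show up in one string.

```ruby
# Sample input (made up for this sketch): one invalid byte (\xFF) plus one
# valid character, the euro sign, that ISO-8859-1 cannot represent.
input = "a\xFFb".force_encoding('UTF-8') + "\u20AC"

both    = input.encode('ISO-8859-1', :invalid => :replace, :undef => :replace, :replace => '?')
dropped = input.encode('ISO-8859-1', :invalid => :replace, :undef => :replace, :replace => '')

puts both     # a?b?
puts dropped  # ab

# With only :invalid handled, the euro sign still raises.
begin
  input.encode('ISO-8859-1', :invalid => :replace, :replace => '?')
rescue Encoding::UndefinedConversionError => e
  puts "undefined conversion: #{e.message}"
end
```

An empty :replace string gives the closest match to Iconv's //IGNORE, since offending bytes and characters are dropped instead of substituted.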


How to convert encoding from ASCII-8BIT to another, without passing through UTF-8 in ruby?

Given a string in binary (ASCII-8BIT) encoding:

str = "sar\xE0".b #=> "sar\xE0"
str.encoding #=> #<Encoding:ASCII-8BIT>

You can tell Ruby that this string is actually in ISO-8859-1 via force_encoding:

str.force_encoding('ISO-8859-1') #=> "sar\xE0"
str.encoding #=> #<Encoding:ISO-8859-1>

Note that you still see \xE0 because Ruby does not attempt to convert the character.

Printing the string on a UTF-8 terminal gives:

puts str
sar�

The replacement character � is shown because a lone 0xE0 byte is not valid UTF-8 (in UTF-8, 0xE0 starts a three-byte sequence).

Printing the same string on an ISO-8859-1 terminal, however, gives:

puts str
sarà

To work with the string in Ruby, you usually want to convert it to UTF-8 via encode!:

str.encode!('UTF-8') #=> "sarà"
str.encoding #=> #<Encoding:UTF-8>

Or in a single step, by passing both the destination encoding and the source encoding to encode!:

str = "sar\xE0".b                  #=> "sar\xE0"
str.encode!('UTF-8', 'ISO-8859-1') #=> "sarà"
str.encoding #=> #<Encoding:UTF-8>

