Equivalent of Iconv.conv(UTF-8//IGNORE,...) in Ruby 1.9.X?
I thought this was it:
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
will replace all unknowns with '?'.
To ignore all unknowns, use :replace => '':
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
Edit:
I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:
string.encode("UTF-8", ...).force_encoding('UTF-8')
Script seems to be running, ok now. But I'm pretty sure I'd gotten errors with this earlier.
Edit 2:
Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.
How to change deprecated iconv to String#encode for invalid UTF8 correction
The question that Martijn linked to has what seem to be the two best ways to do that, but Martijn made an understandable but incorrect change when copying the second approach to his answer here. Doing .encode('UTF-8', <options>).encode('UTF-8') doesn't work. As indicated in the original answer in the other question, the key is to encode to a different encoding, then back to UTF-8. If your original string is already flagged as UTF-8 in ruby's internals then ruby will ignore any call to encode it as UTF-8.
In the following examples I'm going to use "a#{0xFF.chr}b".force_encoding('UTF-8') to produce a string that ruby believes is UTF-8 but which contains invalid UTF-8 bytes.
1.9.3p194 :019 > "a#{0xFF.chr}b".force_encoding('UTF-8')
=> "a\xFFb"
1.9.3p194 :020 > "#{0xFF.chr}".force_encoding('UTF-8').encoding
=> #<Encoding:UTF-8>
Note how encoding to UTF-8 does nothing:
1.9.3p194 :016 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
=> "a\xFFb"
But encoding to something else (UTF-16) and then back to UTF-8 cleans up the string:
1.9.3p194 :017 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
=> "ab"
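As a minimal sketch of the round trip shown above, the conversion can be wrapped in a small helper. The method name strip_invalid_utf8 is made up for illustration and is not part of Ruby or of the original answer; it assumes the input is already tagged as UTF-8.

```ruby
# Hypothetical helper: drops bytes that are not valid UTF-8 by transcoding
# to UTF-16 and back, so that Ruby cannot treat the call as a no-op.
def strip_invalid_utf8(str)
  str.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
end

# A string that is tagged as UTF-8 but contains the invalid byte 0xFF.
dirty = "a\xFFb".dup.force_encoding('UTF-8')
clean = strip_invalid_utf8(dirty)

puts dirty.valid_encoding?   # false
puts clean.valid_encoding?   # true
puts clean                   # ab
```

The helper returns a new string and leaves the original untouched, so it can be applied at the point where untrusted input enters the program.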
UTF-8 conversion not working with String#encode but Iconv
In your call to String#encode you don’t specify a source encoding. Ruby is using the string's current encoding as the source, which appears to be UTF-8, and according to the docs:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
In other words the call has no effect, and leaves the bytes in the string as they are, encoded as ISO-8859-1. The next call to gsub then tries to interpret these bytes as UTF-8, and since they are invalid (they are unchanged from ISO-8859-1) you get the error you see.
String#encode has a form that accepts the source encoding as the second parameter, so you can explicitly specify it, similarly to what you are doing with Iconv. Try this:
git_log = git_log.encode(Encoding::UTF_8,
                         Encoding::ISO_8859_1,
                         :invalid => :replace,
                         :undef => :replace,
                         :replace => '')
You could also use the ! form in this case, which has the same effect:
git_log.encode!(Encoding::UTF_8,
                Encoding::ISO_8859_1,
                :invalid => :replace,
                :undef => :replace,
                :replace => '')
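To see the effect of naming the source encoding, here is a small self-contained sketch. The sample text is an assumption chosen for illustration: it stands in for ISO-8859-1 bytes that have been tagged as UTF-8 by mistake, which seems to be the situation described above.

```ruby
# Assumption for illustration: these bytes are ISO-8859-1 text that has been
# tagged as UTF-8 by mistake (0xE4 is "ä" in ISO-8859-1, invalid in UTF-8).
mislabeled = "Norrlandsv\xE4gen".dup.force_encoding('UTF-8')
puts mislabeled.valid_encoding?   # false

# Naming the source encoding makes Ruby transcode the bytes from ISO-8859-1
# instead of trusting the UTF-8 tag on the string.
fixed = mislabeled.encode(Encoding::UTF_8,
                          Encoding::ISO_8859_1,
                          :invalid => :replace,
                          :undef   => :replace,
                          :replace => '')
puts fixed.valid_encoding?        # true
puts fixed                        # Norrlandsvägen
```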
Ruby converting string encoding from ISO-8859-1 to UTF-8 not working
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
EDIT: For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
ruby `encode': \xC3 from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
It seems you should use another encoding for the object. Set the proper encoding on the variable @tree, for instance ISO-8859-1 instead of ASCII-8BIT, by calling @tree.force_encoding('ISO-8859-1'), because ASCII-8BIT is meant only for binary data.
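The following sketch reproduces the error and the suggested fix on a small sample. The sample bytes are an assumption: they stand in for the contents of @tree and are taken to be ISO-8859-1 text. If the real data is in a different encoding, that encoding's name has to be used instead.

```ruby
# Raw bytes tagged as ASCII-8BIT (binary); 0xE4 is "ä" in ISO-8859-1.
raw = "Norrlandsv\xE4gen".b

# Transcoding straight from ASCII-8BIT fails, because Ruby has no character
# mapping for binary bytes above 0x7F.
begin
  raw.encode('UTF-8')
  raised = false
rescue Encoding::UndefinedConversionError
  raised = true
end
puts raised   # true

# Telling Ruby what the bytes really are makes the conversion well defined.
text = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts text     # Norrlandsvägen
```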
To find the current external encoding for ruby, issue:
Encoding.default_external
If sudo solves the problem, the problem was in the default codepage (encoding), so to resolve it you have to set the proper default encoding, in either of these ways:
In Ruby, to change the encoding to UTF-8 or another proper one, do as follows:
Encoding.default_external = Encoding::UTF_8
In bash, grep the current valid setup:
$ sudo env|grep UTF-8
LC_ALL=ru_RU.UTF-8
LANG=ru_RU.UTF-8
Then set them in .bashrc properly, in a similar way, though not necessarily with the ru_RU language, such as the following:
export LC_ALL=ru_RU.UTF-8
export LANG=ru_RU.UTF-8
when we import csv data, how eliminate invalid byte sequence in UTF-8
Ruby 1.9's CSV has a new parser that works with m17n. The parser works with the Encoding of the IO or String object being read. The following methods: ::foreach, ::open, ::read, and ::readlines can take an optional :encoding option, with which you can specify the Encoding.
For example:
CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')
This would convert all strings to UTF-8.
You can also use the more standard encoding name 'ISO-8859-1':
CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})
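A self-contained way to try the :encoding option is to write a small ISO-8859-1 file and read it back. This is only a sketch: the temporary file and its contents are invented for the example, and the 'ISO-8859-1:UTF-8' value follows the same external:internal pattern as the 'windows-1251:utf-8' example above.

```ruby
require 'csv'
require 'tempfile'

# Write a tiny CSV file whose bytes are ISO-8859-1 (0xE4 is "ä").
file = Tempfile.new(['sample', '.csv'])
file.binmode
file.write("street\nNorrlandsv\xE4gen\n".b)
file.close

# Read the file as ISO-8859-1 and transcode every field to UTF-8.
rows = CSV.read(file.path, :encoding => 'ISO-8859-1:UTF-8')
file.unlink

puts rows[1][0]            # Norrlandsvägen
puts rows[1][0].encoding   # UTF-8
```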
Ruby 2.0 iconv replacement
Iconv was deprecated in 1.9.3 and removed from the standard library in 2.0.
You can still install it.
Reference material if you are unsure:
https://rvm.io/packages/iconv/
However, the suggestion is that you don't, and rather use:
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
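A quick sketch of what the :replace option does, assuming the input arrives as a binary (ASCII-8BIT) string. The sample bytes are made up for the example; with a string that is already tagged as UTF-8 the behaviour can differ, as discussed in the answers above.

```ruby
# A binary string with one byte (0xE0) that has no mapping to UTF-8.
raw = "sar\xE0".b

# Replace every unconvertible byte with a question mark ...
replaced = raw.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

# ... or drop it entirely by using an empty replacement string.
dropped = raw.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

puts replaced   # sar?
puts dropped    # sar
```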
How to convert encoding from ASCII-8BIT to another, without passing through UTF-8 in ruby?
Given a string in binary (ASCII-8BIT) encoding:
str = "sar\xE0".b #=> "sar\xE0"
str.encoding #=> #<Encoding:ASCII-8BIT>
You can tell Ruby that this string is actually in ISO-8859-1 via force_encoding:
str.force_encoding('ISO-8859-1') #=> "sar\xE0"
str.encoding #=> #<Encoding:ISO-8859-1>
Note that you still see \xE0 because Ruby does not attempt to convert the character.
Printing the string on a UTF-8 terminal gives:
puts str
sar�
The replacement character � is shown, because 0xE0 is an invalid byte in UTF-8.
Printing the same string on an ISO-8859-1 terminal however gives:
puts str
sarà
To work with the string in Ruby, you usually want to convert it to UTF-8 via encode!:
str.encode!('UTF-8') #=> "sarà"
str.encoding #=> #<Encoding:UTF-8>
Or in a single step, by passing both the destination encoding and the source encoding to encode!:
str = "sar\xE0".b #=> "sar\xE0"
str.encode!('UTF-8', 'ISO-8859-1') #=> "sarà"
str.encoding #=> #<Encoding:UTF-8>
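The difference between the two calls is easiest to see in the bytes. The sketch below uses the same sample string as above and only inspects it: force_encoding changes the label, while encode rewrites the bytes.

```ruby
raw = "sar\xE0".b

# force_encoding only relabels the string; the bytes stay exactly the same.
relabelled = raw.dup.force_encoding('ISO-8859-1')
p relabelled.bytes   # [115, 97, 114, 224]

# encode transcodes: the single byte 0xE0 becomes the two-byte UTF-8
# sequence 0xC3 0xA0 for "à".
converted = relabelled.encode('UTF-8')
p converted.bytes    # [115, 97, 114, 195, 160]
```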