Incompatible Character Encoding in rails - how to just fail/skip sensibly?
After much pain, this is how I solved it.
You need to set the default encodings in your environment.rb file, like so:
# Load the rails application
require File.expand_path('../application', __FILE__)
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
# Initialize the rails application
Stma::Application.initialize!
Apparently this has something to do with Ruby's roots in Japan. A forced UTF-8 default wouldn't always be helpful when dealing with Japanese (or Russian) text in other encodings, so this sort of thing isn't set as standard.
I've then done the following:
# Parse the raw RFC822 message
mail_object = Mail.new(mail[0].attr["RFC822"])
# Transcoding from 'binary' with replace: '' drops every non-ASCII byte instead of raising
subject = mail_object.subject.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') if mail_object.subject
body_part = (mail_object.text_part || mail_object.html_part || mail_object).body.decoded
body = body_part.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') if body_part
from = mail_object.from.join(",") if mail_object.from # handles multiple addresses
to = mail_object.to.join(",") if mail_object.to # handles multiple addresses
That should get all the main pieces into strings you can easily work with, and it won't fail nastily if something is missing or unusual. Hope that helps somebody...
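For anyone who wants to see what that encode call actually does, here is a minimal sketch that runs without the mail gem. The helper names strip_non_ascii and drop_invalid_utf8 are made up for this illustration. The second one is an alternative, assuming the bytes are mostly valid UTF-8 and you would rather keep accented characters; it relies on String#scrub (Ruby 2.1+).

```ruby
# Hypothetical helpers for illustration; neither is part of the mail gem.

# Same idea as the encode! call above: treat the input as raw bytes and
# drop everything that is not plain ASCII instead of raising.
def strip_non_ascii(raw)
  return nil if raw.nil?
  raw.dup.force_encoding(Encoding::BINARY)
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# Alternative (assumption: input is mostly valid UTF-8): keep valid
# multi-byte characters and drop only the broken byte sequences.
def drop_invalid_utf8(raw)
  return nil if raw.nil?
  raw.dup.force_encoding(Encoding::UTF_8).scrub('')
end

raw = "R\xC3\xA9sum\xC3\xA9 \xFF".b  # UTF-8 bytes plus one stray 0xFF byte
puts strip_non_ascii(raw)    # accents are removed along with the bad byte
puts drop_invalid_utf8(raw)  # accents survive, only 0xFF is dropped
```

Both helpers return nil for nil input, which mirrors the "if mail_object.subject" guards above.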
Character encoding with Ruby 1.9.3 and the mail gem
After playing a bit, I found this:
body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..."
message.parts.map { |part| part.decoded.force_encoding("ISO-8859-1").encode(part.charset) } # multi-part
You can extract the charset from the message like so.
message.charset #=> for simple, non-multipart
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset
Be careful with non-multipart, as the following can cause trouble:
body.charset #=> returns "US-ASCII" which is WRONG!
body.force_encoding(body.charset).encode("UTF-8") #=> Conversion error...
body.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :)
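The same idea in a form you can run without the mail gem: a rough sketch that labels raw bytes with the charset declared by the message and then transcodes to UTF-8. to_utf8 is a hypothetical helper, and falling back to ISO-8859-1 for an unknown charset name is an assumption made here, not something the mail gem does.

```ruby
# Hypothetical helper; `charset` would come from message.charset or part.charset.
def to_utf8(bytes, charset)
  source = begin
    Encoding.find(charset.to_s)
  rescue ArgumentError
    Encoding::ISO_8859_1 # assumed fallback for unknown charset names
  end
  # force_encoding only relabels the bytes; encode does the actual conversion.
  bytes.dup.force_encoding(source).encode(Encoding::UTF_8)
end

latin1 = "R\xE9sum\xE9".b # "Résumé" as ISO-8859-1 bytes

puts to_utf8(latin1, "ISO-8859-1")

# Labelling the same bytes as US-ASCII fails, because 0xE9 is not valid ASCII.
begin
  to_utf8(latin1, "US-ASCII")
rescue Encoding::InvalidByteSequenceError => e
  puts "conversion error: #{e.message}"
end
```

This is why the charset reported for the body matters: the bytes are identical in both calls, only the label differs.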
How best to encode or clean up the email body when collecting mails through Ruby Net::IMAP
Dug around for many hours trying to solve this problem so adding my answer to a few of the threads I found...
https://stackoverflow.com/a/26604049/2386548
Hope that helps somebody...
before_action for specific controller
Just move your call into the controller you want it to run in.
class ApplicationController < ActionController::Base
# nothing here!
def test
# ...
end
end
class CatsController < ApplicationController
before_action :test, only: [:index]
end
class RabbitsController < ApplicationController
before_action :test, only: [:index]
end
Remove non-utf8 characters from string
Using a regex approach:
$regex = <<<'END'
/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC0-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
){1,100} # ...between 1 and 100 times per match
)
| . # anything else
/x
END;
preg_replace($regex, '$1', $text);
It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.
$regex = <<<'END'
/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC0-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
){1,100} # ...between 1 and 100 times per match
)
| ( [\x80-\xBF] ) # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] ) # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
if ($captures[1] != "") {
// Valid byte sequence. Return unmodified.
return $captures[1];
}
elseif ($captures[2] != "") {
// Invalid byte of the form 10xxxxxx.
// Encode as 11000010 10xxxxxx.
return "\xC2".$captures[2];
}
else {
// Invalid byte of the form 11xxxxxx.
// Encode as 11000011 10xxxxxx.
return "\xC3".chr(ord($captures[3])-64);
}
}
preg_replace_callback($regex, "utf8replacer", $text);
EDIT:
!empty(x) will match non-empty values ("0" is considered empty).
x != "" will match non-empty values, including "0".
x !== "" will match anything except "".
x != "" seems the best one to use in this case.
I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.
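If you are doing this from Ruby rather than PHP, roughly the same two behaviours are available through String#scrub (Ruby 2.1+). This is a sketch, not a port of the regex: remove_invalid and repair_invalid are made-up names, and Ruby's UTF-8 validation is stricter than the pattern above (it rejects overlong forms, for instance), so results can differ on edge cases. The repair variant relies on the fact that prefixing 0xC2 or 0xC3 the way utf8replacer does is the same as reading the stray byte as ISO-8859-1.

```ruby
# Hypothetical helpers built on String#scrub.

# Drop every byte sequence that is not valid UTF-8.
def remove_invalid(text)
  text.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# Re-encode each invalid byte as if it were an ISO-8859-1 character.
def repair_invalid(text)
  text.dup.force_encoding(Encoding::UTF_8).scrub do |bad|
    bad.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  end
end

mixed = "na\xC3\xAFve caf\xE9".b # valid UTF-8 "ï" followed by a Latin-1 "é"
puts remove_invalid(mixed) # the stray 0xE9 byte is dropped
puts repair_invalid(mixed) # the stray 0xE9 byte becomes a UTF-8 "é"
```

As with the PHP version, repairing only makes sense when the stray bytes really are Latin-1 text; random corruption will just turn into odd symbols.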