recursively convert hash containing non-UTF chars to UTF
In Ruby 1.9 you can usually just flip the encoding using the encode
method. A wrapper around this that recursively transforms the hash, not unlike symbolize_keys
makes this straightforward:
class Hash
def to_utf8
Hash[
self.collect do |k, v|
if (v.respond_to?(:to_utf8))
[ k, v.to_utf8 ]
elsif (v.respond_to?(:encoding))
[ k, v.dup.encode('UTF-8') ]
else
[ k, v ]
end
end
]
end
end
rails omniauth and UTF-8 errors
Omniauth proved to be the problem producing the ASCII-8BIT
I ended up forcing the Omniauth hash into submission using:
omniauth_controller.rb
session[:omniauth] = omniauth.to_utf8
added recursive method to force convert the rogue ASCII-8BIT to UTF8
some_initializer.rb
class Hash
def to_utf8
Hash[
self.collect do |k, v|
if (v.respond_to?(:to_utf8))
[ k, v.to_utf8 ]
elsif (v.respond_to?(:encoding))
[ k, v.dup.force_encode('UTF-8') ]
else
[ k, v ]
end
end
]
end
end
Special thanks to tadman
recursively convert hash containing non-UTF chars to UTF
In Perl, how to create a mixed-encoding string (or a raw sequence of bytes) in a scalar?
So you have
my $in = "\x61\xC3\x8B\x00\xC3\xBF";
and you want
my $out = "\x61\xC3\x8B\x00\xFF";
This is the result of decoding only some parts of the input string, so you want something like the following:
sub decode_utf8 { my ($s) = @_; utf8::decode($s) or die("Invalid Input"); $s }
my $out = join "",
substr($in, 0, 3),
decode_utf8(substr($in, 3, 1)),
decode_utf8(substr($in, 4, 2));
Tested.
Alternatively, you could decode the entire thing and re-encode the parts that should be encoded.
sub encode_utf8 { my ($s) = @_; utf8::encode($s); $s }
utf8::decode($in) or die("Invalid Input");
my $out = join "",
encode_utf8(substr($in, 0, 2)),
substr($in, 2, 1),
substr($in, 3, 1);
Tested.
You have not indicate how you know which to decode and which not to decode, but you indicated you have this information.
How do I convert an ANSI encoded file to UTF-8 with Notepad++?
Regarding this part:
When I convert it to UTF-8 without bom and close file, the file is again ANSI when I reopen.
The easiest solution is to avoid the problem entirely by properly configuring Notepad++.
Try Settings
-> Preferences
-> New document
-> Encoding
-> choose UTF-8
without BOM, and check Apply to opened ANSI files
.
That way all the opened ANSI files will be treated as UTF-8 without BOM.
For explanation what's going on, read the comments below this answer.
To fully learn about Unicode and UTF-8, read this excellent article from Joel Spolsky.
PHP function to make slug (URL string)
Instead of a lengthy replace, try this one:
public static function slugify($text, string $divider = '-')
{
// replace non letter or digits by divider
$text = preg_replace('~[^\pL\d]+~u', $divider, $text);
// transliterate
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
// remove unwanted characters
$text = preg_replace('~[^-\w]+~', '', $text);
// trim
$text = trim($text, $divider);
// remove duplicate divider
$text = preg_replace('~-+~', $divider, $text);
// lowercase
$text = strtolower($text);
if (empty($text)) {
return 'n-a';
}
return $text;
}
This was based off the one in Symfony's Jobeet tutorial.
Really Good, Bad UTF-8 example test data
Check out Markus Kuhn’s UTF-8 decoder stress test
Related Topics
How to Specify Formatting Options for To_Yaml in Ruby
How to Split String into 2 Parts After Certain Position
Use [].Replace to Make a Copy of an Array
Shading Mask Algorithm for Radiation Calculations
How to Interpolate a Variable in a Ruby Regex
When Should I Use Datetime VS Date, Time Fields in Ruby/Rails
Actionmailer Smtp "Certificate Verify Failed"
Rails: Merit Gem Badge Not Registering or Displaying
Adding Fields to Devise Sign Up Using Rails 4
Bundler How to Uninstall Conflicting Dependency
Share Session Between Two Rails4 Applications
How to Run Rails App in a Completely Isolated Instances of Chrome
Any Standard Guide for Ruby Win32Ole API
Rails 3.2 Force_Ssl Except on Landing Page
How to Distribute a Ruby Script via Homebrew
Carrierwave and Correct File Extension Depending on Its Contents