Recursively Convert Hash Containing Non-Utf Chars to Utf

recursively convert hash containing non-UTF chars to UTF

In Ruby 1.9 you can usually just flip the encoding using the encode method. A wrapper around this that recursively transforms the hash, not unlike symbolize_keys makes this straightforward:

class Hash
  def to_utf8
    Hash[
      self.collect do |k, v|
        if (v.respond_to?(:to_utf8))
          [ k, v.to_utf8 ]
        elsif (v.respond_to?(:encoding))
          [ k, v.dup.encode('UTF-8') ]
        else
          [ k, v ]
        end
      end
    ]
  end
end

rails omniauth and UTF-8 errors

Omniauth proved to be the problem producing the ASCII-8BIT

I ended up forcing the Omniauth hash into submission using:

omniauth_controller.rb

session[:omniauth] = omniauth.to_utf8

added recursive method to force convert the rogue ASCII-8BIT to UTF8

some_initializer.rb

class Hash
  def to_utf8
    Hash[
      self.collect do |k, v|
        if (v.respond_to?(:to_utf8))
          [ k, v.to_utf8 ]
        elsif (v.respond_to?(:encoding))
          [ k, v.dup.force_encode('UTF-8') ]
        else
          [ k, v ]
        end
      end
    ]
  end
end

Special thanks to tadman

recursively convert hash containing non-UTF chars to UTF

In Perl, how to create a mixed-encoding string (or a raw sequence of bytes) in a scalar?

So you have

my $in = "\x61\xC3\x8B\x00\xC3\xBF";

and you want

my $out = "\x61\xC3\x8B\x00\xFF";

This is the result of decoding only some parts of the input string, so you want something like the following:

sub decode_utf8 { my ($s) = @_; utf8::decode($s) or die("Invalid Input"); $s }

my $out = join "",
               substr($in, 0, 3),
   decode_utf8(substr($in, 3, 1)),
   decode_utf8(substr($in, 4, 2));

Tested.

Alternatively, you could decode the entire thing and re-encode the parts that should be encoded.

sub encode_utf8 { my ($s) = @_; utf8::encode($s); $s }

utf8::decode($in) or die("Invalid Input");
my $out = join "",
   encode_utf8(substr($in, 0, 2)),
               substr($in, 2, 1),
               substr($in, 3, 1);

Tested.

You have not indicate how you know which to decode and which not to decode, but you indicated you have this information.

How do I convert an ANSI encoded file to UTF-8 with Notepad++?

Regarding this part:

When I convert it to UTF-8 without bom and close file, the file is again ANSI when I reopen.

The easiest solution is to avoid the problem entirely by properly configuring Notepad++.

Try Settings -> Preferences -> New document -> Encoding -> choose UTF-8 without BOM, and check Apply to opened ANSI files.

notepad++ UTF-8 apply to opened ANSI files

That way all the opened ANSI files will be treated as UTF-8 without BOM.

For explanation what's going on, read the comments below this answer.

To fully learn about Unicode and UTF-8, read this excellent article from Joel Spolsky.

PHP function to make slug (URL string)

Instead of a lengthy replace, try this one:

public static function slugify($text, string $divider = '-')
{
  // replace non letter or digits by divider
  $text = preg_replace('~[^\pL\d]+~u', $divider, $text);

  // transliterate
  $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);

  // remove unwanted characters
  $text = preg_replace('~[^-\w]+~', '', $text);

  // trim
  $text = trim($text, $divider);

  // remove duplicate divider
  $text = preg_replace('~-+~', $divider, $text);

  // lowercase
  $text = strtolower($text);

  if (empty($text)) {
    return 'n-a';
  }

  return $text;
}

This was based off the one in Symfony's Jobeet tutorial.

Really Good, Bad UTF-8 example test data

Check out Markus Kuhn’s UTF-8 decoder stress test

Recursively Convert Hash Containing Non-Utf Chars to Utf