Ruby Regex Error: Incompatible Encoding Regexp Match (ASCII-8BIT Regexp with UTF-8 String)

Ruby Regex Error: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)

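Your regexp is being compiled as ASCII-8BIT, while the string you match it against is UTF-8.

Add the encoding declaration (the "magic comment") at the top of the file where the regexp is declared. It must be the first line, or the second line if the file starts with a shebang: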

# encoding: utf-8

And you're done. When Ruby parses the file, it now treats every literal in it (Regexp, String, etc.) as UTF-8.

UPDATE: UTF-8 has been the default source encoding since Ruby 2.0, so on those versions the declaration is redundant.
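To check which encoding Ruby actually assigned, you can inspect the source encoding and the literals directly. This is only a minimal diagnostic sketch; the variable names are made up for illustration, and the regexp uses a \u escape so that the result does not depend on how the file itself is saved.

```ruby
# Source encoding of the current file (UTF-8 by default on Ruby 2.0+).
source_encoding = __ENCODING__

# An ASCII-only regexp literal stays encoding-neutral (US-ASCII).
ascii_only = /abc/

# A \u escape pins the regexp to UTF-8.
accented = /caf\u00e9/

puts source_encoding
puts ascii_only.encoding        # US-ASCII
puts accented.encoding          # UTF-8
puts accented.fixed_encoding?   # true
```

If the regexp reports an encoding you did not expect, the file's source encoding or the way the regexp was built is the first place to look.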

Regex Error: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)

The Net::HTTP standard library only returns binary (ASCII-8BIT) strings. See the long-standing feature request: Feature #2567: Net::HTTP does not handle encoding correctly. So if you want UTF-8 strings, you have to set their encoding to UTF-8 manually with String#force_encoding, which relabels the bytes without converting them:

source_code.force_encoding(Encoding::UTF_8)
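The mismatch and the fix can be reproduced without a network call. In this sketch the response body is simulated with String#b, which returns the same bytes tagged as ASCII-8BIT. The variable names and the sample text are assumptions for illustration, not taken from the original question.

```ruby
# Simulated Net::HTTP body: UTF-8 bytes, but tagged as binary (ASCII-8BIT).
body = "caf\u00e9 au lait".b

# The \u escape makes this a fixed UTF-8 regexp.
pattern = /caf\u00e9/

begin
  pattern =~ body
  raised = false
rescue Encoding::CompatibilityError => e
  raised = true
  puts "mismatch: #{e.message}"
end

# force_encoding only relabels the bytes; it does not transcode them.
# dup first so the original binary string stays untouched.
utf8_body = body.dup.force_encoding(Encoding::UTF_8)
position = pattern =~ utf8_body
puts position   # 0
```

Note that force_encoding mutates its receiver, so the dup matters if other code still expects the binary string.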

If the website's character encoding isn't UTF-8, you have to implement a heuristic based on the Content-Type header or the charset attribute of a <meta> tag, and even then the declared encoding might not be the correct one. You can check whether a string's bytes are valid for its current encoding with String#valid_encoding? if you need to deal with such cases. Thankfully, most websites use UTF-8 nowadays.
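One way such a check could look is sketched below. The helper to_utf8 and its fallback parameter are hypothetical names, not part of Net::HTTP. ISO-8859-1 is used as the fallback only as an example; in practice the fallback would come from the Content-Type header or the <meta> charset.

```ruby
# Hypothetical helper: keep the bytes if they are valid UTF-8,
# otherwise transcode them from an assumed fallback charset.
def to_utf8(body, fallback: Encoding::ISO_8859_1)
  utf8 = body.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Relabel as the fallback charset, then actually convert to UTF-8.
  body.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

latin1_bytes = "caf\xE9".b     # "café" encoded as ISO-8859-1
utf8_bytes   = "caf\u00e9".b   # "café" encoded as UTF-8

puts to_utf8(latin1_bytes)     # café
puts to_utf8(utf8_bytes)       # café
```

valid_encoding? only tells you that the bytes are well-formed UTF-8, not that UTF-8 was the intended encoding, so this remains a heuristic.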

Also, as @WiktorStribiżew already wrote in the comments, the regexp encoding modifiers s (Windows-31J) and u (UTF-8) aren't necessary here, and only very rarely are. That applies especially to u, since modern Ruby defaults to UTF-8 (or, where sufficient, its subset US-ASCII) anyway. In other programming languages these letters may mean something different; in Perl, for example, s means single-line mode.

Ruby: incompatible encoding regexp match

Just force the regex to UTF-8 with the u modifier:

str = 'é'
arr = str.split(/x/mu)
#=> ["é"]
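To see what the u modifier actually changes, the sketch below compares a plain regexp with a u-modified one. The variable names are illustrative. It also shows the flip side: once a regexp is fixed to UTF-8, it no longer accepts a binary string that contains non-ASCII bytes.

```ruby
plain = /x/    # ASCII-only source, no modifier
fixed = /x/u   # u pins the regexp to UTF-8

puts plain.encoding          # US-ASCII
puts fixed.encoding          # UTF-8
puts fixed.fixed_encoding?   # true

# Non-ASCII bytes tagged as ASCII-8BIT.
binary = "\u00e9".b

# A non-fixed regexp adapts to the string's encoding: no error, just no match.
plain_result = (plain =~ binary)

# A fixed UTF-8 regexp refuses the binary string.
begin
  fixed =~ binary
  fixed_raised = false
rescue Encoding::CompatibilityError
  fixed_raised = true
end

puts plain_result.inspect    # nil
puts fixed_raised            # true
```

So u helps when the regexp must be UTF-8 to match a UTF-8 string, but the string still has to be tagged as UTF-8 (or contain only ASCII).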

Documentation: https://ruby-doc.org/core-2.3.1/Regexp.html#class-Regexp-label-Encoding


