Ruby 1.9: Regular Expressions with Unknown Input Encoding

As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?

Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.

# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding='ASCII', options=0)
  Regexp.new(pattern.encode(encoding),options)
end



  # Inside code looping through lines of input.
  # The variables 'regex' and 'last_encoding' should be initialized beforehand,
  # so that they persist across iterations.
  if line.respond_to?(:encoding)  # Ruby 1.8 strings have no #encoding
    if line.encoding != last_encoding
      # Regexp::FIXEDENCODING (value 16, bit 00010000) is the option set by //u.
      regex = get_regex('<p>(.*)<\/p>', line.encoding, Regexp::FIXEDENCODING)
      last_encoding = line.encoding
    end
  end
  line.match(regex)

In the pathological case (where the input encoding changes on every line) this would be just as slow, since you would re-encode the regex every time through the loop. But in the common case, where the encoding is constant for an entire file of hundreds or thousands of lines, this results in a vast reduction in re-encoding.
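If the encoding does flip back and forth, one option is to remember every regex that has already been built, keyed by encoding. This is only a sketch: RegexCache and regex_for are hypothetical names (not part of Ruby), and it assumes the pattern can be represented in each input encoding.

```ruby
# Hypothetical helper (not part of Ruby): builds the transcoded Regexp
# at most once per encoding and reuses it afterwards.
class RegexCache
  def initialize(pattern, options = 0)
    @pattern = pattern
    @options = options
    @cache   = {}
  end

  # Returns a Regexp whose source has been transcoded to +encoding+.
  def regex_for(encoding)
    @cache[encoding] ||= Regexp.new(@pattern.encode(encoding),
                                    @options | Regexp::FIXEDENCODING)
  end
end

cache = RegexCache.new("caf\u00e9")

lines = [
  "un caf\u00e9",                      # UTF-8
  "un caf\u00e9".encode("ISO-8859-1")  # same text, Latin-1 bytes
]

# Each line is matched with a regex in its own encoding.
matches = lines.map { |line| cache.regex_for(line.encoding).match(line) }
```

With a cache like this, each encoding costs one transcoding of the pattern no matter how often the input switches.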

Ruby 1.9 regex encoding

You need to encode the initial string and use the FIXEDENCODING option.

1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
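For comparison, here is a small sketch of the same pattern built with and without the option. The variable names are only for illustration.

```ruby
# The source string is tagged as binary, as in the irb session above.
source = "chars".dup.force_encoding(Encoding::BINARY)

plain = Regexp.new(source)                         # no option
fixed = Regexp.new(source, Regexp::FIXEDENCODING)  # encoding is pinned

puts plain.encoding          # US-ASCII
puts plain.fixed_encoding?   # false
puts fixed.fixed_encoding?   # true
puts fixed.encoding == Encoding::BINARY
```

Without the option, an ASCII-only pattern is reported as US-ASCII regardless of how the source string was tagged, which is why the flag is needed.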

Replacing string in UTF-16LE encoded file

It should work if you're careful to encode everything in UTF-16LE.

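re = Regexp.new('FILEVERSION \d\.\d'.encode('UTF-16LE'))
File.open(filepath, "rb:UTF-16LE") do |file|
  file.each do |line|
    # Assumes FILEVERSION is a String constant, defined elsewhere, that holds
    # the replacement text. gsub! only changes the in-memory line; nothing is
    # written back to the file here.
    line.gsub!(re, FILEVERSION.encode('UTF-16LE'))
  end
end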

incompatible character encodings: ASCII-8BIT and UTF-8

I have a suspicion that you either copy/pasted a part of your Haml template into the file, or you're working with a non-Unicode/non-UTF-8 friendly editor.

See if you can recreate that file from scratch in a UTF-8 friendly editor (there are plenty for any platform) and check whether this fixes your problem. Start by erasing the line with #content and retyping it manually.

How to use Ruby to replace text in a VC++ resource file, when the encoding is all wacked out?

Reading and Writing the File

So, first thing I tried was looking for how to read/write UTF-16LE files in Ruby. I found this question and answer, which recommends always opening files in Text file mode (t) on Windows.

When dealing with text files, you should always pass the t modifier. It doesn't make any difference on most operating systems (which is why, unfortunately, most Rubyists forget to pass it), but it is crucial on Windows, which is what you appear to be using.

So, I did that:

irb(main):002:0> File.open("source\\myproject\\app.rc", "rt:UTF-16LE")
ArgumentError: ASCII incompatible encoding needs binmode

I don't know what binmode is, but it might have something to do with the Binary file mode (b). So, let's try that instead.

irb(main):003:0> File.open("source\\myproject\\app.rc", "rb:UTF-16LE")
=> #<File:source\myproject\app.rc>

Eureka! However, I still see some crazy control characters and other unprintables (\r\n), because binary mode does no newline translation.

\r\n//\r\n\r\nVS_VERSION_INFO VERSIONINFO\r\n FILEVERSION 0,0,0,0\r\n PRODUCTVERSION 0,0,0
,0\r\n FILEFLAGSMASK 0x3fL\r\n#ifdef _DEBUG\r\n FILEFLAGS 0x1L\r\n#else\r\n FILEFLAGS 0x0L
\r\n#endif\r\n FILEOS 0x40004L\r\n FILETYPE 0x2L\r\n FILESUBTYPE 0x0L\r\nBEGIN\r\n    BLOC
K \"StringFileInfo\"\r\n    BEGIN\r\n        BLOCK \"040904b0\"\r\n        BEGIN\r\n
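To get something readable out of that, the content can be transcoded to UTF-8 and the line endings normalised by hand. The following is a sketch that uses a temporary file as a stand-in for app.rc; the sample text is made up.

```ruby
require "tempfile"

# Stand-in for app.rc: a small UTF-16LE file with Windows line endings.
sample = "FILEVERSION 0,0,0,0\r\nPRODUCTVERSION 0,0,0,0\r\n"
file = Tempfile.new("app.rc")
file.close
File.binwrite(file.path, sample.encode("UTF-16LE"))

# "rb" disables newline translation, so "\r\n" arrives untouched.
content = File.open(file.path, "rb:UTF-16LE") { |f| f.read }

# Transcode for display and normalise the line endings explicitly.
readable = content.encode("UTF-8").gsub("\r\n", "\n")
puts readable

file.unlink
```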

Replacing the Strings

So, you'll notice that doing a simple gsub like this produces an encoding error.

irb(main):004:0> c.gsub("0.0.0.0","0.0.5.0")
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)

If you read the docs, gsub's first argument is turned into a Regexp, which is shown to be encode-able! So, let's try that...

irb(main):005:0> c.gsub("0.0.0.0".encode("UTF-16LE"),"0.0.5.0".encode("UTF-16LE"))
=> myproduct.dll\"\r\n            VALUE \"ProductName\", \"My Product\"\r\n            VALU
E \"ProductVersion\", \"0.0.5.0\"\r\n        END\r\n    END\r\n    BLOCK \"VarFileInfo\"\r
\n    BEGIN\r\n        VALUE \"Translation\", 0x409, 1200\r\n    END\r\nEND\r\n\r\n#endif

You can see some of the replacements working in the snippet I provided.
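An alternative to encoding every argument is to decode the whole file to UTF-8, edit it with ordinary string literals, and encode it back when writing. This is a different technique from the one shown above, sketched here against a temporary file with made-up contents rather than a real resource file.

```ruby
require "tempfile"

# Stand-in for the resource file (hypothetical contents).
original = "VALUE \"ProductVersion\", \"0.0.0.0\"\r\n"
rc = Tempfile.new("app.rc")
rc.close
File.binwrite(rc.path, original.encode("UTF-16LE"))

# Decode to UTF-8, edit with plain literals, then encode back to UTF-16LE.
text    = File.binread(rc.path).force_encoding("UTF-16LE").encode("UTF-8")
updated = text.gsub("0.0.0.0", "0.0.5.0")
File.binwrite(rc.path, updated.encode("UTF-16LE"))

# Read the file again to confirm the change reached the disk.
result = File.binread(rc.path).force_encoding("UTF-16LE").encode("UTF-8")
puts result
rc.unlink
```

Reading and writing in binary keeps the "\r\n" line endings exactly as they were, so the file stays in the form the resource compiler expects.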

Ruby unfamiliar string usage with Integer.chr and \001

To make the code work in Ruby 1.9, try changing that line to:

flag = @data[2].ord & 2

Prior to Ruby 1.9, str[n] returned an integer between 0 and 255, but in Ruby 1.9, with its new Unicode support, str[n] returns a character (a string of length 1). To get the integer instead of the character, call .ord on it.

The & operator is just the standard bitwise AND operator common to C, Ruby, and many other languages.

Byte number three (0x03) is not a printable ASCII character, so when a string contains that byte and you call inspect, Ruby shows it as an escape sequence such as \003. Just make sure you understand that "\003" is a single-byte string while '\003' is a four-byte string.

In Ruby, strings are really sequences of bytes. In Ruby 1.9, there is also encoding information, but they are still really just a sequence of bytes.
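A quick sketch of these points, using a local variable named data as a stand-in for @data:

```ruby
data = "ab\003"            # the third byte is 0x03

byte = data[2].ord         # Ruby 1.9+: data[2] is a String, so convert it
flag = byte & 2            # bitwise AND: 0b11 & 0b10

puts flag                  # 2
puts data.getbyte(2)       # 3
puts "\003".bytesize       # 1
puts '\003'.bytesize       # 4
```

String#getbyte is another way to get at the raw byte without going through a one-character string first.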

String is .blank? but neither empty nor whitespace

The three "spaces" are actually the characters [32, 160, 32].

Character 160 is not ASCII: it is the non-breaking space (U+00A0), commonly found in HTML and apparently not recognized by squish as whitespace. Try replacing it first:

string.gsub(160.chr, ' ').squish
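On Ruby 1.9 and later with a UTF-8 string, the character can also be written as "\u00A0". Here is a plain-Ruby sketch that needs no Rails; the gsub/strip pair is only a stand-in for squish, and the sample string is made up.

```ruby
# Assumes a UTF-8 string; "\u00A0" is the non-breaking space (code point 160).
string = " \u00A0 hello \u00A0 world \u00A0 "

puts string.codepoints.first(3).inspect   # [32, 160, 32]

# Stand-in for squish: the POSIX class [[:space:]] also matches U+00A0.
cleaned = string.gsub(/[[:space:]]+/, " ").strip
puts cleaned.inspect                      # "hello world"
```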

Unicode characters in a Ruby script?

You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII-superset: each code unit is stored as two bytes, with ASCII characters having the other byte stored as \0. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.

What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
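The byte values are easy to check from Ruby 1.9 or later. A minimal sketch:

```ruby
bom = "\uFEFF"   # U+FEFF, the Byte Order Mark

le = bom.encode("UTF-16LE").bytes
be = bom.encode("UTF-16BE").bytes

# Print each byte as a three-digit octal escape.
puts le.map { |b| format("\\%03o", b) }.join   # \377\376
puts be.map { |b| format("\\%03o", b) }.join   # \376\377
```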

Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.

And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):

puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"

However! Writing to the console is itself a big problem. What encoding is used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), but it's never UTF-8. This setting is—again, misleadingly—known as the ANSI code page.

So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:

puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
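On Ruby 1.9 or later you can check that the two escaped strings really are the same text in different encodings. This sketch assumes Ruby's Windows-31J encoding as the name for code page 932:

```ruby
utf8_bytes  = "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
cp932_bytes = "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"

# Tag each byte sequence with the encoding it was written in.
utf8  = utf8_bytes.dup.force_encoding("UTF-8")
cp932 = cp932_bytes.dup.force_encoding("Windows-31J")

puts utf8.bytesize    # 27 bytes in UTF-8
puts cp932.bytesize   # 18 bytes in cp932
same_text = (cp932.encode("UTF-8") == utf8)
puts same_text
```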

But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).

(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)

Regex to match hashtags in a sentence using ruby

Can you try this regex:

/(?:^|\s)(?:(?:#\d+?)|(#\w+?))\s/i

UPDATE 1:

There are a few cases that the above regex will not match, such as #blah23blah and #23blah23.
Hence I modified the regex to take care of all cases.

Regex:

/(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

Breakdown:

(?:\s|^) -- Matches the preceding space or start of line. Does not capture the match.
# -- Matches the hash but does not capture it.
(?!\d+(?:\s|$)) -- Negative lookahead that rejects tags made up ONLY of digits between # and a space (or end of line).
(\w+) -- Matches and captures all word characters.
(?=\s|$) -- Positive lookahead to ensure a following space or end of line. It does not consume the space, which is required to match adjacent valid hash tags.

Sample text modified to capture most cases:

#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs
link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3

Matches:

Match 1: blah

Match 2: box

Match 3: good2

Match 4: 3good

Match 5: mkvef214asdwq

Match 6: 3e4

Match 7: 2good
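The same check can be run from Ruby with String#scan. This is a sketch; the variable names are mine:

```ruby
hashtag = /(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

text = "#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs\n" \
       "link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3"

# scan returns one array per match (one element per capture group).
tags = text.scan(hashtag).flatten
puts tags.inspect
```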

Rubular link

UPDATE 2:

To exclude words starting or ending with underscore, just include your exclusions in the negative lookahead like this:

/(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

The sample, regex and matches are recorded in this Rubular link
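A small sketch of the underscore rule in Ruby, using a made-up sample string:

```ruby
hashtag = /(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

# Hypothetical sample: tags with a leading or trailing underscore, and
# digits-only tags, are skipped; an underscore in the middle is fine.
text = "#_foo #bar_ #ok #a_b #12"

tags = text.scan(hashtag).flatten
puts tags.inspect   # ["ok", "a_b"]
```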
