Does Multibyte Character Interfere with End-Line Character Within a Regex

In Ruby trunk, the issue has now been accepted as a bug. Hopefully, it will be fixed.

Update: Two patches have been posted in Ruby trunk.

How to use sed expression for substituting double width characters with single width

If perl is okay:

$ perl -Mopen=locale -Mutf8 -pe 'tr/０-９ａ-ｚＡ-Ｚ（）【】－一/0-9a-zA-Z()[]--/' ip.txt
Part Number
123-956-AA
343-213-[E]
XTE-898-(5)
  • -Mopen=locale applies the locale's encoding (UTF-8 here) to input and output, and -Mutf8 tells perl that the script text itself is UTF-8
  • tr/０-９ａ-ｚＡ-Ｚ（）【】－一/0-9a-zA-Z()[]--/ translates the full-width characters to their ASCII counterparts; y can be used instead of tr


GNU sed 4.2.2 can be used as well, but its y command doesn't support ranges, and it only handles multibyte characters in a UTF-8 locale:

$ # simulating OP's POSIX locale
$ echo '９１Ａ９foo' | LC_ALL=C sed 'y/Ａ９/A9/'
sed: -e expression #1, char 12: strings for `y' command are different lengths

$ # changing to a utf8 locale
$ echo '９１Ａ９foo' | LC_ALL=en_US.UTF-8 sed 'y/Ａ９/A9/'
9１A9foo
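
For comparison, the same character-for-character mapping can be sketched in Ruby, whose String#tr accepts ranges of multibyte characters. This is an alternative to the sed and perl commands above, not part of them; the helper name to_half_width is made up for illustration, and the script is assumed to be saved as UTF-8.

```ruby
# Hypothetical helper: map full-width digits, letters, brackets and the
# full-width hyphen to their ASCII counterparts, one character at a time.
def to_half_width(text)
  # '\-' escapes the hyphen so that tr treats it as a literal character.
  text.tr('０-９ａ-ｚＡ-Ｚ（）【】－', '0-9a-zA-Z()[]\-')
end

puts to_half_width('９１Ａ９foo')          # 91A9foo
puts to_half_width('３４３－２１３－【Ｅ】')  # 343-213-[E]
```

Characters that are not listed in the first argument, such as the ASCII foo above, pass through unchanged.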

Further reading: https://wiki.archlinux.org/index.php/locale

Regexp non alphanumerical but not German characters

To remove all that is not a letter or a space you can use this:

str.gsub(/[^\p{L}\s]+/, '')

This uses a negated character class: [^\p{L}\s] matches everything that is not a letter (in any language) or a whitespace character (space, tab, newline).

\p{L} is the Unicode character class for letters.

You can easily add other characters you want to preserve like -:

str.gsub(/[^\p{L}\s-]+/, '')

example script:

# encoding: UTF-8

str = "mönchengladbach."

str = str.gsub(/[^\p{L}\s]+/, '#')

puts str
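
A small sketch comparing the two patterns above on one input may help; the sample string and variable names are made up for illustration.

```ruby
# encoding: UTF-8

city = "Mönchen-Gladbach!?"

# Keep only letters and whitespace.
letters_only = city.gsub(/[^\p{L}\s]+/, '')

# Same, but also preserve the hyphen.
with_hyphen = city.gsub(/[^\p{L}\s-]+/, '')

puts letters_only  # MönchenGladbach
puts with_hyphen   # Mönchen-Gladbach
```

In both cases ö survives, because \p{L} covers letters outside ASCII.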

php Regex - Pattern incorrect not getting desired results

OK, fortunately it seems PHP's multibyte functions support the Windows-1252 character encoding. This is what I have come up with; I hope it works:

$whole_wk_file = file_get_contents('Work.arx');

// mb_ereg-style patterns take no delimiters, so there are no surrounding slashes.
$pattern1 = '\"[^\"\|]+\|[^\"\|]+\|[^\"\|]+\"';

mb_internal_encoding("Windows-1252");
mb_eregi($pattern1, $whole_wk_file, $matches_wk);

print_r($matches_wk);

Regex to remove non letters

Just gsub! is sufficient:

o.gsub!(/\W+/, '')

Note that gsub! modifies the original object o in place. Also, if o does not contain any non-word characters, gsub! returns nil, so using its return value as the modified string is unreliable.

You probably want this instead:

c = o.gsub(/\W+/, '')
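
The difference in return values can be seen in a short sketch; the variable names are made up for illustration, and .dup is used only so that the string literals can be mutated safely.

```ruby
dirty = "abc!?".dup
clean = "abc".dup

changed   = dirty.gsub!(/\W+/, '')  # a substitution happened: returns the string
unchanged = clean.gsub!(/\W+/, '')  # nothing to replace: returns nil
copy      = clean.gsub(/\W+/, '')   # non-bang version always returns a string

p changed    # "abc"
p unchanged  # nil
p copy       # "abc"
```

Note that \W keeps digits and underscores as well as letters, since all of them count as word characters.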

Remove non-utf8 characters from string

Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                        # ...one to 100 times
  )
| .                                 # anything else
/x
END;
// preg_replace returns the cleaned string; it does not modify $text in place.
$text = preg_replace($regex, '$1', $text);

It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.

It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one to 100 times
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
    if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
    }
    elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2".$captures[2];
    }
    else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3".chr(ord($captures[3])-64);
    }
}
// preg_replace_callback returns the repaired string; it does not modify $text in place.
$text = preg_replace_callback($regex, "utf8replacer", $text);

EDIT:

  • !empty(x) will match non-empty values ("0" is considered empty).
  • x != "" will match non-empty values, including "0".
  • x !== "" will match anything except "".

x != "" seems to be the best one to use in this case.

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

How to iterate UTF-8 string in PHP?

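Use preg_split. With the "u" modifier, the pattern and the subject string are treated as UTF-8, so the empty pattern splits between characters rather than between bytes.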

$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

