Does multibyte character interfere with end-line character within a regex?
In Ruby trunk, the issue has now been accepted as a bug. Hopefully, it will be fixed.
Update: Two patches have been posted in Ruby trunk.
How to use sed expression for substituting double width characters with single width
If perl is okay:
$ perl -Mopen=locale -Mutf8 -pe 'tr/0-9a-zA-Z()【】-一/0-9a-zA-Z()[]\-\-/' ip.txt
Part Number
123-956-AA
343-213-[E]
XTE-898-(5)
-Mopen=locale -Mutf8 makes perl use the locale's encoding for input/output and treat the script itself as utf8
tr/0-9a-zA-Z()【】-一/0-9a-zA-Z()[]\-\-/ translates the double width characters to their single width counterparts; y can be used instead of tr
sed (GNU sed 4.2.2) can be used too, but its y command doesn't support ranges, so every character has to be listed
$ # simulating OP's POSIX locale
$ echo '91A9foo' | LC_ALL=C sed 'y/A9/A9/'
sed: -e expression #1, char 12: strings for `y' command are different lengths
$ # changing to a utf8 locale
$ echo '91A9foo' | LC_ALL=en_US.UTF-8 sed 'y/A9/A9/'
91A9foo
Further reading: https://wiki.archlinux.org/index.php/locale
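For comparison, the same kind of translation can be sketched in Ruby, whose String#tr works on characters rather than bytes and accepts ranges. The helper name to_single_width and the sample input are made up for illustration, and only digits, letters, parentheses and brackets are mapped.

```ruby
# Hypothetical helper (not from the answer above): map full-width digits,
# letters, parentheses and lenticular brackets to their ASCII counterparts.
# Both lists expand to the same number of characters, so the mapping is 1:1.
def to_single_width(str)
  str.tr('0-9a-zA-Z()【】', '0-9a-zA-Z()[]')
end

puts to_single_width('343-213-【E】')  # prints 343-213-[E]
```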
Regexp non alphanumerical but not German characters
To remove all that is not a letter or a space you can use this:
str.gsub(/[^\p{L}\s]+/, '')
I use a negated character class here: [^\p{L}\s] means everything that is not a letter (in any language) or a whitespace character (space, tab, newline).
\p{L} is a Unicode character class for letters.
You can easily add other characters you want to preserve, like -:
str.gsub(/[^\p{L}\s-]+/, '')
example script:
# encoding: UTF-8
str = "mönchengladbach."
str = str.gsub(/[^\p{L}\s]+/, '#')
puts str
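A small side-by-side sketch of the two patterns above, replacing with an empty string as in the first snippet. The method names and the sample string are only illustrative.

```ruby
# Hypothetical helper names; the patterns are the two from the answer.
def letters_only(str)
  str.gsub(/[^\p{L}\s]+/, '')
end

def letters_and_hyphens(str)
  str.gsub(/[^\p{L}\s-]+/, '')
end

sample = "Groß-Gerau, Köln!"
puts letters_only(sample)         # prints GroßGerau Köln
puts letters_and_hyphens(sample)  # prints Groß-Gerau Köln
```

The German letters survive in both cases because \p{L} covers letters of every script, not just a-z.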
php Regex - Pattern incorrect not getting desired results
OK, fortunately it seems PHP's multibyte functions support the Windows-1252 character encoding. This is what I have come up with; hopefully it works:
$whole_wk_file = file_get_contents('Work.arx');
$pattern1 = '\"[^\"\|]+\|[^\"\|]+\|[^\"\|]+\"'; // mb_ereg-style patterns take no /.../ delimiters
mb_internal_encoding("Windows-1252");
mb_eregi($pattern1, $whole_wk_file, $matches_wk);
print_r($matches_wk);
Regex to remove non letters
Just gsub! is sufficient:
o.gsub!(/\W+/, '')
Note that gsub! modifies the original o object. Also, if o does not contain any non-word characters, the result will be nil, so using the return value as the modified string is unreliable.
You probably want this instead:
c = o.gsub(/\W+/, '')
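The difference is easy to see on a string that has nothing to remove. This is a minimal sketch; the variable names are arbitrary.

```ruby
o = "abc123".dup               # contains only word characters

in_place = o.gsub!(/\W+/, '')  # no substitution happened, so gsub! returns nil
copy     = o.gsub(/\W+/, '')   # gsub always returns a string

p in_place  # prints nil
p copy      # prints "abc123"
```

So chaining on the result of gsub! (for example o.gsub!(...).downcase) can raise NoMethodError on nil, while the gsub form is safe.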
Remove non-utf8 characters from string
Using a regex approach:
$regex = <<<'END'
/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC0-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
){1,100} # ...one or more times
)
| . # anything else
/x
END;
$text = preg_replace($regex, '$1', $text);
It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.
$regex = <<<'END'
/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC0-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
){1,100} # ...one or more times
)
| ( [\x80-\xBF] ) # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] ) # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
    if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
    }
    elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2".$captures[2];
    }
    else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3".chr(ord($captures[3])-64);
    }
}
// preg_replace_callback returns the repaired string; it does not modify $text in place.
$text = preg_replace_callback($regex, "utf8replacer", $text);
EDIT:
!empty(x) will match non-empty values ("0" is considered empty).
x != "" will match non-empty values, including "0".
x !== "" will match anything except "".
x != "" seems the best one to use in this case.
I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.
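For comparison, both strategies can be sketched in Ruby with String#scrub (Ruby 2.1 and later). This is an analogue, not a port of the PHP code above: the sample string is made up, and the repair variant assumes the stray bytes were ISO-8859-1, which is the same mapping the callback above performs with its C2/C3 prefixes.

```ruby
# "\xE9" is a lone Latin-1 byte, so this string is not valid UTF-8.
broken = "caf\xE9 au lait"

# Strategy 1: drop every invalid byte.
removed = broken.scrub('')

# Strategy 2: repair by reading each invalid byte run as ISO-8859-1
# (an assumption about where the bad bytes came from).
repaired = broken.scrub { |bytes| bytes.encode('UTF-8', 'ISO-8859-1') }

puts removed   # prints caf au lait
puts repaired  # prints café au lait
```

As with the PHP version, the repair only gives sensible results if the invalid bytes really are leftover Latin-1 text; random corruption would turn into arbitrary accented characters.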
How to iterate UTF-8 string in PHP?
Use preg_split. With the "u" modifier the pattern and subject are treated as UTF-8, so the empty pattern splits between characters rather than between bytes.
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
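For comparison, a Ruby sketch of the same character-wise iteration. Ruby strings carry their encoding, so String#chars and String#each_char already split a UTF-8 string by character; the sample string is arbitrary.

```ruby
str = "日本語abc"

chars = str.chars  # array of single-character strings
str.each_char { |ch| puts "#{ch} (#{ch.bytesize} bytes)" }

p chars.length  # prints 6
p str.bytesize  # prints 12
```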