(grep) Regex to match non-ASCII characters?
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use the POSIX shorthands:
[[:ascii:]]
- matches a single ASCII char
[^[:ascii:]]
- matches a single non-ASCII char
[^[:print:]]
will probably suffice for you.
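Outside grep, the same check can be sketched in Python. This is a minimal sketch and not part of the original answer: Python's re module does not support bracket classes such as [[:ascii:]], so it falls back to the explicit range. The function name lines_with_non_ascii and the sample lines are made up for illustration.

```python
import re

# Same character class as the PCRE above: anything outside 0x00-0x7F.
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def lines_with_non_ascii(lines):
    """Yield (line_number, line) for every line holding a non-ASCII char."""
    for number, line in enumerate(lines, start=1):
        if NON_ASCII.search(line):
            yield number, line


# Hypothetical input, roughly what grep would see line by line.
sample = ["plain ascii", "caf\u00e9", "tab\tonly"]
print(list(lines_with_non_ascii(sample)))
```

Only the second line is reported, because a tab is a control character but still ASCII.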
Regular expression to match non-ASCII characters?
This should do it:
[^\x00-\x7F]+
It matches any character which is not contained in the ASCII character set (0-127, i.e. 0x0 to 0x7F).
You can do the same thing with Unicode:
[^\u0000-\u007F]+
For Unicode, you can look at these two resources:
- Code charts list of Unicode ranges
- This tool to create a regex filtered by Unicode block.
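As a quick check that the two spellings above behave the same, here is a small Python sketch. It assumes Python 3, where the re module accepts both \xhh and \uXXXX escapes in str patterns. The sample text is invented for illustration.

```python
import re

# Hex-escape and Unicode-escape spellings of the same range.
hex_form = re.compile(r'[^\x00-\x7F]+')
unicode_form = re.compile(r'[^\u0000-\u007F]+')

text = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e ok"

# Each match is one maximal run of consecutive non-ASCII characters.
print(hex_form.findall(text))
print(unicode_form.findall(text))
```

Both calls return the same three runs: the single accented letters and the block of three CJK characters.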
Regex match character and non-ascii characters
Try the following expression:
^\*\s*=?\s*[[:^ascii:]\s]+[\r\n]*$
This matches the start of the line ^, then a literal asterisk \*, then zero or more whitespace characters \s*, followed by an optional equals sign =?, then zero or more whitespace characters \s*.
Next, [[:^ascii:]\s]+ matches one or more characters that are each either non-ASCII or whitespace; check the docs for the character-class syntax.
Finally, [\r\n]* matches any carriage returns and newlines that may end the line.
Regex101 Demo
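The expression relies on the [[:^ascii:]] class, which Python's re module does not provide. A rough Python translation, under that assumption, spells "non-ASCII or whitespace" as an alternation instead. The name PATTERN and the test lines are hypothetical.

```python
import re

# [[:^ascii:]\s]+ rewritten as (?:[^\x00-\x7F]|\s)+ because Python's re
# has no POSIX bracket classes. The rest of the expression is unchanged.
PATTERN = re.compile(r'^\*\s*=?\s*(?:[^\x00-\x7F]|\s)+[\r\n]*$')

# Made-up sample lines for illustration.
for line in ["* = \u65e5\u672c\u8a9e \u30c6\u30b9\u30c8", "*\u65e5\u672c\u8a9e", "* = hello"]:
    print(repr(line), bool(PATTERN.search(line)))
```

The first two lines match. The last one does not, because ASCII letters are neither non-ASCII nor whitespace.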
Regular expression that finds and replaces non-ascii characters with Python
Updated for Python 3:
>>> 'Tannh‰user'.encode().decode('ascii', 'replace').replace(u'\ufffd', '_')
'Tannh___user'
First we create a byte string using encode(), which uses the UTF-8 codec by default. If you already have a byte string, skip this step.
Then we convert it back to a "normal" string using the ascii codec with the 'replace' error handler.
This relies on the property of UTF-8 that every non-ASCII character is encoded as a sequence of bytes with values >= 0x80. Each of those bytes is replaced separately, which is why the single character ‰ becomes three underscores.
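To see where the three underscores come from, it may help to inspect the intermediate bytes. A small sketch, assuming Python 3; the variable names are only for illustration:

```python
# '\u2030' is the per-mille sign used in the example string above.
s = 'Tannh\u2030user'

encoded = s.encode()  # UTF-8 by default
print(encoded)        # b'Tannh\xe2\x80\xb0user'

# Every byte belonging to the non-ASCII character is >= 0x80.
high_bytes = [b for b in encoded if b >= 0x80]
print(len(high_bytes))  # 3

# The ascii codec with 'replace' turns each such byte into U+FFFD.
decoded = encoded.decode('ascii', 'replace')
print(decoded.count('\ufffd'))  # 3
```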
Original answer – for Python 2:
How to do it using the built-in str.decode method:
>>> 'Tannh‰user'.decode('ascii', 'replace').replace(u'\ufffd', '_')
u'Tannh___user'
(You get a unicode string, so convert it to str if you need to.)
You can also convert unicode to str, so that each non-ASCII character is replaced by an ASCII one. The problem is that unicode.encode with replace translates non-ASCII characters into '?', so you cannot tell whether a question mark was already there; see the solution from Ignacio Vazquez-Abrams.
Another way is to use ord() and check whether each character's value fits in the ASCII range (0-127). This works for unicode strings and for str in UTF-8, Latin and some other encodings:
>>> s = 'Tannh‰user' # or u'Tannh‰user' in Python 2
>>>
>>> ''.join(c if ord(c) < 128 else '_' for c in s)
'Tannh_user'
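Since the question asks for a regular expression, the same per-character replacement can also be sketched with re.sub. This is an illustrative addition rather than part of the answers above. It assumes Python 3, and the helper name replace_non_ascii is made up.

```python
import re


def replace_non_ascii(text, replacement='_'):
    """Replace every character outside 0x00-0x7F with `replacement`."""
    return re.sub(r'[^\x00-\x7F]', replacement, text)


print(replace_non_ascii('Tannh\u2030user'))            # Tannh_user
print(replace_non_ascii('na\u00efve caf\u00e9', '?'))  # na?ve caf?
```

Like the ord() variant, this works on characters and not bytes, so each non-ASCII character yields exactly one replacement.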
RegEx for removing non ASCII characters from both ends
To trim non-word characters (uppercase \W) from the start and end, and also the underscore, which belongs to the word characters [A-Za-z0-9_], put _ into a character class together with \W.
^[\W_]+|[\W_]+$
See demo at regex101. This is very similar to @CAustin's answer and @sln's comment.
To get the inverse (demo) and match everything from the first to the last alphanumeric character:
[^\W_](?:.*[^\W_])?
Or with alternation (demo); the |[^\W_] branch handles strings that contain just one alphanumeric character.
[^\W_].*[^\W_]|[^\W_]
Use both with re.DOTALL for multiline strings. In regex flavors without such a flag, try [\s\S]* instead of .* (demo).
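A short Python sketch of both approaches follows. The names TRIM and CORE and the sample strings are made up. One assumption worth stating: in Python 3, \W is Unicode-aware for str patterns, so accented letters count as word characters unless re.ASCII is passed.

```python
import re

TRIM = re.compile(r'^[\W_]+|[\W_]+$')                 # strip from both ends
CORE = re.compile(r'[^\W_](?:.*[^\W_])?', re.DOTALL)  # first to last alnum

s = '__--hello world!!_'
print(TRIM.sub('', s))         # hello world
print(CORE.search(s).group())  # hello world

# Without re.ASCII the accented letters are word characters and stay put.
print(re.sub(r'^[\W_]+|[\W_]+$', '', '\u00e9abc\u00e9'))
# With re.ASCII they count as non-word characters and are trimmed.
print(re.sub(r'^[\W_]+|[\W_]+$', '', '\u00e9abc\u00e9', flags=re.ASCII))
```

If the goal is specifically to trim non-ASCII characters from both ends, the re.ASCII variant is probably the closer fit.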