Regular Expression to Match Non-ASCII Characters

(grep) Regex to match non-ASCII characters?

This will match a single non-ASCII character:

[^\x00-\x7F]

This is a valid PCRE (Perl-Compatible Regular Expression).

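You can also use the POSIX-style bracket shorthands where the flavor supports them (PCRE does; [:ascii:] is not one of the standard POSIX classes, so plain grep without -P may reject it):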

  • [[:ascii:]] - matches a single ASCII char
  • [^[:ascii:]] - matches a single non-ASCII char

[^[:print:]] will probably suffice for you.
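
As a minimal sketch of the same idea in Python: the standard `re` module does not support the `[[:ascii:]]` bracket classes, so this example assumes the hex-range form instead. The helper name `find_non_ascii` and the sample string are made up for illustration.

```python
import re

# Matches exactly one character outside the ASCII range (0x00-0x7F).
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def find_non_ascii(text):
    """Return every non-ASCII character in text, in order of appearance."""
    return NON_ASCII.findall(text)

print(find_non_ascii('naïve café'))  # ['ï', 'é']
```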

Regular expression to match non-ASCII characters?

This should do it:

[^\x00-\x7F]+

It matches any character which is not contained in the ASCII character set (0-127, i.e. 0x0 to 0x7F).

You can do the same thing with Unicode:

[^\u0000-\u007F]+

For Unicode, you can look at these two resources:

  • Code charts list of Unicode ranges
  • This tool to create a regex filtered by Unicode block.
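
A quick Python check, as a sketch only, that the \x and \u forms behave the same. It assumes Python 3, where a str is a sequence of code points, so a character outside the Basic Multilingual Plane (such as an emoji) still counts as one character. The sample text is made up.

```python
import re

HEX_FORM = re.compile(r'[^\x00-\x7F]+')
UNICODE_FORM = re.compile(r'[^\u0000-\u007F]+')

text = 'price: 10€, mood: 😀'
print(HEX_FORM.findall(text))      # ['€', '😀']
print(UNICODE_FORM.findall(text))  # ['€', '😀']
```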

Regex match character and non-ascii characters

Try the following expression:

^\*\s*=?\s*[[:^ascii:]\s]+[\r\n]*$

The pattern starts at the beginning of the line (^) and matches a literal asterisk (\*), then zero or more whitespace characters (\s*), an optional equals sign (=?), and zero or more whitespace characters again (\s*).

Next, [[:^ascii:]\s]+ matches one or more characters that are each either non-ASCII or whitespace; check your regex flavor's docs for the character-class syntax.

Finally, [\r\n]*$ matches any carriage returns and newlines that may end the line.
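
Python's `re` module has no `[[:^ascii:]]` class, so the following is an assumed translation rather than the pattern itself: `(?:[^\x00-\x7F]|\s)` stands in for `[[:^ascii:]\s]`. The function name `is_non_ascii_entry` and the sample lines are hypothetical.

```python
import re

# Assumed Python translation of the pattern discussed in this answer:
# (?:[^\x00-\x7F]|\s) replaces the unsupported [[:^ascii:]\s] class.
LINE = re.compile(r'^\*\s*=?\s*(?:[^\x00-\x7F]|\s)+[\r\n]*$')

def is_non_ascii_entry(line):
    """True if line is '*', an optional '=', then only non-ASCII or whitespace."""
    return LINE.match(line) is not None

print(is_non_ascii_entry('* = 日本語 テキスト\n'))  # True
print(is_non_ascii_entry('* = hello'))            # False
```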

Regex101 Demo

Regular expression that finds and replaces non-ascii characters with Python

Updated for Python 3:

>>> 'Tannh‰user'.encode().decode('ascii', 'replace').replace(u'\ufffd', '_')
'Tannh___user'

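First we create a byte string with encode(), which uses the UTF-8 codec by default. If you already have a byte string, skip the encode step.
Then we convert it back to a "normal" string using the ascii codec with the 'replace' error handler, which turns every undecodable byte into U+FFFD.

This relies on the property of UTF-8 that every non-ASCII character is encoded as a sequence of bytes with values >= 0x80. Each such byte is replaced separately, which is why the single character ‰ (three bytes in UTF-8) becomes three underscores.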
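
A small sketch of the byte-level behaviour, assuming Python 3. The wrapper name `replace_non_ascii_bytes` is made up; the calls inside it are the same ones used in the one-liner above.

```python
def replace_non_ascii_bytes(text, placeholder='_'):
    """Replace every UTF-8 byte of each non-ASCII character with placeholder."""
    # encode() defaults to UTF-8; bytes >= 0x80 cannot be decoded as ASCII,
    # so errors='replace' turns each of them into U+FFFD.
    decoded = text.encode().decode('ascii', 'replace')
    return decoded.replace('\ufffd', placeholder)

print(replace_non_ascii_bytes('Tannh‰user'))  # Tannh___user ('‰' is 3 bytes in UTF-8)
print(replace_non_ascii_bytes('Tannhäuser'))  # Tannh__user ('ä' is 2 bytes in UTF-8)
```

The number of placeholders therefore depends on the encoded length of each character, not on the number of characters.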


Original answer – for Python 2:

How to do it using built-in str.decode method:

>>> 'Tannh‰user'.decode('ascii', 'replace').replace(u'\ufffd', '_')
u'Tannh___user'

(You get a unicode string, so convert it to str if you need to.)

You can also go the other way and encode unicode to str, so that each non-ASCII character is replaced by a single ASCII one. The problem is that unicode.encode with 'replace' turns non-ASCII characters into '?', so you cannot tell whether a question mark was already in the original; see the solution from Ignacio Vazquez-Abrams.


Another way is to use ord() and check whether each character's value fits in the ASCII range (0-127). This works for unicode strings and for str in UTF-8, Latin and some other encodings:

>>> s = 'Tannh‰user' # or u'Tannh‰user' in Python 2
>>>
>>> ''.join(c if ord(c) < 128 else '_' for c in s)
'Tannh_user'
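
Since the question asks for a regular expression, the same per-character replacement can also be sketched with re.sub, assuming Python 3 and a str input. The function name `replace_non_ascii` is hypothetical.

```python
import re

def replace_non_ascii(text, placeholder='_'):
    """Replace each non-ASCII character with a single placeholder."""
    return re.sub(r'[^\x00-\x7F]', placeholder, text)

print(replace_non_ascii('Tannh‰user'))  # Tannh_user
```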

RegEx for removing non ASCII characters from both ends

To trim non-word characters (\W, the uppercase form) from the start and end, and also the underscore, which counts as a word character ([A-Za-z0-9_]), put _ into a character class together with \W.

^[\W_]+|[\W_]+$

See demo at regex101. This is very similar to @CAustin's answer and @sln's comment.


To get the inverse (demo) and match everything from the first to the last alphanumeric character:

[^\W_](?:.*[^\W_])?

Or with alternation (demo), where |[^\W_] handles strings that contain just one alphanumeric character:

[^\W_].*[^\W_]|[^\W_]

Use both with re.DOTALL for multiline strings. In regex flavors without such a flag, try [\s\S]* instead of .* (demo).
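
The patterns from this answer can be tried in Python roughly as follows. This is a sketch with one added assumption: the re.ASCII flag, which is not part of the answer, is used so that \W stays ASCII-only in Python 3 and accented letters at the ends are trimmed too. The names `trim_ends` and `core` are made up.

```python
import re

# Assumption: re.ASCII limits word characters to [A-Za-z0-9_].
TRIM = re.compile(r'^[\W_]+|[\W_]+$', re.ASCII)
CORE = re.compile(r'[^\W_](?:.*[^\W_])?', re.ASCII | re.DOTALL)

def trim_ends(text):
    """Strip everything but ASCII alphanumerics from both ends of text."""
    return TRIM.sub('', text)

def core(text):
    """Return the span from the first to the last ASCII alphanumeric, or ''."""
    m = CORE.search(text)
    return m.group(0) if m else ''

print(trim_ends('__¡Hello, World!__'))  # Hello, World
print(repr(core('--line one\nline two--')))  # 'line one\nline two'
```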


