How Can Non-ASCII Characters Be Removed from a String

How can non-ASCII characters be removed from a string?

This replaces every character outside the ASCII range (U+0000 to U+007F) with an empty string:

String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");

Remove non-ASCII characters from String in Java

I'm guessing that the source of the URL is more at fault. Perhaps you're fixing the wrong problem? Removing "strange" characters from a URI might give it an entirely different meaning.

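With that said, you may be able to remove everything outside the printable ASCII range (space through ~) with a simple string replacement. Note that this also strips ASCII control characters such as tabs and newlines: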

String fixed = original.replaceAll("[^\\x20-\\x7e]", "");

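Or you can extend the allowed range to every character that is not a four-byte UTF-8 sequence (that is, keep the whole Basic Multilingual Plane and remove only supplementary characters) if the first pattern doesn't handle the "�" character the way you need: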

String fixed = original.replaceAll("[^\\u0000-\\uFFFF]", "");
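To see how the three ranges above differ, here is a minimal Python sketch (the patterns translate directly to Python's re module; the sample string and variable names are made up for illustration):

```python
import re

# Hypothetical sample: ASCII letters, "é" (U+00E9), a tab,
# an emoji outside the BMP (U+1F600), and "!"
sample = "caf\u00e9\t\U0001F600!"

# Keep all of ASCII, including control characters such as the tab
ascii_only = re.sub(r"[^\x00-\x7F]", "", sample)

# Keep only printable ASCII (space through ~); the tab is removed too
printable_only = re.sub(r"[^\x20-\x7E]", "", sample)

# Keep the whole Basic Multilingual Plane; only the emoji is removed
bmp_only = re.sub(r"[^\u0000-\uFFFF]", "", sample)

print(repr(ascii_only))      # 'caf\t!'
print(repr(printable_only))  # 'caf!'
print(repr(bmp_only))        # 'café\t!'
```

Which range is right depends on whether control characters and non-ASCII letters are meaningful in your data.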

How to remove non-ASCII characters (non-keyboard special characters) from a text in Hive

You can use

regexp_replace('123Abh¿½ï¿½ï¿½ï¿½ï¿½v streeÁÉÍÓt', '[^\\x{0000}-\\x7E]+', '')

Here,

  • [^ - start of a negated character class that matches any character except
    • \x{0000}-\x7E - characters from NUL to ~ in the ASCII table
  • ]+ - end of the class, matched one or more times.

What if I need to remove all special characters apart from spaces and hyphens? In this case, you need to use

regexp_replace('123Abh¿½ï¿½ï¿½ï¿½ï¿½v streeÁÉÍÓt', '[^\\w\\s-]|_', '')

Here, [^\w\s-]|_ matches any single character other than a letter, digit, underscore, whitespace, or hyphen, or else an underscore. Because \w also matches underscores, the negated class alone would keep them, so _ is added as a separate alternative with the | alternation operator.
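If you want to try that pattern outside Hive, a rough Python sketch follows. It assumes Hive's regex engine treats \w and \s as ASCII-only (the default for Java regular expressions), which re.ASCII imitates here; strip_special and the sample input are hypothetical names, not part of Hive.

```python
import re

def strip_special(text):
    """Remove everything except ASCII letters, digits, whitespace and hyphens."""
    # re.ASCII restricts \w and \s to ASCII, so accented letters are removed too.
    # The "|_" alternative removes underscores, which \w would otherwise keep.
    return re.sub(r"[^\w\s-]|_", "", text, flags=re.ASCII)

print(repr(strip_special("snake_case: foo-bar é!")))  # 'snakecase foo-bar '
```

Without re.ASCII, Python's \w would match Unicode letters and "é" would survive, so the flag matters for this comparison.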

Remove non-ascii and special characters from a string Python

You can try using a simple regex:

import re

my_string = "Bjørn 10.2.3"
# Keep only ASCII letters, digits, spaces and hyphens
new_string = re.sub(r'[^A-Za-z0-9 -]', '', my_string)
print(new_string)

Output:

Bjrn 1023
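One thing to watch for when writing this kind of character class: a range such as A-z is not the same as A-Za-z, because the ASCII table has six punctuation characters between Z and a. A small sketch with a made-up test string shows the difference:

```python
import re

# Hypothetical input containing the six characters between "Z" and "a": [ \ ] ^ _ `
tricky = "a_b[c]^d`e\\f"

# A-z spans 0x41 to 0x7A, so the punctuation in between is kept
loose = re.sub(r"[^A-z0-9 -]", "", tricky)

# A-Za-z keeps letters only
strict = re.sub(r"[^A-Za-z0-9 -]", "", tricky)

print(loose)   # a_b[c]^d`e\f
print(strict)  # abcdef
```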

Remove non-ASCII non-printable characters from a String

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.

String clean = str.replaceAll("\\P{Print}", "");

Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)


Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.

This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.

That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
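The distinction between a literal \xHH escape (four ASCII characters) and the single character it stands for is easy to miss, so here is a small Python sketch of the same idea. The variable names and sample strings are invented for illustration.

```python
import re

# Four-character escape sequences spelled out as plain ASCII text,
# e.g. as they might appear in a log file
logged = r"abc\xC3\xA9def"

# The same position holding one real non-ASCII character ("é")
real = "abc\u00e9def"

# Matches a backslash, "x", and exactly two hex digits
escape_pattern = re.compile(r"\\x[0-9A-Fa-f]{2}")

print(escape_pattern.sub("", logged))  # abcdef
print(escape_pattern.sub("", real))    # abcédef (nothing to strip)
```

If the second case is what your data actually contains, the earlier non-ASCII or non-printable pattern is the one to use.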

The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.

Replace non-ASCII characters with a single space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
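A quick side-by-side of the two approaches, using a made-up string with two adjacent non-ASCII characters:

```python
import re

# Hypothetical input: "a", two "é" characters in a row, then "b"
text = "a\u00e9\u00e9b"

# One space per replaced character
per_char = ''.join([c if ord(c) < 128 else ' ' for c in text])

# One space per run of consecutive non-ASCII characters
per_run = re.sub(r'[^\x00-\x7F]+', ' ', text)

print(repr(per_char))  # 'a  b'
print(repr(per_run))   # 'a b'
```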

How can you strip non-ASCII characters from a string? (in C#)

string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

