How to Check If a String Is Unicode or ASCII

How do I check if a string is Unicode or ASCII?

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.
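As a minimal sketch of telling the two Python 3 types apart (the helper name `kind_of` is made up for this illustration):

```python
def kind_of(value):
    """Return a short label for the Python 3 type of value."""
    if isinstance(value, str):
        return "text (str)"
    if isinstance(value, (bytes, bytearray)):
        return "raw bytes"
    return "not a string"


print(kind_of("héllo"))                  # text (str)
print(kind_of("héllo".encode("utf-8")))  # raw bytes
print(kind_of(42))                       # not a string
```

Encoding a str gives bytes and decoding bytes gives a str, so the type tells you which side of that boundary you are on, not which characters are present.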

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
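To illustrate the last point in Python 3 terms, here is a rough sketch that guesses what a bytestring holds by trying to decode it. The function name `classify_bytes` and its three labels are assumptions for this example, and the result is only a heuristic: non-textual data can happen to decode cleanly.

```python
def classify_bytes(data):
    """Guess what a bytestring holds by trying to decode it."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Neither ASCII nor valid UTF-8: another encoding, or not text at all.
        return "unknown"


print(classify_bytes(b"hello"))               # ascii
print(classify_bytes("é".encode("utf-8")))    # utf-8
print(classify_bytes(b"\xff\xfe\x00"))        # unknown
```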

vb6: how detect if string is unicode

Only the characters from 0 to 127 are "safe." ANSI character values from 128 to 255 have different meanings and character mappings in different locales.

For example, in the U.S. English locale:

Option Explicit

Private Sub Form_Load()
    Dim S As String

    S = "‰"
    Debug.Print S, Asc(S), AscW(S)
End Sub

Produces:

‰              137           8240 
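The same two numbers can be reproduced in Python, which may make the distinction easier to see. This is only a sketch: it assumes the U.S. English ANSI code page corresponds to Python's "cp1252" codec, and the helper name `decode_under` is invented for the example.

```python
per_mille = "\u2030"  # the "‰" character used above

print(ord(per_mille))                 # 8240, the Unicode code point (like AscW)
print(per_mille.encode("cp1252")[0])  # 137, the byte value in cp1252 (like Asc)


def decode_under(raw, codecs):
    """Decode the same bytes under several code pages."""
    return {name: raw.decode(name) for name in codecs}


# One byte above 127 maps to a different character in each code page.
print(decode_under(b"\xe9", ["cp1252", "cp1251", "cp1253"]))
# {'cp1252': 'é', 'cp1251': 'й', 'cp1253': 'ι'}
```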

How to check if a string in Python is in ASCII?

def is_ascii(s):
    return all(ord(c) < 128 for c in s)
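Two alternatives, shown here as a sketch rather than a recommendation: Python 3.7 and later provide `str.isascii()`, and on older versions you can attempt an ASCII encode and catch the failure. The name `is_ascii_by_encoding` is made up for this example.

```python
def is_ascii_by_encoding(s):
    """True if every character of s can be encoded as ASCII."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii_by_encoding("hello"))  # True
print(is_ascii_by_encoding("héllo"))  # False
print("hello".isascii())              # True (Python 3.7+)
```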

Checking if string contains unicode using standard Python

There is no point in testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).

If you want to test for Emoji code, test for specific ranges, as outlined by the Unicode Emoji class specification:

re.compile(
    u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3...\U0001F9C0]',
    flags=re.UNICODE)

where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.

Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.

Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.
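For a runnable Python 3 illustration, here is a sketch that uses only a handful of ranges. The selection is an assumption made for this example (the first few entries from the pattern above plus the Emoticons block U+1F600 to U+1F64F) and is deliberately incomplete; the names `EMOJI_SAMPLE` and `has_sample_emoji` are hypothetical.

```python
import re

# Deliberately incomplete: a few ranges for illustration only.
EMOJI_SAMPLE = re.compile(
    "[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3\U0001F600-\U0001F64F]"
)


def has_sample_emoji(text):
    """True if text contains a character from the sample ranges."""
    return EMOJI_SAMPLE.search(text) is not None


print(has_sample_emoji("on time \u231A"))  # True
print(has_sample_emoji("\U0001F600"))      # True
print(has_sample_emoji("plain # text"))    # False
```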

how to distinguish Unicode characters and ASCII characters

ASCII characters exist in Unicode; they are Unicode codepoints U+0000 - U+007F, inclusive.

Java strings are represented in UTF-16, which is a 16-bit encoding of Unicode. Each Java char is a UTF-16 code unit. Unicode codepoints U+0000 - U+FFFF use 1 UTF-16 code unit and thus fit in a single char, whereas Unicode codepoints U+10000 and higher require a UTF-16 surrogate pair and thus need two chars.
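The code unit counts can be checked outside Java as well. The following Python sketch (function names are made up for the illustration) counts UTF-16 code units for a character and computes the surrogate pair for a codepoint at or above U+10000 using the standard UTF-16 arithmetic.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    """Return (high, low) surrogates for a code point >= U+10000."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


print(utf16_units("A"))           # 1
print(utf16_units("\u20ac"))      # 1
print(utf16_units("\U0001F600"))  # 2
print([hex(u) for u in surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```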

If the string has UTF-16 code units represented as actual char values, then you can use Java's string methods that work with codepoints, eg:

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j;
    while (i < unicodeStr.length()) {
        j = unicodeStr.offsetByCodePoints(i, 1);
        list.add(unicodeStr.substring(i, j));
        i = j;
    }
    return list.toArray(new String[list.size()]);
}

On the other hand, if the string has UTF-16 code units represented in an encoded "\uXXXX" format (ie, as 6 distinct characters - '\', 'u', ...), then things get a little more complicated as you have to parse the encoded sequences manually.

If you want to preserve the "\uXXXX" strings in your array, you could do something like this:

private boolean isUnicodeEncoded(String s, int index)
{
    return (
        (s.charAt(index) == '\\') &&
        ((index+5) < s.length()) &&
        (s.charAt(index+1) == 'u')
    );
}

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j, start;
    char ch;
    while (i < unicodeStr.length()) {
        start = i;
        if (isUnicodeEncoded(unicodeStr, i)) {
            ch = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
            j = 6;
        }
        else {
            ch = unicodeStr.charAt(i);
            j = 1;
        }
        i += j;
        if (Character.isHighSurrogate(ch) && (i < unicodeStr.length())) {
            if (isUnicodeEncoded(unicodeStr, i)) {
                ch = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
                j = 6;
            }
            else {
                ch = unicodeStr.charAt(i);
                j = 1;
            }
            if (Character.isLowSurrogate(ch)) {
                i += j;
            }
        }
        list.add(unicodeStr.substring(start, i));
    }
    return list.toArray(new String[list.size()]);
}

If you want to decode the "\uXXXX" strings into actual chars in your array, you could do something like this instead:

private boolean isUnicodeEncoded(String s, int index)
{
    return (
        (s.charAt(index) == '\\') &&
        ((index+5) < s.length()) &&
        (s.charAt(index+1) == 'u')
    );
}

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j;
    char ch1, ch2;
    while (i < unicodeStr.length()) {
        if (isUnicodeEncoded(unicodeStr, i)) {
            ch1 = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
            j = 6;
        }
        else {
            ch1 = unicodeStr.charAt(i);
            j = 1;
        }
        i += j;
        if (Character.isHighSurrogate(ch1) && (i < unicodeStr.length())) {
            if (isUnicodeEncoded(unicodeStr, i)) {
                ch2 = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
                j = 6;
            }
            else {
                ch2 = unicodeStr.charAt(i);
                j = 1;
            }
            if (Character.isLowSurrogate(ch2)) {
                list.add(String.valueOf(new char[]{ch1, ch2}));
                i += j;
                continue;
            }
        }
        list.add(String.valueOf(ch1));
    }
    return list.toArray(new String[list.size()]);
}

Or, something like this (per https://stackoverflow.com/a/24046962/65863):

// Properties.load(Reader) declares IOException, so the method must declare it too.
private String[] getCharArray(String unicodeStr) throws IOException {
    Properties p = new Properties();
    p.load(new StringReader("key="+unicodeStr));
    unicodeStr = p.getProperty("key");
    ArrayList<String> list = new ArrayList<>();
    int i = 0;
    while (i < unicodeStr.length()) {
        if (Character.isHighSurrogate(unicodeStr.charAt(i)) &&
            ((i+1) < unicodeStr.length()) &&
            Character.isLowSurrogate(unicodeStr.charAt(i+1)))
        {
            list.add(unicodeStr.substring(i, i+2));
            i += 2;
        }
        else {
            list.add(unicodeStr.substring(i, i+1));
            ++i;
        }
    }
    return list.toArray(new String[list.size()]);
}
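For comparison, the decode-then-split idea can be sketched in Python, where a str is already a sequence of codepoints. This is an illustration under stated assumptions, not a port of the Java code: the names `decode_u_escapes` and `split_code_points` are hypothetical, an escaped backslash before `u` is not treated specially, and an unpaired surrogate in the input makes the final decode raise an error.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    """Turn literal \\uXXXX sequences into the characters they name."""
    # Replace each \uXXXX with the UTF-16 code unit it names.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Round-trip through UTF-16 so surrogate pairs merge into one code point.
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def split_code_points(text):
    """Decode the escapes, then return one string per code point."""
    return list(decode_u_escapes(text))


print(split_code_points(r"a\u00e9\ud83d\ude00"))  # ['a', 'é', '😀']
```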

How to know there is any UTF8 character in a string with Javascript?

A string is a series of characters, each of which has a character code. ASCII defines characters from 0 to 127, so if a character in the string has a code greater than that, it lies outside ASCII. This function checks for that. See String#charCodeAt.

function hasUnicode (str) {
    for (var i = 0; i < str.length; i++) {
        if (str.charCodeAt(i) > 127) return true;
    }
    return false;
}

Then use it like this: hasUnicode("Xin chào tất cả mọi người")

How to check a String if it's an ASCII or not?

If your strings are Unicode (and they really should be, nowadays), you can simply check that all code points are 127 or less. The bottom 128 code points of Unicode are ASCII.
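When the check fails it often helps to know which characters caused it. A small Python sketch along those lines (the function name `non_ascii_chars` is made up for this example):

```python
def non_ascii_chars(s):
    """List (index, character, code point) for every non-ASCII character."""
    return [(i, c, "U+%04X" % ord(c)) for i, c in enumerate(s) if ord(c) > 127]


print(non_ascii_chars("Xin ch\u00e0o"))  # [(6, 'à', 'U+00E0')]
print(non_ascii_chars("hello"))          # []
```

An empty list means the string is pure ASCII.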
