How do I check if a string is unicode or ascii?
In Python 3, all strings are sequences of Unicode characters. There is a bytes
type that holds raw bytes.
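As a rough Python 3 sketch of that distinction (the helper name `describe` is made up for illustration and is not part of any library), the type check could look like this:

```python
def describe(value):
    """Report whether value is text (str), raw bytes, or something else."""
    if isinstance(value, str):
        return "text (str)"
    if isinstance(value, (bytes, bytearray)):
        return "raw bytes"
    return "not a string"


print(describe("héllo"))                  # a str: a sequence of Unicode characters
print(describe("héllo".encode("utf-8")))  # bytes: the UTF-8 encoding of that text
print(describe(42))
```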
In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"
This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
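To check content rather than type, one option is sketched below for Python 3. The helper names `fits_in_ascii` and `bytes_are_ascii` are assumptions made up for this example.

```python
def fits_in_ascii(text):
    """Return True if every character of a str can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def bytes_are_ascii(data):
    """Return True if every byte value is in the ASCII range 0..127."""
    # Iterating over bytes in Python 3 yields integers.
    return all(b < 128 for b in data)


print(fits_in_ascii("hello"), fits_in_ascii("héllo"))
print(bytes_are_ascii(b"hello"), bytes_are_ascii(b"\xff\xfe"))
```

A str that passes the first check and a bytes object that passes the second still have different types, which is the point made above.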
vb6: how detect if string is unicode
Only the characters from 0 to 127 are "safe." ANSI character values from 128 to 255 have different meanings and character mappings in different locales.
For example, in the U.S. English locale:
Option Explicit
Private Sub Form_Load()
    Dim S As String
    S = "‰"
    Debug.Print S, Asc(S), AscW(S)
End Sub
Produces:
‰ 137 8240
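The same two numbers can be reproduced in Python 3, assuming the U.S. English ANSI code page is Windows-1252 (`cp1252`). This is only an illustrative sketch, not VB6 behavior itself; the byte 0xE9 is an extra example chosen here to show a value above 127 changing meaning between code pages.

```python
s = "\u2030"  # PER MILLE SIGN, the character used in the VB6 example

ansi_value = s.encode("cp1252")[0]  # byte value in Windows-1252
unicode_value = ord(s)              # Unicode code point
print(ansi_value, unicode_value)    # prints: 137 8240

# A byte above 127 maps to different characters in different code pages.
western = b"\xe9".decode("cp1252")   # 'é' in Windows-1252
cyrillic = b"\xe9".decode("cp1251")  # 'й' in Windows-1251
print(western, cyrillic)
```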
How to check if a string in Python is in ASCII?
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
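On Python 3.7 and later, the built-in `str.isascii()` method performs the same check. A minimal sketch, assuming that Python version (the wrapper name `is_ascii_text` is made up for illustration):

```python
def is_ascii_text(s):
    """Return True if s contains only ASCII characters (Python 3.7+)."""
    return s.isascii()


print(is_ascii_text("hello"))
print(is_ascii_text("héllo"))
```

Note that an empty string counts as ASCII with both approaches.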
Checking if string contains unicode using standard Python
There is no point in testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).
If you want to test for Emoji code, test for specific ranges, as outlined by the Unicode Emoji class specification:
re.compile(
    u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3...\U0001F9C0]',
    flags=re.UNICODE)
where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.
Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.
Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.
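As a runnable illustration of the range-matching idea on Python 3.3+, the sketch below matches only the Emoticons block (U+1F600 through U+1F64F). That single range is an assumption chosen for brevity; it is a small subset of what the Unicode Emoji data lists, and the names `EMOTICONS` and `contains_emoticon` are made up for this example.

```python
import re

# Illustrative subset only: the Emoticons block, U+1F600 through U+1F64F.
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]")


def contains_emoticon(text):
    """Return True if text contains a code point in the Emoticons block."""
    return EMOTICONS.search(text) is not None


print(contains_emoticon("hello \U0001F600"))
print(contains_emoticon("hello #"))
```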
how to distinguish Unicode characters and ASCII characters
ASCII characters exist in Unicode: they are Unicode codepoints U+0000 - U+007F, inclusive.
Java strings are represented in UTF-16, which is a 16-bit encoding of Unicode. Each Java char is a UTF-16 code unit. Unicode codepoints U+0000 - U+FFFF use 1 UTF-16 code unit and thus fit in a single char, whereas Unicode codepoints U+10000 and higher require a UTF-16 surrogate pair and thus need two chars.
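The code-unit counts can be checked from Python 3 by encoding to UTF-16 and dividing the byte length by two. This is a sketch for illustration only; `utf16_code_units` is a made-up helper name.

```python
def utf16_code_units(ch):
    """Return how many 16-bit UTF-16 code units are needed to encode ch."""
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one code unit.
    return len(ch.encode("utf-16-le")) // 2


print(utf16_code_units("A"))           # U+0041, inside the BMP
print(utf16_code_units("\u20ac"))      # U+20AC EURO SIGN, inside the BMP
print(utf16_code_units("\U0001F600"))  # U+1F600, needs a surrogate pair
```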
If the string has UTF-16 code units represented as actual char values, then you can use Java's String methods that work with codepoints, e.g.:
private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j;
    while (i < unicodeStr.length()) {
        j = unicodeStr.offsetByCodePoints(i, 1);
        list.add(unicodeStr.substring(i, j));
        i = j;
    }
    return list.toArray(new String[list.size()]);
}
On the other hand, if the string has UTF-16 code units represented in an encoded "\uXXXX" format (i.e., as 6 distinct characters: '\', 'u', ...), then things get a little more complicated, as you have to parse the encoded sequences manually.
If you want to preserve the "\uXXXX" strings in your array, you could do something like this:
private boolean isUnicodeEncoded(String s, int index)
{
    return (
        (s.charAt(index) == '\\') &&
        ((index+5) < s.length()) &&
        (s.charAt(index+1) == 'u')
    );
}
private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j, start;
    char ch;
    while (i < unicodeStr.length()) {
        start = i;
        if (isUnicodeEncoded(unicodeStr, i)) {
            ch = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
            j = 6;
        }
        else {
            ch = unicodeStr.charAt(i);
            j = 1;
        }
        i += j;
        if (Character.isHighSurrogate(ch) && (i < unicodeStr.length())) {
            if (isUnicodeEncoded(unicodeStr, i)) {
                ch = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
                j = 6;
            }
            else {
                ch = unicodeStr.charAt(i);
                j = 1;
            }
            if (Character.isLowSurrogate(ch)) {
                i += j;
            }
        }
        list.add(unicodeStr.substring(start, i));
    }
    return list.toArray(new String[list.size()]);
}
If you want to decode the "\uXXXX" strings into actual chars in your array, you could do something like this instead:
private boolean isUnicodeEncoded(String s, int index)
{
    return (
        (s.charAt(index) == '\\') &&
        ((index+5) < s.length()) &&
        (s.charAt(index+1) == 'u')
    );
}
private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j;
    char ch1, ch2;
    while (i < unicodeStr.length()) {
        if (isUnicodeEncoded(unicodeStr, i)) {
            ch1 = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
            j = 6;
        }
        else {
            ch1 = unicodeStr.charAt(i);
            j = 1;
        }
        i += j;
        if (Character.isHighSurrogate(ch1) && (i < unicodeStr.length())) {
            if (isUnicodeEncoded(unicodeStr, i)) {
                ch2 = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
                j = 6;
            }
            else {
                ch2 = unicodeStr.charAt(i);
                j = 1;
            }
            if (Character.isLowSurrogate(ch2)) {
                list.add(String.valueOf(new char[]{ch1, ch2}));
                i += j;
                continue;
            }
        }
        list.add(String.valueOf(ch1));
    }
    return list.toArray(new String[list.size()]);
}
Or, something like this (per https://stackoverflow.com/a/24046962/65863):
// Needs java.util.ArrayList, java.util.Properties, java.io.StringReader, java.io.IOException.
private String[] getCharArray(String unicodeStr) throws IOException {
    // Properties.load() decodes "\uXXXX" escapes while reading the value.
    Properties p = new Properties();
    p.load(new StringReader("key="+unicodeStr));
    unicodeStr = p.getProperty("key");
    ArrayList<String> list = new ArrayList<>();
    int i = 0;
    while (i < unicodeStr.length()) {
        if (Character.isHighSurrogate(unicodeStr.charAt(i)) &&
            ((i+1) < unicodeStr.length()) &&
            Character.isLowSurrogate(unicodeStr.charAt(i+1)))
        {
            list.add(unicodeStr.substring(i, i+2));
            i += 2;
        }
        else {
            list.add(unicodeStr.substring(i, i+1));
            ++i;
        }
    }
    return list.toArray(new String[list.size()]);
}
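For comparison, the same decode-then-split idea can be sketched in Python 3. This is not a translation of the Java code above, only an illustration under stated assumptions: the regex handles just the four-hex-digit "\uXXXX" form, and the names `ESCAPE` and `split_code_points` are made up for this example.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def split_code_points(escaped):
    """Decode \\uXXXX escapes and return a list with one string per code point."""
    # Turn each escape into the UTF-16 code unit it names (possibly a lone surrogate).
    units = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), escaped)
    # Round-trip through UTF-16 so adjacent high/low surrogates merge into one code point.
    merged = units.encode("utf-16-le", "surrogatepass").decode("utf-16-le", "surrogatepass")
    return list(merged)


print(split_code_points(r"a\ud83d\ude00b"))
```

Because a Python 3 str is indexed by code point, no explicit surrogate bookkeeping is needed once the pair has been merged.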
How to know there is any UTF8 character in a string with Javascript?
A string is a series of characters, each of which has a character code. ASCII defines characters from 0 to 127, so if a character in the string has a code greater than that, then it is a non-ASCII character. This function checks for that. See String#charCodeAt, which returns the UTF-16 code unit at the given index.
function hasUnicode (str) {
    for (var i = 0; i < str.length; i++) {
        if (str.charCodeAt(i) > 127) return true;
    }
    return false;
}
Then use it like this: hasUnicode("Xin chào tất cả mọi người"), which returns true.
How to check a String if it's an ASCII or not?
If your strings are Unicode (and they really should be, nowadays), you can simply check that all code points are 127 or less. The bottom 128 code points of Unicode are ASCII.
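A minimal Python 3 sketch of that check, which also reports where the first offending character sits (the function name `first_non_ascii` is made up for illustration):

```python
def first_non_ascii(text):
    """Return the index of the first code point above 127, or None if all are ASCII."""
    for index, ch in enumerate(text):
        if ord(ch) > 127:
            return index
    return None


print(first_non_ascii("plain text"))
print(first_non_ascii("caf\u00e9"))
```

A result of None means the whole string is ASCII; any integer result points at the first character that is not.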