How to Recognize If a String Contains Unicode Chars

How to recognize if a string contains unicode chars?

If my assumptions are correct, you wish to know whether your string contains any "non-ANSI" characters, i.e. characters outside the 0–255 code range. You can check this as follows.

    public void test()
    {
        const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
        const string WithoutUnicodeCharacter = "an ANSI character:Æ";

        bool hasUnicode;

        // true
        hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
        Console.WriteLine(hasUnicode);

        // false
        hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
        Console.WriteLine(hasUnicode);
    }

    public bool ContainsUnicodeCharacter(string input)
    {
        const int MaxAnsiCode = 255;

        return input.Any(c => c > MaxAnsiCode);
    }

Update

This checks against the extended ASCII range (up to 255). If you checked only the true ASCII range (up to 127), you could get false positives for extended ASCII characters, which do not necessarily denote Unicode. I have alluded to this in my sample.
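The same check translates directly to other languages. Here is a Python sketch (function name and parameter are my own) of the greater-than-255 test, which also illustrates the ASCII versus extended-ASCII distinction from the update above:

```python
def contains_unicode_character(text, max_code=255):
    """Return True if any character falls outside the given code range."""
    return any(ord(c) > max_code for c in text)

# U+FB2F is outside both the ASCII and extended ASCII ranges.
print(contains_unicode_character("a hebrew character:\uFB2F"))      # True
# Æ (U+00C6) fits in extended ASCII, but not in true ASCII.
print(contains_unicode_character("an ANSI character:\u00C6"))       # False
print(contains_unicode_character("an ANSI character:\u00C6", 127))  # True
```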

Check if Java string contains Unicode character

You can invert the font selection logic:

The Font class has goodies like canDisplay and canDisplayUpTo. Javadoc:

public int canDisplayUpTo(String str)

Indicates whether or not this Font can display a specified String. For
strings with Unicode encoding, it is important to know if a particular
font can display the string. This method returns an offset into the
String str which is the first character this Font cannot display
without using the missing glyph code. If the Font can display all
characters, -1 is returned.

Check if String contains unicode character

In C#, Unicode character escape sequences are written as \u25CF, while in XML or HTML the equivalent is a numeric character reference such as &#x25CF;.

So you should write

Text.Contains("\u25CF")

How to find out if a string contains a Unicode character in C#

You could do something like this.

    string input = ... // your input.

    if (input.Any(c => c > 255))
    {
        // unicode
    }

Check if string contains range of Unicode characters

Is this what you want?

    public static bool ContainsInvalidCharacters(string name)
    {
        return name.IndexOfAny(new[]
        {
            '\u0001', '\u0002', '\u0003',
        }) != -1;
    }

and

    bool res = ContainsInvalidCharacters("Hello\u0001");

Note the use of '\uXXXX': the single quotes denote a char rather than a string.
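For comparison, a minimal Python sketch of the same blacklist check (the character set and function name are my own illustration, mirroring the C# above):

```python
# Blacklisted control characters, mirroring the IndexOfAny array above.
INVALID_CHARS = {'\u0001', '\u0002', '\u0003'}

def contains_invalid_characters(name):
    """Return True if name contains any blacklisted character."""
    return any(c in INVALID_CHARS for c in name)

print(contains_invalid_characters("Hello\u0001"))  # True
print(contains_invalid_characters("Hello"))        # False
```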

Is there a way to check if a string contains a Unicode letter?

The main point here is that MATCHES requires a full-string match, and also that a backslash passed to the regex engine must be a literal backslash.

The regex can thus be

(?s).*\p{L}.*

Which means:

  • (?s) - enable DOTALL mode
  • .* - match zero or more of any character
  • \p{L} - match a Unicode letter
  • .* - match zero or more of any character

In iOS, just double the backslashes:

NSPredicate * predicat = [NSPredicate predicateWithFormat:@"SELF MATCHES '(?s).*\\p{L}.*'"];

If the backslashes inside the NSPredicate format string are treated specially, use:

NSPredicate * predicat = [NSPredicate predicateWithFormat:@"SELF MATCHES '(?s).*\\\\p{L}.*'"];
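As a side note, Python's built-in re module does not support \p{L}, but the same "contains a Unicode letter" test can be sketched without a regex at all, since str.isalpha() is Unicode-aware (function name is my own):

```python
def contains_unicode_letter(text):
    """Return True if text contains at least one Unicode letter."""
    return any(c.isalpha() for c in text)

print(contains_unicode_letter("123 !?"))  # False
print(contains_unicode_letter("123 é"))   # True
```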

Checking if string contains unicode using standard Python

There is no point in testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).

If you want to test for Emoji code, test for specific ranges, as outlined by the Unicode Emoji class specification:

    re.compile(
        u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3...\U0001F9C0]',
        flags=re.UNICODE)

where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.

Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.

Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.
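To make the range test concrete, here is a small runnable check against just the first few ranges shown above (a deliberately tiny subset, not the full 209-entry Emoji set):

```python
import re

# A tiny subset of the Emoji ranges from the specification quoted above.
EMOJI_SUBSET = re.compile(u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3]')

def contains_emoji_subset(text):
    """Return True if text contains a codepoint from the subset above."""
    return EMOJI_SUBSET.search(text) is not None

print(contains_emoji_subset("time: \u231A"))  # True (U+231A WATCH)
print(contains_emoji_subset("plain text"))    # False
```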

Check if string contains only Unicode values [\u0030-\u0039] or [\u0660-\u0669]

Use \x{...} escapes for the Unicode characters:

^([\x{0030}-\x{0039}\x{0660}-\x{0669}]+)$

If the pattern should match an empty string too, use * instead of +.

Use this if you don't want to allow mixing characters from the two sets you provided:

^([\x{0030}-\x{0039}]+|[\x{0660}-\x{0669}]+)$

https://regex101.com/r/xqWL4q/6

As mentioned by Holger in the comments below, \x{0030}-\x{0039} is equivalent to [0-9], so it could be substituted to make the pattern more readable.
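A Python sketch of the "no mixing" variant, written with \u escapes (and the [0-9] substitution suggested above) since Python's re module lacks the \x{...} syntax:

```python
import re

# Either all ASCII digits (U+0030-U+0039) or all Arabic-Indic digits
# (U+0660-U+0669), but never a mix of the two sets.
UNMIXED_DIGITS = re.compile(u'^(?:[0-9]+|[\u0660-\u0669]+)$')

print(bool(UNMIXED_DIGITS.match("0123")))          # True
print(bool(UNMIXED_DIGITS.match("\u0660\u0661")))  # True
print(bool(UNMIXED_DIGITS.match("1\u0661")))       # False (mixed sets)
```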


