Regular expression matching emoji in Mac OS X / iOS

The upcoming Unicode Emoji data files would help with this. At the moment these are still drafts, but they might still help you out.

By parsing http://www.unicode.org/Public/emoji/1.0/emoji-data.txt you could quite easily get a list of all emoji in the Unicode standard. (Note that some of these emoji consist of multiple code points.) Once you have such a list, it’s trivial to turn it into a regular expression.

Here’s a JavaScript version: https://github.com/mathiasbynens/emoji-regex/blob/master/index.js
And here’s the script that generates it based on the data from emoji-data.txt: https://github.com/mathiasbynens/emoji-regex/blob/master/scripts/generate-regex.js
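Since the question is about Mac OS X / iOS, here is a minimal Swift sketch of the same idea, parsing a couple of lines in the emoji-data.txt format into a set of scalars you can test membership against. The sample lines and the field layout are my reading of the draft files, not code from the linked repository.

import Foundation

// A minimal sketch: turn a few lines in the emoji-data.txt format
// (a code point or code point range, ';', property, '#' comment) into a set of scalars.
let sampleLines = """
231A..231B    ; Emoji    # 1.1  [2] (⌚..⌛) watch..hourglass
1F600         ; Emoji    # 6.1  [1] (😀)    grinning face
"""

var emojiScalars = Set<Unicode.Scalar>()

for line in sampleLines.split(separator: "\n") {
    // Drop the trailing comment, keep the code point field, trim whitespace.
    let code = String(line.split(separator: "#")[0].split(separator: ";")[0])
        .trimmingCharacters(in: .whitespaces)
    let bounds = code.components(separatedBy: "..")
    guard let lower = UInt32(bounds.first ?? "", radix: 16),
          let upper = UInt32(bounds.last ?? "", radix: 16) else { continue }
    for value in lower...upper {
        if let scalar = Unicode.Scalar(value) { emojiScalars.insert(scalar) }
    }
}

// Membership test instead of a regular expression:
print("😀".unicodeScalars.allSatisfy(emojiScalars.contains)) // true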

How can I match emoji with an R regex?

I am converting the encoding to UTF-8 so that each emoji's UTF-8 value can be compared with the UTF-8 values of all the emoji in the remoji library. I am using the stringr library to find the positions of emoji in the vector; you are free to use grep or any other function.

1st Method:

library(stringr)
xvect = c('😄', 'no', '🎉', '😊', 'no', '🚀')

Encoding(xvect) <- "UTF-8"

which(str_detect(xvect,"[^[:ascii:]]")==T)
# [1] 1 3 4 6

Here positions 1, 3, 4 and 6 contain emoji characters.

Edit:

2nd Method:
Install the remoji package from GitHub using devtools with the commands below. Since we have already converted the items in the vector to UTF-8, we can now compare them with the UTF-8 values of all the emoji present in the remoji library. Use trimws to remove whitespace.

install.packages("devtools")

devtools::install_github("richfitz/remoji")
library(remoji)
emj <- emoji(list_emoji(), TRUE)
xvect %in% trimws(emj)

Output:

which(xvect %in% trimws(emj))
# [1] 1 3 4 6

Neither of the above methods is foolproof. The first assumes that there are no non-ASCII characters other than emoji in the vector, and the second relies on the emoji information shipped with the remoji library; where a certain emoji is not present in the library, the last command may yield FALSE instead of TRUE.

Final Edit:

As per the discussion between the OP (@MichaelChirico) and @SymbolixAU (thanks to both of them), the problem seems to have been a small typo: the escape needs a capital U. The new regex is xvect[grepl('[\U{1F300}-\U{1F6FF}]', xvect)]. The range in the character class runs from U+1F300 to U+1F6FF; one can of course change this range in cases where an emoji lies outside it. This may not be the complete list, and over time these ranges may keep growing or changing.

How to transcode emoji in iOS?

Before saving the comment to the server, use the code below:

NSData *dataForEmoji = [comment dataUsingEncoding:NSNonLossyASCIIStringEncoding];
NSString *encodedValue = [[NSString alloc] initWithData:dataForEmoji encoding:NSUTF8StringEncoding];

Save encodedValue to your server.

When you retrieve it, use the code below before you display it:

NSString *emojiText = [NSString stringWithCString:[textFromServer cStringUsingEncoding:NSUTF8StringEncoding]
                                          encoding:NSNonLossyASCIIStringEncoding];
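
If you happen to be working in Swift rather than Objective-C, a rough sketch of the same round trip might look like this (the variable names are mine, not the original answer's code):

import Foundation

let comment = "Nice shot 😄"

// Encode: escape the emoji into ASCII-safe \uXXXX sequences before saving.
if let asciiData = comment.data(using: .nonLossyASCII),
   let encodedValue = String(data: asciiData, encoding: .utf8) {
    // encodedValue is what you would store on the server, e.g. "Nice shot \ud83d\ude04"

    // Decode: turn the escaped text back into emoji before displaying it.
    if let utf8Data = encodedValue.data(using: .utf8),
       let emojiText = String(data: utf8Data, encoding: .nonLossyASCII) {
        print(emojiText) // "Nice shot 😄"
    }
}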

How could I get Apple emoji name instead of Unicode name?

The first one is the Unicode name, though the correct name is:

SMILING FACE WITH OPEN MOUTH AND SMILING EYES

The fact that it's uppercase matters. It's a Unicode identifier. It's permanent and it's unique. (It really is permanent: even if they misspell a word, as with "BRAKCET" in "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET", that name is forever.)

The second name is the "Apple Name." These are localized names. On Mac, the English version is stored in:

/System/Library/PrivateFrameworks/CoreEmoji.framework/Versions/A/Resources/en.lproj/AppleName.strings

You can dump this file with plutil, or read it using PropertyListDecoder.

$ plutil -p AppleName.strings
{
"〰" => "wavy dash"
"‼️" => "red double exclamation mark"
"⁉️" => "red exclamation mark and question mark"
"*️⃣" => "keycap asterisk"
"#️⃣" => "number sign"
"〽️" => "part alternation mark"
"©" => "copyright sign"
...
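
A minimal Swift sketch of the PropertyListDecoder route mentioned above (assuming the file still lives at that path, which can change between macOS versions):

import Foundation

let path = "/System/Library/PrivateFrameworks/CoreEmoji.framework/Versions/A/Resources/en.lproj/AppleName.strings"

do {
    let data = try Data(contentsOf: URL(fileURLWithPath: path))
    // The .strings file is a property list mapping each emoji to its Apple name.
    let appleNames = try PropertyListDecoder().decode([String: String].self, from: data)
    print(appleNames["〰"] ?? "no entry") // "wavy dash"
} catch {
    print("Could not read AppleName.strings: \(error)")
}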

That said, unless you absolutely need to match Apple, I'd recommend the CLDR (Common Locale Data Repository) annotation short name. That's the Unicode source for localized names. They're not promised to be unique, though. Their biggest purpose is for supporting text-to-speech.

For the current list in XML, it's most convenient on GitHub. Or you can browse the v37 table or download the raw data.

iOS 5.0 Check if NSString contains Emoji characters

I was able to detect all emoji in the iOS 5 and iOS 6 emoji keyboards using the following method:
https://gist.github.com/4146056
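
That gist targets the iOS 5 and iOS 6 keyboards specifically. On a reasonably recent Swift runtime, a rough alternative (a sketch, not the gist's code) is to inspect the scalar properties directly:

import Foundation

// Rough check: true if any scalar in the string is emoji.
// The value > 0x238C guard is a heuristic (an assumption on my part) that skips
// characters such as digits, '#' and '*', which carry the Emoji property but
// normally render as text.
func containsEmoji(_ string: String) -> Bool {
    string.unicodeScalars.contains { scalar in
        scalar.properties.isEmojiPresentation ||
            (scalar.properties.isEmoji && scalar.value > 0x238C)
    }
}

print(containsEmoji("Hello"))    // false
print(containsEmoji("Hello 😄")) // true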

Filter out multiple emojis from Unicode text in Python

A small change to emoji_pattern will do the job:

import re

emoji_pattern = re.compile(u"(["                      # .* removed
                           u"\U0001F600-\U0001F64F"   # emoticons
                           u"\U0001F300-\U0001F5FF"   # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"   # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"   # flags (iOS)
                           "])", flags=re.UNICODE)    # + removed

for sent in [sent1, sent2, sent3]:
    print(''.join(re.findall(emoji_pattern, sent)))


Inconsistently handled emoji sequences on iOS?

After some investigation, it appears that neither is wrong, although the method implemented in Swift 4 is more true to recommendations.

As per the Unicode standard (emphasis mine):

The representative glyph for a single regional indicator symbol is just a dotted box containing a capital Latin letter. The Unicode Standard does not prescribe how the pairs of regional indicator symbols should be rendered. However, current industry practice widely interprets pairs of regional indicator symbols as representing a flag associated with the corresponding ISO 3166 region code.

– The Unicode Standard, Version 10.0 – Core Specification, page 836.

Then, on the following page:

Conformance to the Unicode Standard does not require conformance to UTS #51. However, the interpretation and display of pairs of regional indicator symbols as specified in UTS #51 is now widely deployed, so in practice it is not advisable to attempt to interpret pairs of regional indicator symbols as representing anything other than an emoji flag.

– The Unicode Standard, Version 10.0 – Core Specification, page 837.

From this I gather that while the standard doesn't set any rules for how the flags should be rendered, the chosen path for handling the rendering of invalid flag sequences in iOS and macOS is inadvisable. So, even if there exists a valid flag further in the sequence, the renderer should always consider two consecutive regional indicator symbols as a flag.

Finally, taking a look at UTS #51, or "the emoji specification":

Options for presenting an emoji_flag_sequence for which a system does not have a specific flag or other glyph include:

  • Displaying each REGIONAL INDICATOR symbol separately as a letter in a dotted square, as shown in the Unicode charts. This provides information about the specific region indicated, but may be mystifying to some users.

  • For all unsupported REGIONAL INDICATOR pairs, displaying the same “missing flag” glyph, such as the image shown below. This would indicate that the supported pair was intended to represent the flag of some region, without indicating which one.

Missing flag glyph.

– Unicode Technical Standard #51, revision 12, Annex B.

So, in conclusion, best practice would be representing invalid flag sequences as a pair of regional indicator symbols – exactly as is the case with Character objects in Swift 4 strings – or as a generic missing flag glyph.
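
A quick way to see the Swift 4 behaviour for yourself (the scalars below spell out a hypothetical U, S, A regional indicator sequence):

// Three regional indicator symbols in a row: U (U+1F1FA), S (U+1F1F8), A (U+1F1E6).
let sequence = "\u{1F1FA}\u{1F1F8}\u{1F1E6}"

// Grapheme breaking pairs regional indicators two at a time, so the first two
// scalars form one Character (rendered as the US flag) and the third stands alone.
print(sequence.unicodeScalars.count) // 3
print(sequence.count)                // 2
for character in sequence {
    print(character)                 // 🇺🇸, then 🇦 (a lone regional indicator)
}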

Unable to match a sample emoji. What could be the reason for this?

The Problem

JavaScript has had a Unicode problem for a while. Unicode codepoints that lie outside the range U+0000...U+FFFF are known as astral codepoints, and are problematic because they are not easy to match via a regex:

// `🌍` is an astral symbol because its codepoint value
// of U+1F30D is outside the range U+0000...U+FFFF
// Astral symbols do not work with regular expressions as expected
var regex = /^[bc🌍]$/;
console.log(
regex.test('a'), // false
regex.test('b'), // true
regex.test('c'), // true
regex.test('🌍') // false (!)
);
console.log('🌍'.match(regex)); // null (!)

The reason is that this one astral codepoint is actually made up of two parts, or more precisely of two "code units", and these two code units combine to form the character.

console.log("\u1F30D")      // Doesn't work
console.log("\uD83C\uDF0D") // br>

The astral symbol 🌍 is actually made up of two code units: 🌍 = U+D83C + U+DF0D!

So if you wanted to match this astral symbol, you would have to use the following regex and matcher:

var regex = /^([bc]|\uD83C\uDF0D)$/;
console.log(
regex.test('a'), // false
regex.test('b'), // true
regex.test('c'), // true
regex.test('\uD83C\uDF0D') // true
);
console.log('\uD83C\uDF0D'.match(regex)); // { 0: "🌍", 1: "🌍", index: 0 ... }

All astral symbols have this decomposition. Surprised? Well perhaps you should be – this doesn't happen often! It only happens with astral codepoints which are rarely used. Most codepoints used by myself and others across the world are not astral – they're in the range U+0000...U+FFFF – so we don't typically see this issue. Emojis are a new exception to this rule – all emojis are astral symbols and thanks to social media, their usage is becoming increasingly popular across the world.

Using code units like this is an implementation detail of Unicode that was unfortunately exposed to JavaScript programmers. It can easily cause confusion for programmers, as it is unclear whether to use the character verbatim (🌍) or the code unit decomposition (U+D83C + U+DF0D) whenever string functions like match, test, ... are used, or whenever regexes and string literals are written. However, language designers and implementers are working hard to improve things.

The Solution

A recent addition to ECMAScript 6 (ES6) was the introduction of a u flag to regular expression matching. This allows you to match by codepoint, rather than matching by code units (default).

var regex = /^[bc🌍]$/u; // <-- u flag added
console.log(
regex.test('a'), // false
regex.test('b'), // true
regex.test('c'), // true
regex.test('🌍') // true <-- it now works!
);

By using the u flag, you don't have to worry about whether or not your codepoint is an astral codepoint, and you don't have to convert to and from code units. The u flag makes regular expressions work the intuitive way - even for emojis! However, not every version of Node.js and not every browser supports this new feature. To support all environments, you could use a library like regenerate.


