Detecting *All* Emojis

How to extract all the emojis from text?

You can use the emoji library. To check whether a single code point is an emoji, test whether it is contained in emoji.UNICODE_EMOJI. (That dictionary exists in emoji 1.x; version 2.0 removed it in favor of emoji.EMOJI_DATA.)

import emoji

def extract_emojis(s):
    # Keep only the characters that are single-code-point emoji ('en' table, emoji 1.x API).
    return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])
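
One caveat with a per-code-point filter like this, sketched with the standard library only: many emoji are sequences of several code points joined by U+200D (zero width joiner), so a character-by-character check splits them apart. The range test in `looks_like_emoji` is a rough stand-in made up for illustration, not the emoji library's data, and the other names are hypothetical too.

```python
# A family emoji written with escapes: man, ZWJ, woman, ZWJ, girl.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467"


def looks_like_emoji(ch):
    # Hypothetical stand-in for a real emoji table: a rough pictograph range.
    return 0x1F300 <= ord(ch) <= 0x1FAFF


def extract_per_code_point(text):
    # Same shape as extract_emojis: keep the code points that pass the check.
    return "".join(ch for ch in text if looks_like_emoji(ch))


def code_points(text):
    # Render each code point as U+XXXX for inspection.
    return ["U+%04X" % ord(ch) for ch in text]
```

Calling `code_points(extract_per_code_point(FAMILY))` returns three separate people and no joiners, so the single family glyph is lost. If whole sequences matter, a sequence-aware matcher is the safer choice.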

What is the regex to extract all the emojis from a string?


The PDF that you just mentioned gives the range 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So let's say I want to capture any character lying within this range. What do I do now?

Okay, but I will just note that the emoji in your question are outside that range! :-)
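
For comparison, in a language whose strings are sequences of code points rather than UTF-16 units, the range can be written directly as a character class. Below is a minimal Python 3 sketch (Python only because the first answer already uses it; `PICTOGRAPHS` and `find_pictographs` are names chosen for illustration). It also shows the point just made: U+1F606 and U+1F61B, the code points used in the Java test string further down, fall outside 1F300 to 1F5FF.

```python
import re

# Miscellaneous Symbols and Pictographs: U+1F300 through U+1F5FF.
PICTOGRAPHS = re.compile("[\U0001F300-\U0001F5FF]")


def find_pictographs(text):
    # Return every character of the text that lies in the block.
    return PICTOGRAPHS.findall(text)
```

`find_pictographs("\U0001F606 \U0001F61B")` returns an empty list, because both code points sit in the Emoticons block that starts at U+1F600.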

The fact that these are above 0xFFFF complicates things, because Java strings store UTF-16. So we can't just use one simple character class for it. We're going to have surrogate pairs. (More: http://www.unicode.org/faq/utf_bom.html)

U+1F300 in UTF-16 ends up being the pair \uD83C\uDF00; U+1F5FF ends up being \uD83D\uDDFF. Note that the first char went up (from \uD83C to \uD83D), so the range crosses at least one boundary between high surrogates. That means we have to know which ranges of surrogate pairs we're looking for.
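
The arithmetic behind those two pairs is short enough to sketch. Subtract 0x10000 from the code point, then put the top 10 bits of the result into the high surrogate (base 0xD800) and the bottom 10 bits into the low surrogate (base 0xDC00). A Python version, with `to_surrogates` as an illustrative name:

```python
def to_surrogates(code_point):
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low
```

For U+1F300 the offset is 0xF300, which splits into 0x3C and 0x300, giving 0xD83C and 0xDF00.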

Not being steeped in knowledge about the inner workings of UTF-16, I wrote a program to find out (source at the end — I'd double-check it if I were you, rather than trusting me). It tells me we're looking for \uD83C followed by anything in the range \uDF00-\uDFFF (inclusive), or \uD83D followed by anything in the range \uDC00-\uDDFF (inclusive).

So armed with that knowledge, in theory we could now write a pattern:

// This is wrong, keep reading
Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");

That's an alternation of two non-capturing groups, the first group for the pairs starting with \uD83C, and the second group for the pairs starting with \uD83D.

But that fails (doesn't find anything). I'm fairly sure it's because we're trying to specify half of a surrogate pair in various places:

Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
// Half of a pair --------------^------^------^-----------^------^------^

We can't just split up surrogate pairs like that; they're called surrogate pairs for a reason. :-)

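Consequently, I don't think a pattern built from half-pairs can work. (Java's Pattern matches whole code points, so a class written over code points, such as [\x{1F300}-\x{1F5FF}] in Java 7 and later, sidesteps the problem.) The approach here instead is to search through char arrays.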

char arrays hold UTF-16 values, so we can find those half-pairs in the data if we look for them the hard way:

String s = new StringBuilder()
    .append("Thats a nice joke ")
    .appendCodePoint(0x1F606)
    .appendCodePoint(0x1F606)
    .appendCodePoint(0x1F606)
    .append(" ")
    .appendCodePoint(0x1F61B)
    .toString();
char[] chars = s.toCharArray();
int index;
char ch1;
char ch2;

index = 0;
while (index < chars.length - 1) { // -1 because we're looking for two-char-long things
    ch1 = chars[index];
    if ((int)ch1 == 0xD83C) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDF00 && (int)ch2 <= 0xDFFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    else if ((int)ch1 == 0xD83D) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDC00 && (int)ch2 <= 0xDDFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    ++index;
}

Obviously that's just debug-level code, but it does the job. (In your given string, with its emoji, of course it won't find anything as they're outside the range. But if you change the upper bound on the second pair to 0xDEFF instead of 0xDDFF, it will. No idea if that would also include non-emojis, though.)
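
The same scan can be reproduced outside Java to check the indexes. The sketch below (Python, standard library only; `utf16_units` and `find_pairs` are illustrative names) encodes the text to UTF-16 code units and walks them the way the loop above does. The upper bound of the second range is a parameter, so both 0xDDFF and 0xDEFF can be tried.

```python
import struct


def utf16_units(text):
    # Encode to big-endian UTF-16 and unpack into a list of 16-bit code units.
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def find_pairs(text, second_upper=0xDDFF):
    # Return the UTF-16 indexes where a matching surrogate pair starts.
    units = utf16_units(text)
    found = []
    index = 0
    while index < len(units) - 1:
        high, low = units[index], units[index + 1]
        if (high == 0xD83C and 0xDF00 <= low <= 0xDFFF) or (
            high == 0xD83D and 0xDC00 <= low <= second_upper
        ):
            found.append(index)
            index += 2
            continue
        index += 1
    return found


# Same test string as the Java snippet, built from escapes.
TEXT = "Thats a nice joke " + "\U0001F606" * 3 + " " + "\U0001F61B"
```

With the default bound, `find_pairs(TEXT)` finds nothing. With `second_upper=0xDEFF` it reports the four emoji at UTF-16 indexes 18, 20, 22 and 25.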


Source of my program to find out what the surrogate ranges were:

public class FindRanges {

    public static void main(String[] args) {
        char last0 = '\0';
        char last1 = '\0';
        for (int x = 0x1F300; x <= 0x1F5FF; ++x) {
            char[] chars = new StringBuilder().appendCodePoint(x).toString().toCharArray();
            if (chars[0] != last0) {
                if (last0 != '\0') {
                    System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
                }
                System.out.print("\\u" + Integer.toHexString((int)chars[0]).toUpperCase() + " \\u" + Integer.toHexString((int)chars[1]).toUpperCase());
                last0 = chars[0];
            }
            last1 = chars[1];
        }
        if (last0 != '\0') {
            System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
        }
    }
}

Output:

\uD83C \uDF00-\uDFFF
\uD83D \uDC00-\uDDFF
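
As a cross-check on that output, the ranges can also be computed from the surrogate formula rather than by round-tripping each code point through a StringBuilder. A Python sketch under that assumption (`surrogate_ranges` and `format_ranges` are illustrative names, and the inputs are assumed to lie above U+FFFF):

```python
def surrogate_ranges(first, last):
    """Group the code points first..last (both above U+FFFF) by high surrogate.

    Returns a list of (high, lowest_low, highest_low) tuples.
    """
    ranges = []
    for code_point in range(first, last + 1):
        offset = code_point - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        if ranges and ranges[-1][0] == high:
            # Same high surrogate: extend the current range.
            ranges[-1] = (high, ranges[-1][1], low)
        else:
            ranges.append((high, low, low))
    return ranges


def format_ranges(ranges):
    # Mimic the "\uD83C \uDF00-\uDFFF" layout printed by FindRanges.
    return ["\\u%04X \\u%04X-\\u%04X" % r for r in ranges]
```

Extending the upper end to 0x1F6FF keeps the same two high surrogates and only widens the second low range to \uDEFF, which is consistent with the 0xDEFF bound mentioned earlier.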

Regex for detecting emojis with condition

So, basically what you want is the following:

// detects if string consists of 0 to 5 emojis  
const regex = /^(emoji){0,5}$/;

Now the only missing part is the actual emoji-detection inside that regex. We can extract this out of this emoji-regex library you referenced:

const emojiRegex = require('emoji-regex/RGI_Emoji.js');

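// Drop the "g" flag: a global regex keeps lastIndex between test() calls,
// which can make an anchored pattern fail on alternate calls.
const regex = new RegExp("^(" + emojiRegex().source + "){0,5}$", emojiRegex().flags.replace("g", ""));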

Didn't test, but something like this should work.
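
The wrapping idea itself (take an emoji pattern's source, group it, bound the repetition, require the whole string to match) can be sketched without the library. In the Python below, `EMOJI_SOURCE` is a deliberately crude single-code-point stand-in assumed purely for illustration. It is not the emoji-regex pattern and it misses multi-code-point emoji.

```python
import re

# Crude stand-in for a real emoji pattern (assumption for illustration only).
EMOJI_SOURCE = "[\U0001F300-\U0001FAFF]"

# Group the source and allow 0 to 5 repetitions; fullmatch() supplies the anchoring.
UP_TO_FIVE = re.compile("(?:" + EMOJI_SOURCE + "){0,5}")


def is_zero_to_five_emojis(text):
    # True when the whole string is made of at most five matches of the pattern.
    return UP_TO_FIVE.fullmatch(text) is not None
```

Swapping in a more complete emoji pattern for `EMOJI_SOURCE` leaves the rest unchanged, which is the same design as concatenating `emojiRegex().source` in the JavaScript version.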

Detect emoticons in string

To follow on from Johannes' answer, I found a solution on a forum somewhere. This regex does the trick :)

    $unicodeRegexp = '([*#0-9](?>\\xEF\\xB8\\x8F)?\\xE2\\x83\\xA3|\\xC2[\\xA9\\xAE]|\\xE2..(\\xF0\\x9F\\x8F[\\xBB-\\xBF])?(?>\\xEF\\xB8\\x8F)?|\\xE3(?>\\x80[\\xB0\\xBD]|\\x8A[\\x97\\x99])(?>\\xEF\\xB8\\x8F)?|\\xF0\\x9F(?>[\\x80-\\x86].(?>\\xEF\\xB8\\x8F)?|\\x87.\\xF0\\x9F\\x87.|..(\\xF0\\x9F\\x8F[\\xBB-\\xBF])?|(((?<zwj>\\xE2\\x80\\x8D)\\xE2\\x9D\\xA4\\xEF\\xB8\\x8F\k<zwj>\\xF0\\x9F..(\k<zwj>\\xF0\\x9F\\x91.)?|(\\xE2\\x80\\x8D\\xF0\\x9F\\x91.){2,3}))?))';
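
That pattern is easier to read once you notice that it matches UTF-8 bytes rather than characters. A small Python sketch (`utf8_hex` is an illustrative helper name) that spells out the byte sequences of a few code points the pattern mentions:

```python
def utf8_hex(code_point):
    # UTF-8 bytes of one code point, written the way the pattern spells them.
    return "".join("\\x%02X" % b for b in chr(code_point).encode("utf-8"))


# Code points whose byte sequences appear literally in the pattern.
SAMPLES = {
    "variation selector-16 (U+FE0F)": utf8_hex(0xFE0F),
    "combining enclosing keycap (U+20E3)": utf8_hex(0x20E3),
    "zero width joiner (U+200D)": utf8_hex(0x200D),
    "copyright sign (U+00A9)": utf8_hex(0x00A9),
    "skin tone modifier (U+1F3FB)": utf8_hex(0x1F3FB),
}
```

For example, `utf8_hex(0x200D)` gives `\xE2\x80\x8D`, the sequence captured by the `zwj` group, and every four-byte sequence starting with `\xF0\x9F` is a code point in the U+1F000 to U+1FFFF range.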

How to detect emojis in a String in Flutter using Dart?

Was looking for the same thing.

Found this newly published package:
https://pub.dartlang.org/packages/flutter_emoji

MIT license.

Looking at the source code it seems like this is the regex used:

/// A tweak regexp to pass all Emoji Unicode 11.0
/// TODO: improve this version, since it does not match the graphical bytes.
static final RegExp REGEX_EMOJI = RegExp(r'(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])');

I hope this information can be helpful.
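
To see what that pattern accepts, it helps to translate the surrogate alternatives back into code points: \ud83c, \ud83d and \ud83e followed by a low surrogate cover U+1F000 through U+1FBFF. The Python sketch below is my own translation under that reading, not code from the package. It shows that the broad \u2000-\u3300 range also matches characters that are not emoji, such as an en dash or an arrow.

```python
import re

# Code-point translation of the pattern quoted above (an approximation that
# assumes well-formed text, so each high surrogate is followed by a low one).
EMOJI_LIKE = re.compile("[\u00a9\u00ae\u2000-\u3300\U0001F000-\U0001FBFF]")


def emoji_like(text):
    # Return the characters the translated pattern accepts.
    return EMOJI_LIKE.findall(text)
```

`emoji_like("a\u2013b")` returns the en dash, so treat a match from this pattern as "possibly an emoji" rather than a definite one.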


