How to extract all the emojis from text?
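You can use the emoji library. You can check whether a single codepoint is an emoji codepoint by checking whether it is contained in emoji.UNICODE_EMOJI (this attribute exists in emoji releases before 2.0; later releases replaced it with emoji.EMOJI_DATA).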
import emoji

def extract_emojis(s):
    return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])
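One caveat with the per-character check: some emoji are made of several code points (❤️ is U+2764 followed by the variation selector U+FE0F), so testing one character at a time can only ever return the bare ❤. A minimal sketch of a longest-match scan is below. SAMPLE_EMOJI is a hypothetical stand-in for the library's table and extract_emoji_sequences is an illustrative name; neither is part of the emoji package.

```python
def extract_emoji_sequences(text, emoji_set):
    """Greedy longest-match scan for emoji that may span several code points."""
    max_len = max(map(len, emoji_set), default=0)
    found = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "❤️" wins over the bare "❤"
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in emoji_set:
                found.append(candidate)
                i += size
                break
        else:
            # No emoji starts at this position; move on one character
            i += 1
    return found


# Hypothetical sample table, for illustration only
SAMPLE_EMOJI = {"\u2764", "\u2764\ufe0f", "\U0001F606"}

print(extract_emoji_sequences("Rocky Mountain \u2764\ufe0f", SAMPLE_EMOJI))
```

With a real table you would pass the library's set of emoji strings instead of SAMPLE_EMOJI.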
Extract emoji from series of text
Rather than looping over the entire dataset yourself, you can apply the function to the column with apply, passing either a lambda or a named function.
import pandas as pd
import emoji
df = pd.DataFrame([['@philip '],
                   ['Rocky Mountain ❤️']], columns=['comments'])
Using a lambda:
df['emojis'] = df['comments'].apply(lambda row: ''.join(c for c in row if c in emoji.UNICODE_EMOJI['en']))
df
Using apply with a named function:
def extract_emojis(text):
    return ''.join(c for c in text if c in emoji.UNICODE_EMOJI['en'])
df['emoji_apply'] = df['comments'].apply(extract_emojis)
df
Output:
comments            emojis
@philip
Rocky Mountain ❤️   ❤
How to extract emojis from text and then add them to a new column?
Something along the following line should work for your purposes:
import pandas as pd
import emoji as emj
EMOJIS = emj.UNICODE_EMOJI["en"]
df = pd.DataFrame(
    data={
        "text": [
            "This is good ",
            "Loving you so much ❤️",
            "You make me sad! ",
        ]
    }
)
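def extract_emoji(df):
    df["emoji"] = ""
    for index, row in df.iterrows():
        text, found = row["text"], ""
        for emoji in EMOJIS:
            if emoji in text:
                text = text.replace(emoji, "")
                found += emoji
        # Write back with .at: rows yielded by iterrows() may be copies
        df.at[index, "text"] = text
        df.at[index, "emoji"] = found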
extract_emoji(df)
print(df.to_string())
text emoji
0 This is good
1 Loving you so much ️ ❤️
2 You make me sad!
Note that extract_emoji modifies the DataFrame in place.
What is the regex to extract all the emojis from a string?
The PDF that you just mentioned says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So let's say I want to capture any character lying within this range. What do I do now?
Okay, but I will just note that the emoji in your question are outside that range! :-)
The fact that these are above 0xFFFF complicates things, because Java strings store UTF-16. So we can't just use one simple character class for it. We're going to have surrogate pairs. (More: http://www.unicode.org/faq/utf_bom.html)
U+1F300 in UTF-16 ends up being the pair \uD83C\uDF00; U+1F5FF ends up being \uD83D\uDDFF. Note that the first character went up, so we cross at least one boundary. So we have to know what ranges of surrogate pairs we're looking for.
Not being steeped in knowledge about the inner workings of UTF-16, I wrote a program to find out (source at the end; I'd double-check it if I were you, rather than trusting me). It tells me we're looking for \uD83C followed by anything in the range \uDF00-\uDFFF (inclusive), or \uD83D followed by anything in the range \uDC00-\uDDFF (inclusive).
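The same ranges can be checked by hand with the standard UTF-16 arithmetic: subtract 0x10000, then put the top 10 bits on 0xD800 and the low 10 bits on 0xDC00. A short Python sketch (to_surrogates is just an illustrative helper name, not something from the answer above):

```python
def to_surrogates(cp):
    """Return the (high, low) UTF-16 surrogate pair for a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    # Top 10 bits select the high surrogate, low 10 bits the low surrogate
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# First and last code point under each high surrogate in the block
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F5FF):
    high, low = to_surrogates(cp)
    print(f"U+{cp:X} -> \\u{high:04X} \\u{low:04X}")
```

This prints the four boundary pairs, which line up with the two ranges quoted above.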
So armed with that knowledge, in theory we could now write a pattern:
// This is wrong, keep reading
Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
That's an alternation of two non-capturing groups, the first group for the pairs starting with \uD83C, and the second group for the pairs starting with \uD83D.
But that fails (doesn't find anything). I'm fairly sure it's because we're trying to specify half of a surrogate pair in various places:
Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
// Half of a pair --------------^------^------^-----------^------^------^
We can't just split up surrogate pairs like that, they're called surrogate pairs for a reason. :-)
Consequently, a pattern that splits the pairs like this won't work. Java's regex engine does match whole supplementary code points (for example the class [\x{1F300}-\x{1F5FF}] on Java 7 and later), but the approach here is to search through char arrays instead.
char arrays hold UTF-16 values, so we can find those half-pairs in the data if we look for them the hard way:
String s = new StringBuilder()
    .append("Thats a nice joke ")
    .appendCodePoint(0x1F606)
    .appendCodePoint(0x1F606)
    .appendCodePoint(0x1F606)
    .append(" ")
    .appendCodePoint(0x1F61B)
    .toString();
char[] chars = s.toCharArray();
int index;
char ch1;
char ch2;

index = 0;
while (index < chars.length - 1) { // -1 because we're looking for two-char-long things
    ch1 = chars[index];
    if ((int)ch1 == 0xD83C) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDF00 && (int)ch2 <= 0xDFFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    else if ((int)ch1 == 0xD83D) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDC00 && (int)ch2 <= 0xDDFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    ++index;
}
Obviously that's just debug-level code, but it does the job. (In your given string, with its emoji, of course it won't find anything as they're outside the range. But if you change the upper bound on the second pair to 0xDEFF instead of 0xDDFF, it will. No idea if that would also include non-emojis, though.)
Source of my program to find out what the surrogate ranges were:
public class FindRanges {
    public static void main(String[] args) {
        char last0 = '\0';
        char last1 = '\0';
        for (int x = 0x1F300; x <= 0x1F5FF; ++x) {
            char[] chars = new StringBuilder().appendCodePoint(x).toString().toCharArray();
            if (chars[0] != last0) {
                if (last0 != '\0') {
                    System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
                }
                System.out.print("\\u" + Integer.toHexString((int)chars[0]).toUpperCase() + " \\u" + Integer.toHexString((int)chars[1]).toUpperCase());
                last0 = chars[0];
            }
            last1 = chars[1];
        }
        if (last0 != '\0') {
            System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
        }
    }
}
Output:
\uD83C \uDF00-\uDFFF
\uD83D \uDC00-\uDDFF
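For comparison, Python 3 strings are indexed by code point rather than by UTF-16 unit, so the same block can be written there as an ordinary character class with no surrogate handling. A small sketch (the sample string is made up for illustration):

```python
import re

# Miscellaneous Symbols and Pictographs, written directly as code points
PICTOGRAPHS = re.compile("[\U0001F300-\U0001F5FF]")

# U+1F606 lies outside the block, U+1F300 is its first code point
text = "Nice joke \U0001F606 and a cyclone \U0001F300"

matches = PICTOGRAPHS.findall(text)
print([f"U+{ord(ch):X}" for ch in matches])
```

Only U+1F300 is reported; U+1F606 falls outside 1F300–1F5FF, which is the same reason the char-array scan above finds nothing in the asker's string.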
How to extract text and emojis from a string?
Add ['en'] to emoji.UNICODE_EMOJI:
import emoji

text = "#samplesenti @emojitweets i ❤❤❤ sentiment &quot; analysis &quot; http://senti.com/pic_01.jpg "

def extract_text_and_emoji(text=text):
    global allchars, emoji_list
    # remove all tags and links; they are not needed for sentiment analysis
    remove_keys = ("@", "http://", "&", "#")
    clean_text = " ".join(
        txt for txt in text.split() if not txt.startswith(remove_keys)
    )
    # print(clean_text)

    # set up the input: collect the characters and the emoji among them
    allchars = [ch for ch in text]
    emoji_list = [c for c in allchars if c in emoji.UNICODE_EMOJI["en"]]  # <-- HERE!

    # extract text: drop every word that contains an emoji
    clean_text = " ".join(
        [
            word
            for word in clean_text.split()
            if not any(i in word for i in emoji_list)
        ]
    )

    # extract emoji: keep only the words that contain an emoji
    clean_emoji = "".join(
        [word for word in text.split() if any(i in word for i in emoji_list)]
    )
    return (clean_text, clean_emoji)

allchars, emoji_list = 0, 0

(clean_text, clean_emoji) = extract_text_and_emoji()
print("\nAll Char:", allchars)
print("\nAll Emoji:", emoji_list)
print("\n", clean_text)
print("\n", clean_emoji)
Prints:
All Char: ['#', 's', 'a', 'm', 'p', 'l', 'e', 's', 'e', 'n', 't', 'i', ' ', '@', 'e', 'm', 'o', 'j', 'i', 't', 'w', 'e', 'e', 't', 's', ' ', 'i', ' ', '❤', '❤', '❤', ' ', 's', 'e', 'n', 't', 'i', 'm', 'e', 'n', 't', ' ', '&', 'q', 'u', 'o', 't', ';', ' ', 'a', 'n', 'a', 'l', 'y', 's', 'i', 's', ' ', '&', 'q', 'u', 'o', 't', ';', ' ', 'h', 't', 't', 'p', ':', '/', '/', 's', 'e', 'n', 't', 'i', '.', 'c', 'o', 'm', '/', 'p', 'i', 'c', '_', '0', '1', '.', 'j', 'p', 'g', ' ']
All Emoji: ['❤', '❤', '❤']
i sentiment analysis
❤❤❤
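The same split can probably be done in one pass without the globals. Below is a minimal sketch under a stated assumption: is_emoji is a caller-supplied predicate (here a hypothetical is_heart stand-in that only knows ❤), where the real code would test membership in the library's table. split_text_and_emoji is an illustrative name.

```python
def split_text_and_emoji(text, is_emoji, skip_prefixes=("@", "http://", "&", "#")):
    """Split whitespace-separated tokens into plain words and emoji tokens."""
    words, emojis = [], []
    for token in text.split():
        if any(is_emoji(ch) for ch in token):
            # Any token containing an emoji goes to the emoji side
            emojis.append(token)
        elif not token.startswith(skip_prefixes):
            # Tags, links and HTML entities are dropped from the text side
            words.append(token)
    return " ".join(words), "".join(emojis)


def is_heart(ch):
    # Hypothetical stand-in predicate, for illustration only
    return ch == "\u2764"


sample = ("#samplesenti @emojitweets i \u2764\u2764\u2764 sentiment "
          "&quot; analysis &quot; http://senti.com/pic_01.jpg ")
print(split_text_and_emoji(sample, is_heart))
```

Passing the predicate in keeps the function independent of which version of the emoji package is installed.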
Extract Unicode-Emoticons in list, Python 3.x
Emojis exist in several Unicode ranges, represented by this regex pattern:
>>> import re
>>> emoji = re.compile('[\\u203C-\\u3299\\U0001F000-\\U0001F644]')
You can use that to filter your lists:
>>> list(filter(emoji.match, ['This', 'is', 'a', 'test', 'tweet', 'with', 'two', 'emoticons', '', '⚓️']))
['', '⚓️']
N.B.: The pattern is an approximation and may capture some additional characters.
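To pull the matching characters out of a single string instead of filtering a list, findall works with the same ranges. A short sketch (EMOJI_APPROX and find_emoji are illustrative names); the second call shows the over-capture mentioned in the note, since a plain arrow and a hiragana letter also fall inside U+203C–U+3299:

```python
import re

# Same approximate ranges as the pattern above
EMOJI_APPROX = re.compile("[\u203C-\u3299\U0001F000-\U0001F644]")


def find_emoji(text):
    """Return every character in text that falls inside the approximate ranges."""
    return EMOJI_APPROX.findall(text)


# U+2693 (anchor) and U+1F600 match; the variation selector U+FE0F does not
print(find_emoji("anchor \u2693\ufe0f and grin \U0001F600"))

# Over-capture: U+2192 (arrow) and U+3042 (hiragana a) are not emoji but match
print(find_emoji("a \u2192 b \u3042"))
```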
Extracting Emojis from a dataframe
Something like the demoji library might help. Its description reads:
"Accurately find or remove emojis from a blob of text using data from the Unicode Consortium's emoji code repository."