Removing Emojis from a String in Python

removing emojis from a string in Python

I am updating my answer to follow the one by @jfs, because my previous answer failed to account for other Unicode scripts such as Latin, Greek, etc. Stack Overflow doesn't allow me to delete my previous answer, so I am updating it to match the most accepted answer to the question.

#!/usr/bin/env python
import re

text = u'This is a smiley face \U0001f602'
print(text)  # with emoji

def deEmojify(text):
    regex_pattern = re.compile(pattern="["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    return regex_pattern.sub(r'', text)

print(deEmojify(text))

This was my previous answer; do not use it.

def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
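To make the warning concrete, here is a minimal sketch of what the ASCII round-trip does to text that is not purely ASCII. The function is the one shown above; the sample string is made up for illustration.

```python
def deEmojify(inputString):
    # Drops every character that cannot be encoded as ASCII, not only emoji.
    return inputString.encode('ascii', 'ignore').decode('ascii')

# Illustrative input: "cafe" with e-acute, "Omega" with a Greek capital omega, one emoji.
sample = u'caf\u00e9 \u03a9mega \U0001F602'
cleaned = deEmojify(sample)
print(repr(cleaned))
```

The accented letter and the Greek letter disappear along with the emoji, which is why this approach is only acceptable when the input is known to be English-only ASCII text.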

Remove emoji from string doesn't work for some cases

Check out this answer; the emoji Python package seems like the best way to solve this problem.

To convert any emoji or character into its Unicode escape sequence, do this:

s = '😀'  # U+1F600
print(s.encode('unicode-escape').decode('ASCII'))

This prints \U0001f600.
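When deciding which characters to strip, it can also help to see the code point and official Unicode name of each character. The following is a small sketch using only the standard library; the helper name `describe` is made up for this example.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        ('U+{:04X}'.format(ord(ch)), unicodedata.name(ch, '<unnamed>'))
        for ch in text
    ]

print(describe('\U0001F600'))
```

The names come from the Unicode database bundled with the running Python version, so very recent emoji may show up as `<unnamed>` on older interpreters.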

Remove Emoji from multilingual Unicode text

The regex is outdated. It appears to cover Emoji defined up to Unicode 8.0 (it misses U+1F91D HANDSHAKE, which was added in Unicode 9.0). The other approach is just a very inefficient method of force-encoding to ASCII, which is rarely what you want when just removing Emoji (and which can be achieved much more easily and efficiently with text.encode('ascii', 'ignore').decode('ascii')).
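The gap is easy to demonstrate. The sketch below uses the four-range pattern from earlier on this page as a stand-in for an outdated regex (an assumption, since any hand-maintained range list behaves similarly) and shows that a newer emoji passes straight through.

```python
import re

# Stand-in for an outdated, hand-maintained emoji pattern.
old_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)

grinning = '\U0001F600'   # GRINNING FACE, inside the first range
handshake = '\U0001F91D'  # HANDSHAKE, added in Unicode 9.0

print(repr(old_pattern.sub('', 'ok ' + grinning)))   # emoji removed
print(repr(old_pattern.sub('', 'ok ' + handshake)))  # emoji survives
```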

If you need a more up-to-date regex, take one from a package that is actively trying to keep up-to-date on Emoji; it specifically supports generating such a regex:

import emoji

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)

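At the time of writing, the package is up-to-date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade along when there is a new release. Note that get_emoji_regexp() was removed in version 2.0 of the emoji package; on newer releases, emoji.replace_emoji(text, replace='') serves the same purpose.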

Demo using your sample inputs:

>>> print(remove_emoji(u'తెలంగాణ రియల్ ఎస్టేట్ '))
తెలంగాణ రియల్ ఎస్టేట్
>>> print(remove_emoji(u'Testరియల్ ఎస్టేట్ A.P&T.S. '))
Testరియల్ ఎస్టేట్ A.P&T.S.

Note that the regex works on Unicode text: for Python 2, make sure you have decoded from str to unicode first; for Python 3, from bytes to str.
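For Python 3, the decoding step might look like the sketch below. The byte string and the single-range pattern are illustrative assumptions; any str-based emoji pattern can take the pattern's place.

```python
import re

# Hypothetical raw input, e.g. read from a socket or a file opened in binary mode.
raw = b'hello \xf0\x9f\x98\x82'  # UTF-8 bytes for "hello " followed by U+1F602

text = raw.decode('utf-8')  # bytes -> str (Python 3)

# Stand-in pattern covering only the emoticons block.
pattern = re.compile('[\U0001F600-\U0001F64F]+')
print(repr(pattern.sub('', text)))
```

Applying a str pattern directly to the bytes object would raise a TypeError, so the decode has to come first.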

Emoji are complex beasts these days. The above will remove complete, valid Emoji. If you have 'incomplete' Emoji components, such as skin-tone codepoints (meant to be combined with specific Emoji only), then you'll have much more trouble removing those. The skin-tone codepoints are easy (just remove those 5 codepoints afterwards), but there is a whole host of combinations made up of innocent characters such as ♀ U+2640 FEMALE SIGN or ♂ U+2642 MALE SIGN, together with variation selectors and the U+200D ZERO WIDTH JOINER, that have specific meaning in other contexts too. You can't just regex those out unless you don't mind breaking text using Devanagari, Kannada or CJK ideographs, to name just a few examples.

That said, the following Unicode 11.0 codepoints are probably safe to remove (based on filtering the Emoji_Component Emoji-data designation):

20E3         ; combining enclosing keycap
FE0F         ; VARIATION SELECTOR-16
1F1E6..1F1FF ; regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF ; light skin tone..dark skin tone
1F9B0..1F9B3 ; red-haired..white-haired
E0020..E007F ; tag space..cancel tag

which can be removed by creating a new regex to match those:

import re
try:
    uchr = unichr  # Python 2
    import sys
    if sys.maxunicode == 0xffff:
        # narrow build, define alternative unichr encoding to surrogate pairs
        # as unichr(sys.maxunicode + 1) fails.
        def uchr(codepoint):
            return (
                unichr(codepoint) if codepoint <= sys.maxunicode else
                unichr(codepoint - 0x010000 >> 10 | 0xD800) +
                unichr(codepoint & 0x3FF | 0xDC00)
            )
except NameError:
    uchr = chr  # Python 3

# Unicode 11.0 Emoji Component map (deemed safe to remove)
_removable_emoji_components = (
    (0x20E3, 0xFE0F),             # combining enclosing keycap, VARIATION SELECTOR-16
    range(0x1F1E6, 0x1F1FF + 1),  # regional indicator symbol letter a..regional indicator symbol letter z
    range(0x1F3FB, 0x1F3FF + 1),  # light skin tone..dark skin tone
    range(0x1F9B0, 0x1F9B3 + 1),  # red-haired..white-haired
    range(0xE0020, 0xE007F + 1),  # tag space..cancel tag
)
emoji_components = re.compile(u'({})'.format(u'|'.join([
    re.escape(uchr(c)) for r in _removable_emoji_components for c in r])),
    flags=re.UNICODE)

then update the above remove_emoji() function to use it:

def remove_emoji(text, remove_components=False):
    cleaned = emoji.get_emoji_regexp().sub(u'', text)
    if remove_components:
        cleaned = emoji_components.sub(u'', cleaned)
    return cleaned
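If Python 2 and narrow builds are not a concern, the same set of code points can be written more compactly as a single character class. This is a Python 3-only sketch, not the answer's own code; the name `emoji_components_py3` is made up for this example.

```python
import re

# Same code points as the component map above, written as one character class.
emoji_components_py3 = re.compile(
    '['
    '\u20E3'                 # combining enclosing keycap
    '\uFE0F'                 # VARIATION SELECTOR-16
    '\U0001F1E6-\U0001F1FF'  # regional indicator symbols
    '\U0001F3FB-\U0001F3FF'  # skin tone modifiers
    '\U0001F9B0-\U0001F9B3'  # hair components
    '\U000E0020-\U000E007F'  # tag characters
    ']'
)

# Illustrative input with a stray skin tone modifier and a variation selector.
sample = 'thumbs\U0001F3FD up\uFE0F'
print(repr(emoji_components_py3.sub('', sample)))
```

U+200D ZERO WIDTH JOINER is deliberately left out of the class, in line with the caution above about scripts that rely on it.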

Remove emojis and @users from a list in Python and punctuation, NLP problem, and my emoji function does not work

I'm drawing on some other SO answers here:

  • removing textual emojis: https://stackoverflow.com/a/61758471/42346
  • removing graphical emojis: https://stackoverflow.com/a/50602709/42346

This will also remove any Twitter username wherever it appears in the string.

import re

import emoji
import spacy
import stop_words

nlp = spacy.load('en_core_web_sm')

stopwords = [w.lower() for w in stop_words.get_stop_words('en')]

# verbose-mode pattern for textual emoticons; it only works with re.VERBOSE
emoticon_string = r"""
(?:
  [<>]?
  [:;=8]                      # eyes
  [\-o\*\']?                  # optional nose
  [\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth
  |
  [\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth
  [\-o\*\']?                  # optional nose
  [:;=8]                      # eyes
  [<>]?
)"""

def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text)

def sanitize(string):
    """ Sanitize one string """

    # remove graphical emoji
    string = give_emoji_free_text(string)

    # remove textual emoji (re.VERBOSE makes the regex ignore the
    # whitespace and comments embedded in the pattern)
    string = re.sub(emoticon_string, '', string, flags=re.VERBOSE)

    # normalize to lowercase
    string = string.lower()

    # spacy tokenizer
    string_split = [token.text for token in nlp(string)]

    # in case the string is empty
    if not string_split:
        return ''

    # join back to string
    string = ' '.join(string_split)

    # remove user
    # assuming user has @ in front
    string = re.sub(r"""(?:@[\w_]+)""", '', string)

    # remove ", :, !, @ and #
    for punc in '":!@#':
        string = string.replace(punc, '')

    # remove 't.co/' links (the ':' after the scheme was already stripped above)
    string = re.sub(r'https?//t\.co/[^\s]+', '', string, flags=re.MULTILINE)

    # removing stop words
    string = ' '.join([w for w in string.split() if w not in stopwords])

    return string
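The two regex steps can be tried in isolation without spaCy or the emoji package. The sketch below is a stripped-down illustration, not the full pipeline; the function name and the sample tweet are made up, and the mention pattern assumes usernames are "@" followed by word characters.

```python
import re

# Whitespace and "# ..." comments in the pattern are only ignored under re.VERBOSE.
EMOTICON_RE = re.compile(r"""
    (?:
      [<>]?
      [:;=8]                      # eyes
      [\-o\*\']?                  # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth
      [\-o\*\']?                  # optional nose
      [:;=8]                      # eyes
      [<>]?
    )""", re.VERBOSE)

MENTION_RE = re.compile(r'@\w+')  # assumption: "@" plus word characters

def strip_emoticons_and_mentions(text):
    text = EMOTICON_RE.sub('', text)
    text = MENTION_RE.sub('', text)
    return ' '.join(text.split())  # collapse the leftover whitespace

print(strip_emoticons_and_mentions('@bob great :-) thanks :D'))
```

Keep in mind that the emoticon pattern is permissive: sequences such as "d:" at the end of a word also match, so it is worth testing against real data before relying on it.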

How can I remove emojis from a dataframe?

This would be the equivalent code for pandas. It operates column by column.

df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

cleaning a string by removing emoticons

Try this:

import re
import string
s = '''Hi !こんにちは、私の給料は月額10000ドルです。 XO XO
私はあなたの料理が大好きです
私のフライトはAPX1999です。br>私はサッカーの試合を見るのが大好きです。
'''
# remove all printable ASCII chars 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# (re.escape keeps characters such as ], \ and ^ from being read as regex syntax)
replaced = re.sub(f'[{re.escape(string.printable)}]', '', s)

emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]", flags=re.UNICODE)

replaced = emoji_pattern.sub('', replaced)

print(replaced)

