Remove Punctuation from Unicode Formatted Strings

Remove punctuation from Unicode formatted strings

You could use the unicode.translate() method:

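import unicodedata
import sys

# Map every punctuation code point (category P*) to None so translate() deletes it.
# The upper bound is sys.maxunicode + 1 so the last code point is included.
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode + 1)
                    if unicodedata.category(unichr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)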
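
On Python 3 the same idea needs only small changes, since str is already Unicode and xrange/unichr no longer exist. A minimal sketch, assuming Python 3; the names PUNCT_TABLE and remove_punctuation_py3 are illustrative and not part of the original answer:

```python
import sys
import unicodedata

# Build the deletion table once: every code point whose general category
# starts with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po) is mapped to None.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

def remove_punctuation_py3(text):
    # str.translate() deletes every character whose code point maps to None
    return text.translate(PUNCT_TABLE)
```

Building the table scans every code point once, so it is probably best kept at module level rather than rebuilt per call.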

You could also use \p{P}, which is supported by the regex module (a third-party package, not the standard library re):

import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)

Strip special characters and punctuation from a unicode string

< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:

regex.sub(r'[\p{P}\p{Sm}]+', '', text)

The unicode.translate() method is another option. It takes a dictionary mapping integer code points to other integer code points, unicode strings, or None; mapping a code point to None removes it. To build such a table from string.punctuation, convert each character to its code point with ord():

text.translate(dict.fromkeys(ord(c) for c in string.punctuation))

That removes only the limited set of ASCII punctuation characters.
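
To make that limitation concrete, here is a small sketch (assuming Python 3; the sample string is made up for illustration) showing that a table built from string.punctuation leaves non-ASCII punctuation untouched:

```python
import string

# Deletion table covering only the ASCII punctuation in string.punctuation
ascii_table = dict.fromkeys(ord(c) for c in string.punctuation)

sample = "“Hello”, world… ¡hola!"
result = sample.translate(ascii_table)

# Only the ASCII comma and exclamation mark are removed; the curly quotes,
# the ellipsis character and the inverted exclamation mark survive.
print(result)  # “Hello” world… ¡hola
```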

Demo:

>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub(r'[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik

If string.punctuation is not enough, you can generate a complete unicode.translate() mapping for all P and Sm code points by iterating from 0 to sys.maxunicode and testing each value with unicodedata.category():

>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik

(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...).)
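
If scanning every code point up front feels wasteful, one possible variation is to fill the table lazily. The sketch below assumes Python 3 and relies on str.translate() looking entries up through normal indexing, so a dict subclass with __missing__ is consulted for unseen code points. The class name LazyDeleteTable is hypothetical and not from the original answer:

```python
import unicodedata

class LazyDeleteTable(dict):
    """Translation table that classifies code points on first use (hypothetical helper)."""

    def __init__(self, prefixes=("P", "Sm")):
        super().__init__()
        self.prefixes = tuple(prefixes)

    def __missing__(self, codepoint):
        # None tells str.translate() to delete the character;
        # returning the code point itself keeps it unchanged.
        category = unicodedata.category(chr(codepoint))
        value = None if category.startswith(self.prefixes) else codepoint
        self[codepoint] = value  # cache so each code point is classified once
        return value

table = LazyDeleteTable()
cleaned = "<Üäik>".translate(table)
print(cleaned)  # Üäik
```

Only the code points that actually occur in the input end up in the table, which may matter when the text uses a small alphabet.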

Remove selected punctuation from unicode strings

You can negate the \p{P} with \P{P} then put it in a negated character class ([^…]) along with whatever characters you want to keep, like this:

return regex.sub(ur"[^\P{P}-]+", " ", text)

This will match one or more of any character in \p{P} except those that are also defined inside the character class.

Remember that - is a special character within a character class. If it doesn't appear at the start or end of the character class, you'll probably need to escape it.


Another solution would be to use a negative lookahead ((?!…)) or a negative lookbehind ((?<!…)):

return regex.sub(ur"((?!-)\p{P})+", " ", text)

return regex.sub(ur"(\p{P}(?<!-))+", " ", text)

But for something like this I'd recommend the character class instead.
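
When the third-party regex module is not available, roughly the same behaviour can be approximated with the standard library alone. This is a sketch under that assumption (Python 3), not the method from the answer above; the function name and its keep/replacement parameters are made up for illustration. Like the + quantifier in the patterns above, it collapses each run of removed punctuation into a single replacement:

```python
import unicodedata

def replace_punctuation_except(text, keep="-", replacement=" "):
    """Replace each run of punctuation with `replacement`, except characters in `keep`."""
    out = []
    in_run = False  # True while we are inside a run of removed punctuation
    for ch in text:
        is_punct = ch not in keep and unicodedata.category(ch).startswith("P")
        if is_punct:
            if not in_run:
                out.append(replacement)  # emit one replacement per run
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)

print(replace_punctuation_except("well-known... facts!"))
```

The hyphen is kept because it is listed in keep, while the run of dots and the exclamation mark are each replaced by one space.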

remove punctuation from unicode: error

In Python 2, chr() only accepts values below 256, so try unichr instead of chr:

Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, unicodedata
>>> table = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(unichr(i)).startswith('P'))
>>>

Remove all punctuation from string except full stop (.) and colon (:) in Python

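You don't escape the special characters in string.punctuation for your regex. Also, you forgot to remove the colon (:) from the set of characters to strip.

Use re.escape to escape the regex special characters in the punctuation string. The final pattern will look something like this (the exact escaping depends on the Python version): [\!\"\#\$\%\&\'\(\)\*\+\,\-\/\;\<\=\>\?\@\[\\\]\^_\`\{\|\}\~]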

import string
import re
remove = string.punctuation

remove = remove.replace(".", "")
remove = remove.replace(":", "")

pattern = r"[{}]".format(re.escape(remove))

line = "NETWORK [listener] connection accepted from 127.0.0.1:59926 #4785 (3 connections now open)"
line = re.sub(pattern, "", line)

Output:

NETWORK listener connection accepted from 127.0.0.1:59926 4785 3 connections now open
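
For a fixed set of single characters like this, a translation table is a possible alternative that avoids regex escaping altogether. A minimal sketch, assuming Python 3 (where str.maketrans accepts a third argument listing characters to delete); the variable names are illustrative:

```python
import string

keep = ".:"  # punctuation that should survive
remove = "".join(c for c in string.punctuation if c not in keep)

# Third argument of str.maketrans: characters to delete
table = str.maketrans("", "", remove)

line = "NETWORK [listener] connection accepted from 127.0.0.1:59926 #4785 (3 connections now open)"
cleaned = line.translate(table)
print(cleaned)
```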

Python regex, remove all punctuation except hyphen for unicode string

[^\P{P}-]+

\P is the complement of \p, so \P{P} matches anything that is not punctuation. The negated class therefore matches anything that is not (non-punctuation or a dash), which is all punctuation except dashes.

Example: http://www.rubular.com/r/JsdNM3nFJ3

If you want a less convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, then check that it wasn't a dash (using a negative lookbehind).

Working example: http://www.rubular.com/r/5G62iSYTdk

Best way to strip punctuation from a string

From an efficiency perspective, you're not going to beat this (Python 2, byte strings):

s.translate(None, string.punctuation)

In Python 3, where str.translate() takes a single table argument, use the following instead:

s.translate(str.maketrans('', '', string.punctuation))

It's performing raw string operations in C with a lookup table - there's not much that will beat that but writing your own C code.

If speed isn't a worry, another option is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

This is faster than calling s.replace for each character, but won't perform as well as non-pure-Python approaches such as regexes or string.translate, as you can see from the timings below. For this type of problem, doing it at as low a level as possible pays off.

Timing code:

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

print "sets      :", timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :", timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :", timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :", timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results (total seconds for 1,000,000 calls each):

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

