Remove punctuation from Unicode formatted strings
You could use the unicode.translate() method:
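import unicodedata
import sys

# Map every punctuation codepoint (category P*) to None so translate() deletes it.
# The + 1 includes sys.maxunicode itself in the scan.
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode + 1)
                    if unicodedata.category(unichr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)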
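On Python 3, where xrange and unichr no longer exist, the same table might be built roughly as follows. This is only a sketch: the name PUNCT_TABLE and the sample string are illustrative and not part of the original answer.

```python
import sys
import unicodedata

# Map every codepoint whose Unicode category starts with "P" to None,
# so that str.translate() deletes it.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

def remove_punctuation(text):
    # str.translate() looks up each character's codepoint in the table.
    return text.translate(PUNCT_TABLE)

print(remove_punctuation("¿Hola, mundo!"))
```

Building the table scans the whole Unicode range once, so it is worth keeping it at module level rather than rebuilding it per call.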
You could also use r'\p{P}', which is supported by the regex module:
import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)
Strip special characters and punctuation from a unicode string
< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:
regex.sub('[\p{P}\p{Sm}]+', '', text)
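To see which category a given character falls into, the standard unicodedata module can be queried directly. The helper name categories below is made up for this illustration.

```python
import unicodedata

def categories(text):
    # Return a {character: general category} mapping for the given text.
    return {ch: unicodedata.category(ch) for ch in text}

for ch, cat in categories("<>+!-_").items():
    print(repr(ch), cat)
```

Categories beginning with P are punctuation; Sm is the math-symbol category that < and > belong to.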
The unicode.translate() method exists too and takes a dictionary mapping integer codepoints to other integer codepoints, a unicode string, or None; None removes that codepoint. Map string.punctuation to codepoints with ord():
text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
That removes only the limited set of ASCII punctuation characters.
Demo:
>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub('[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik
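The ASCII-only limitation is easy to see with a string that mixes ASCII and non-ASCII punctuation. A small Python 3 sketch, with a sample string invented for the purpose:

```python
import string

# Table covering only the 32 ASCII punctuation characters.
ascii_table = dict.fromkeys(ord(c) for c in string.punctuation)

sample = "«Hello», world… (really)!"
stripped = sample.translate(ascii_table)
print(stripped)
```

The comma, parentheses, and exclamation mark are removed, while the guillemets and the ellipsis character survive because they are not in string.punctuation.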
If string.punctuation is not enough, you can generate a complete unicode.translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode and testing each value against unicodedata.category():
>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik
(For Python 3, replace unicode with str, unichr with chr, and print ... with print(...).)
Remove selected punctuation from unicode strings
You can negate \p{P} with \P{P}, then put it in a negated character class ([^…]) along with whatever characters you want to keep, like this:
return regex.sub(ur"[^\P{P}-]+", " ", text)
This will match one or more of any character in \p{P} except those that are also listed inside the character class.
Remember that - is a special character within a character class. If it doesn't appear at the start or end of the character class, you'll probably need to escape it.
Another solution would be to use a negative lookahead ((?!…)) or negative lookbehind ((?<!…)):
return regex.sub(ur"((?!-)\p{P})+", " ", text)
return regex.sub(ur"(\p{P}(?<!-))+", " ", text)
But for something like this I'd recommend the character class instead.
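If installing the third-party regex module is not an option, roughly the same behavior can be approximated with the standard library. The sketch below swaps the regex for a plain loop over unicodedata categories. The function name and its keep and replacement parameters are hypothetical. Like the patterns above, it collapses each run of unwanted punctuation into a single replacement string.

```python
import unicodedata

def replace_punctuation(text, keep="-", replacement=" "):
    out = []
    in_run = False  # True while inside a run of punctuation being replaced
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P") and ch not in keep
        if is_punct:
            if not in_run:
                out.append(replacement)  # emit one replacement per run
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)

print(repr(replace_punctuation("well-known, right?!")))
```

The hyphen is preserved, the comma becomes a space, and the run ?! becomes a single space.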
remove punctuation from unicode: error
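Try unichr instead of chr. In Python 2, chr() only accepts values from 0 to 255, so it raises ValueError for higher codepoints: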
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, unicodedata
>>> table = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(unichr(i)).startswith('P'))
>>>
Remove all punctuation from string except full stop (.) and colon (:) in Python
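You don't escape the special characters in string.punctuation for your regex. You also forgot to take : out of the characters to remove.
Use re.escape to escape regex special characters in the punctuation string. On Python versions before 3.7, your final pattern will be [\!\"\#\$\%\&\'\(\)\*\+\,\-\/\;\<\=\>\?\@\[\\\]\^_\`\{\|\}\~] (newer versions escape fewer characters, but the pattern matches the same set).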
import string
import re
remove = string.punctuation
remove = remove.replace(".", "")
remove = remove.replace(":", "")
pattern = r"[{}]".format(re.escape(remove))
line = "NETWORK [listener] connection accepted from 127.0.0.1:59926 #4785 (3 connections now open)"
line = re.sub(pattern, "", line)
output:
NETWORK listener connection accepted from 127.0.0.1:59926 4785 3 connections now open
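Since only single characters are being deleted, the same result can also be obtained without a regex at all, which sidesteps the escaping problem. A Python 3 sketch using str.maketrans, offered as an alternative technique rather than the answer's own method:

```python
import string

# All ASCII punctuation except the two characters we want to keep.
remove = string.punctuation.replace(".", "").replace(":", "")

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", remove)

line = "NETWORK [listener] connection accepted from 127.0.0.1:59926 #4785 (3 connections now open)"
cleaned = line.translate(table)
print(cleaned)
```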
Python regex, remove all punctuation except hyphen for unicode string
[^\P{P}-]+
\P is the complement of \p: not punctuation. So this matches anything that is not (not punctuation or a dash), which leaves all punctuation except dashes.
Example: http://www.rubular.com/r/JsdNM3nFJ3
If you want a less convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, then check that it wasn't a dash (using a negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
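The lookbehind idea can be tried in Python with the standard re module, with one caveat: re does not support \p{P}, so this sketch substitutes a character class built from string.punctuation and therefore covers ASCII punctuation only. The names used are illustrative.

```python
import re
import string

# Match one ASCII punctuation character, then assert (looking back at the
# character just consumed) that it was not a dash.
ASCII_PUNCT_NOT_DASH = re.compile(
    "[{}](?<!-)".format(re.escape(string.punctuation))
)

def strip_ascii_punct_except_dash(text):
    return ASCII_PUNCT_NOT_DASH.sub("", text)

print(strip_ascii_punct_except_dash("semi-final: done, (really)!"))
```

For full Unicode coverage the \p{P} forms shown above, with the regex module, are still needed.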
Best way to strip punctuation from a string
From an efficiency perspective, you're not going to beat this (Python 2):
s.translate(None, string.punctuation)
In Python 3, use the following instead:
s.translate(str.maketrans('', '', string.punctuation))
It's performing raw string operations in C with a lookup table - there's not much that will beat that but writing your own C code.
If speed isn't a worry, another option is:
exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)
This is faster than s.replace with each char, but won't perform as well as non-pure python approaches such as regexes or string.translate, as you can see from the below timings. For this type of problem, doing it at as low a level as possible pays off.
Timing code:
import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)
This gives the following results:
sets : 19.8566138744
regex : 6.86155414581
translate : 2.12455511093
replace : 28.4436721802