Python Strip Hyphen from Block of String

string = """"(CB)-year-(3F)-year-
(56)-ADDR(01)-DATA(06)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new-
(56)-ADDR(01)-DATA(03)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new-
(05)-ADDR5-[address0]-(E0)-tWHR2-nintK-
(56)-ADDR(01)-DATA(05)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new-"""

Your string ends with a hyphen, but Python treats the whole block as one string, not as separate lines. The hyphens at the end of the inner lines are therefore not at the end of the string, so .endswith() never sees them.

The lines are separated only by the newline character \n, so split on it first, strip each line, and join the results again:

In [12]: print('\n'.join([i[:-1] if i[-1] == '-' else i for i in string.split('\n')]))
"(CB)-year-(3F)-year
(56)-ADDR(01)-DATA(06)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new
(56)-ADDR(01)-DATA(03)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new
(05)-ADDR5-[address0]-(E0)-tWHR2-nintK
(56)-ADDR(01)-DATA(05)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new

Logic:

'\n'.join(...) joins all the strings in the iterable with \n between them

i[:-1] gives the string without its last character

i[-1] == '-' checks whether the last character of the string is a hyphen

string.split('\n') splits the string on the separator \n, producing a list of strings that the list comprehension iterates over
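
Note that i[-1] raises an IndexError on an empty line. A minimal sketch of the same split-and-join idea that tolerates blank lines by using str.endswith; the function name strip_trailing_hyphen_per_line and the sample block are illustrative and not part of the original answer:

```python
def strip_trailing_hyphen_per_line(block):
    """Remove one trailing '-' from each line of a multi-line string."""
    cleaned = []
    for line in block.split('\n'):
        # endswith() is safe on empty strings, unlike line[-1]
        cleaned.append(line[:-1] if line.endswith('-') else line)
    return '\n'.join(cleaned)


# Hypothetical sample containing a blank line and a line with no trailing hyphen
sample = "(CB)-year-\n\n(05)-ADDR5-nintK-\nno-trailing"
print(strip_trailing_hyphen_per_line(sample))
```

Like the one-liner, this removes only a single trailing hyphen per line; use line.rstrip('-') instead if every trailing hyphen should go.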


Time comparison:

In [18]: %timeit re.sub('-:?$', '', string, flags=re.MULTILINE)
2.74 µs ± 91.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [19]: %timeit '\n'.join([i[:-1] if i[-1] == '-' else i for i in string.split('\n')])
1.56 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
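
Outside IPython, the same comparison can be run with the standard-library timeit module. This is a rough sketch: the sample string is shortened, the function names are made up for illustration, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

# Shortened sample; every line ends with a hyphen, as in the block above
sample = "(CB)-year-(3F)-year-\n(05)-ADDR5-[address0]-(E0)-tWHR2-nintK-"


def with_regex(text):
    # Same pattern as in the timing above; MULTILINE lets $ match at each line end
    return re.sub('-:?$', '', text, flags=re.MULTILINE)


def with_split_join(text):
    return '\n'.join([i[:-1] if i[-1] == '-' else i for i in text.split('\n')])


# Both approaches should agree before their speed is compared
assert with_regex(sample) == with_split_join(sample)

for func in (with_regex, with_split_join):
    seconds = timeit.timeit(lambda: func(sample), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s for 100,000 calls")
```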

How can I remove all trailing dashes from a string?

string1.rstrip("-")
# "title"
string2.rstrip("-")
# "title"
string3.rstrip("-")
# "title-is-a-title"
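
The answer does not show the input strings. A self-contained version, with inputs assumed purely so that the results match the comments above:

```python
# Hypothetical inputs; chosen only to reproduce the results quoted above
string1 = "title-"
string2 = "title---"
string3 = "title-is-a-title--"

for s in (string1, string2, string3):
    # rstrip("-") removes every trailing hyphen but leaves inner ones alone
    print(repr(s.rstrip("-")))

# The argument is a set of characters, not a suffix
print(repr("title-_-".rstrip("-_")))
```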

Remove selective hyphenations / punctuations based on list of exceptions

You could try this:

S = s.str.split(expand=True).T[0]
' '.join(np.where(S.isin(list_to_keep), S, S.str.replace('-', '')))

Output:

'do not-remove this-hyphen but removeall of thesehyphens'

How it works.

  • Create a pd.Series, S, by splitting the string with the .str accessor
    (expand=True returns a DataFrame), then transposing it and taking the
    first column.
  • Use np.where to keep each term that is in list_to_keep unchanged and
    to remove the hyphens from every other term with replace.
  • Use join to reconstruct the string from the updated terms.
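
For a single string, the same idea can be sketched without pandas or NumPy. This swaps in plain Python for the vectorized calls; the input sentence and list_to_keep below are assumptions reconstructed from the output shown above, and remove_hyphens_except is a hypothetical helper name:

```python
def remove_hyphens_except(text, keep):
    """Drop hyphens from every whitespace-separated term not listed in keep."""
    keep = set(keep)
    terms = text.split()
    return ' '.join(t if t in keep else t.replace('-', '') for t in terms)


# Assumed inputs, not taken from the original question
sentence = 'do not-remove this-hyphen but remove-all of these-hyphens'
list_to_keep = ['not-remove', 'this-hyphen']

print(remove_hyphens_except(sentence, list_to_keep))
# do not-remove this-hyphen but removeall of thesehyphens
```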

How to split a string based on either a colon or a hyphen?

To split on more than one delimiter, you can use re.split and a character set:

import re
re.split('[-:]', a)

Demo:

>>> import re
>>> a = '4-6'
>>> b = '7:10'
>>> re.split('[-:]', a)
['4', '6']
>>> re.split('[-:]', b)
['7', '10']

Note however that - is also used to specify a range of characters in a character set. For example, [A-Z] will match all uppercase letters. To avoid this behavior, you can put the - at the start of the set as I did above. For more information on Regex syntax, see Regular Expression Syntax in the docs.
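
A short check of how the position of - changes the meaning of a character set. The sample strings are made up for illustration:

```python
import re

# '-' first in the set: a literal hyphen, so this splits on '-' or ':'
print(re.split('[-:]', '4-6:8'))         # ['4', '6', '8']

# '-' between two characters: a range, here every character from 'a' to 'c'
print(re.findall('[a-c]', 'a-b-c-d'))    # ['a', 'b', 'c']

# Escaping also yields a literal hyphen, wherever it sits in the set
print(re.findall(r'[a\-c]', 'a-b-c-d'))  # ['a', '-', '-', 'c', '-']
```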

Replace all hyphens except in between two digits

You may use

import re
p = re.compile(r'(?!(?<=\d)-\d)-')
test_str = "12345-4567 hello-you 45-year N-45"
print(re.sub(p, " ", test_str))
# => 12345-4567 hello you 45 year N 45

The (?!(?<=\d)-\d)- regex matches:

  • (?!(?<=\d)-\d) - a negative lookahead that fails if the current position is immediately preceded by a digit and immediately followed by a hyphen and a digit
  • - - a hyphen.

Another approach is to match and capture postal code like strings to keep them and replace - in all other contexts:

re.sub(r'\b(\d{5}-\d{4})\b|-', r'\1 ', text)

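Note that \b(\d{5}-\d{4})\b matches a word boundary, then five digits, a hyphen, four digits, and another word boundary, capturing the digits and hyphen into Group 1. The \1 backreference in the replacement pattern refers to the value captured in Group 1; when the bare-hyphen alternative matched instead, Group 1 is unmatched and (in Python 3.5 and later) is replaced with an empty string.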
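
One side effect of the r'\1 ' replacement is that a kept postal code gets an extra space appended. A possible variation, sketched here with a callable replacement so that kept matches are returned untouched; the helper name replace_hyphens_keep_zip is hypothetical and the test string is borrowed from the first snippet:

```python
import re

ZIP_OR_HYPHEN = re.compile(r'\b(\d{5}-\d{4})\b|-')


def replace_hyphens_keep_zip(text):
    """Replace hyphens with spaces, except inside 5+4 digit postal codes."""
    # group(1) is None when the bare-hyphen alternative matched
    return ZIP_OR_HYPHEN.sub(lambda m: m.group(1) or ' ', text)


text = "12345-4567 hello-you 45-year N-45"
print(replace_hyphens_keep_zip(text))
# 12345-4567 hello you 45 year N 45
```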

Remove character from the middle of a string

I've solved this problem using pysam, which is faster, safer, and requires less disk space because no SAM file is needed. It's not perfect: I'm still learning Python and have used pysam for only half a day.

import pysam
import sys
from re import sub

# Provide a BAM file as the only command-line argument
if len(sys.argv) == 2:
    assert sys.argv[1].endswith('.bam')

# Build the output filename from the input filename
inbamfn = sys.argv[1]
outbamfn = sub(r'\.bam$', '.fixRX.bam', inbamfn)

# pysam.Samfile is the older name for pysam.AlignmentFile
inbam = pysam.Samfile(inbamfn, 'rb')
outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)

# Counters for reads processed and written
n = 0
w = 0

# .get_tag() retrieves the RX tag from each read
for read in inbam.fetch(until_eof=True):
    n += 1
    umi = read.get_tag('RX')
    assert umi is not None
    # Drop the character at index 6 (the hyphen in the middle of the UMI)
    umifix = umi[:6] + umi[7:]
    read.set_tag('RX', umifix, value_type='Z')
    if '-' in umifix:
        # A hyphen somewhere else means the UMI has an unexpected layout
        print('Hyphen found in UMI:', umifix, read)
        break
    else:
        w += 1
        outbam.write(read)

inbam.close()
outbam.close()

print('Processed', n, 'reads:\n',
      w, 'UMIs written.\n',
      str(int((w / n) * 100)) + '% of UMIs fixed')
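
The core of the script is the slice umi[:6] + umi[7:], which drops the character at index 6. A standalone sketch of that step that runs without pysam; the UMI value is a made-up example and remove_char_at is a hypothetical helper:

```python
def remove_char_at(text, index):
    """Return text without the character at the given zero-based index."""
    return text[:index] + text[index + 1:]


umi = 'ACGTAC-TTGCAA'  # hypothetical UMI with a hyphen at index 6
print(remove_char_at(umi, 6))
# ACGTACTTGCAA
```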

Remove the dash based on the number of characters

Use a lookbehind and lookahead:

(?<=\S\S)-(?=\S\S)

This matches a dash (hyphen) that is preceded and followed by at least 2 non-whitespace characters.

Code:

>>> import re
>>> reg = re.compile(r'(?<=\S\S)-(?=\S\S)')

>>> reg.sub(' ', 'test 10 MF-MT this FOR test')
'test 10 MF MT this FOR test'

>>> reg.sub(' ', 'test 10 M-M this FOR test')
'test 10 M-M this FOR test'
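
Because the lookarounds only inspect the two characters on each side of the dash, longer runs match as well, while a single character on either side does not. A quick check with made-up inputs:

```python
import re

reg = re.compile(r'(?<=\S\S)-(?=\S\S)')

print(reg.sub(' ', 'ABC-DEF'))   # ABC DEF   (more than two on each side still matches)
print(reg.sub(' ', 'A-BC'))      # A-BC      (only one character before the dash)
print(reg.sub(' ', 'AB-C'))      # AB-C      (only one character after the dash)
print(reg.sub(' ', 'AB-CD-EF'))  # AB CD EF  (each dash is checked independently)
```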

Remove words with spaces or "-" in them Python

The problem with your regex is grouping. Using (-)?|( )? as a separator does not do what you think it does.

Consider what happens when the list of words is a,b:

>>> regex = "(-)?|( )?".join(["a", "b"])
>>> regex
'a(-)?|( )?b'

You'd like this regex to match ab or a b or a-b, but clearly it does not do that. It matches a, a-, b or <space>b instead!

>>> re.match(regex, 'a')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, 'a-')
<_sre.SRE_Match object at 0x7f68c9f3b718>
>>> re.match(regex, 'b')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, ' b')
<_sre.SRE_Match object at 0x7f68c9f3b718>

To fix this you can enclose the separator in its own group: ([- ])?.

If you also want to match words like wonder - land (i.e. where there are spaces before/after the hyphen) you should use the following (\s*-?\s*)?.
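
A small sketch of both separators in use. The helper build_word_regex and the word parts are hypothetical, the separators are written without the capturing parentheses (which does not change what they match), and re.escape is added on the assumption that the parts are literal text:

```python
import re


def build_word_regex(parts, separator=r'[- ]?'):
    """Join literal word parts with an optional-separator pattern."""
    return re.compile(separator.join(re.escape(p) for p in parts))


strict = build_word_regex(['wonder', 'land'])
loose = build_word_regex(['wonder', 'land'], separator=r'\s*-?\s*')

candidates = ['wonderland', 'wonder land', 'wonder-land', 'wonder - land', 'wonder']
for candidate in candidates:
    # fullmatch requires the whole candidate to match, not just a prefix
    print(repr(candidate),
          bool(strict.fullmatch(candidate)),
          bool(loose.fullmatch(candidate)))
```

Only the loose pattern accepts 'wonder - land', and neither accepts 'wonder' on its own.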

How to remove everything before certain character in Python

Assuming I understand the question correctly, two ways to do this come to mind. I'm including both for completeness, since I might have misread it. I think the split/parts solution is cleaner, particularly when the 'certain character' is a dot.

>>> import re
>>> msg = r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1'

>>> re.search(r'.*(..\..*)', msg).group(1)
'S8.1'

>>> parts = msg.split('.')
>>> ".".join((parts[-2][-2:], parts[-1]))
'S8.1'
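
When the separating character is known, str.rpartition is a third option that needs no regex. This is an alternative sketch rather than one of the two approaches above, and after_last is a hypothetical helper name:

```python
def after_last(text, sep):
    """Return the part of text after the last occurrence of sep."""
    head, found, tail = text.rpartition(sep)
    # rpartition returns ('', '', text) when sep is absent,
    # so tail is then the whole string
    return tail


msg = r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1'
print(after_last(msg, '-'))
# S8.1
```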

