Python strip hyphen from block of string
string = """"(CB)-year-(3F)-year-
(56)-ADDR(01)-DATA(06)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new-
(56)-ADDR(01)-DATA(03)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new-
(05)-ADDR5-[address0]-(E0)-tWHR2-nintK-
(56)-ADDR(01)-DATA(05)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new-"""
Your string above ends with -
but in Python the whole block is a single string, not several separate ones, so the hyphens at the ends of the other lines are not at the end of the string as far as .endswith() is concerned.
The lines are only separated by the newline character \n
so you need to split the string first, strip the trailing hyphen from each line, and join the lines again, as below:
In [12]: print('\n'.join([i[:-1] if i[-1] == '-' else i for i in string.split('\n')]))
"(CB)-year-(3F)-year
(56)-ADDR(01)-DATA(06)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new
(56)-ADDR(01)-DATA(03)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new
(05)-ADDR5-[address0]-(E0)-tWHR2-nintK
(56)-ADDR(01)-DATA(05)-(00)-ADDR5-PBX-CHX-[address0]-(CA)-new
Logic:
- '\n'.join(...) joins all the strings in the iterable with \n
- i[:-1] gives the string without its last character
- i[-1] == '-' checks whether the last character of the string is a hyphen - or not
- string.split('\n') splits your string on the separator \n, which results in a list of strings that the list comprehension iterates over
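One caveat with the one-liner above: i[-1] raises an IndexError if the block contains an empty line. A minimal sketch of the same split/strip/join logic that tolerates empty lines is shown below; the function name strip_trailing_hyphen and the sample text are made up for illustration.

```python
def strip_trailing_hyphen(text):
    """Remove a single trailing hyphen from every line of text."""
    return '\n'.join(
        # endswith() is safe on empty lines, unlike line[-1]
        line[:-1] if line.endswith('-') else line
        for line in text.split('\n')
    )


# Hypothetical sample with an empty line in the middle
sample = "(CB)-year-(3F)-year-\n\n(05)-ADDR5-[address0]-(E0)-tWHR2-nintK-"
print(strip_trailing_hyphen(sample))
```

Only one hyphen is removed per line, which matches the behaviour of the original list comprehension.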
Time comparison:
In [18]: %timeit re.sub('-:?$', '', string, flags=re.MULTILINE)
2.74 µs ± 91.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [19]: %timeit '\n'.join([i[:-1] if i[-1] == '-' else i for i in string.split('\n')])
1.56 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
How can I remove all trailing dashes from a string?
string1.rstrip("-")
# "title"
string2.rstrip("-")
# "title"
string3.rstrip("-")
# "title-is-a-title"
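The inputs are not shown in the answer above, so here is a self-contained sketch with assumed values for string1, string2 and string3 that reproduce the listed results. It also shows that rstrip takes a set of characters rather than a suffix.

```python
# Assumed inputs; the original values are not shown in the answer.
string1 = "title-"
string2 = "title---"
string3 = "title-is-a-title--"

results = [value.rstrip("-") for value in (string1, string2, string3)]
for result in results:
    print(result)

# The argument is a set of characters to strip, not a literal suffix:
# every trailing '-' or '_' is removed, in any order.
mixed = "title-_-".rstrip("-_")
print(mixed)
```

Hyphens inside the string are never touched, because rstrip only works from the right-hand end.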
Remove selective hyphenations / punctuations based on list of exceptions
You could try this:
S = s.str.split(expand=True).T[0]
' '.join(np.where(S.isin(list_to_keep), S, S.str.replace('-', '')))
Output:
'do not-remove this-hyphen but removeall of thesehyphens'
How it works.
- Create a pd.Series, S, by using the string accessor to split the text, then transposing the resulting dataframe and taking the first column.
- Use np.where to find the terms that aren't in the list and use replace to remove their hyphens; otherwise return the original term.
- Use join to reconstruct the string from the terms in the updated pd.Series, S.
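For comparison, the same three steps can be sketched in plain Python without pandas or NumPy. The input sentence and the contents of list_to_keep are not shown in the answer, so the values below are assumptions chosen to reproduce the output above; remove_hyphens_except is a hypothetical helper name.

```python
def remove_hyphens_except(text, keep):
    """Strip hyphens from every whitespace-separated term not listed in keep."""
    keep = set(keep)
    return ' '.join(
        term if term in keep else term.replace('-', '')
        for term in text.split()
    )


# Assumed inputs (not shown in the original answer)
sentence = 'do not-remove this-hyphen but remove-all of these-hyphens'
list_to_keep = ['not-remove', 'this-hyphen']
cleaned = remove_hyphens_except(sentence, list_to_keep)
print(cleaned)
```

For a single string this avoids building a Series at all; the pandas version is more natural when the text already lives in a dataframe column.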
How to split a string based on either a colon or a hyphen?
To split on more than one delimiter, you can use re.split and a character set:
import re
re.split('[-:]', a)
Demo:
>>> import re
>>> a = '4-6'
>>> b = '7:10'
>>> re.split('[-:]', a)
['4', '6']
>>> re.split('[-:]', b)
['7', '10']
Note however that - is also used to specify a range of characters in a character set. For example, [A-Z] will match all uppercase letters. To avoid this behavior, you can put the - at the start of the set as I did above. For more information on Regex syntax, see Regular Expression Syntax in the docs.
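To make the point about hyphen placement concrete, here is a small sketch comparing a leading hyphen, an escaped hyphen, and a hyphen that forms a range. The sample strings are invented for the demonstration.

```python
import re

# A hyphen at the start of the set is a literal hyphen.
leading = re.split(r'[-:]', '4-6:8')
print(leading)

# An escaped hyphen is literal wherever it appears in the set.
escaped = re.split(r'[:\-/]', '4-6/8')
print(escaped)

# An unescaped hyphen between two characters defines a range:
# [a-c] matches 'a', 'b' or 'c', but not '-' itself.
in_range = re.findall(r'[a-c]', 'a-b-c-d')
print(in_range)
```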
Replace all hyphens except in between two digits
You may use
import re
p = re.compile(r'(?!(?<=\d)-\d)-')
test_str = "12345-4567 hello-you 45-year N-45"
print(re.sub(p, " ", test_str))
# => 12345-4567 hello you 45 year N 45
See the Python demo and the regex demo.
The (?!(?<=\d)-\d)- regex matches:
- (?!(?<=\d)-\d) - a location in a string that is not immediately followed with a - (that is immediately preceded with a digit) followed with a digit
- - - a hyphen.
Another approach is to match and capture postal-code-like strings to keep them, and replace - in all other contexts:
re.sub(r'\b(\d{5}-\d{4})\b|-', r'\1 ', text)
See the regex demo and the Python demo.
Note that \b(\d{5}-\d{4})\b matches a word boundary position first, then matches and captures into Group 1 any five digits, a hyphen and four digits, and finally matches a word boundary again. The \1 backreference in the replacement pattern refers to the value captured in Group 1.
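Two details of the r'\1 ' replacement are worth knowing: a kept postal code gets an extra space appended, and before Python 3.5 a reference to a group that did not participate in the match raised an error in the replacement template. A hedged variation, assuming the same pattern, is to pass a function as the replacement; replace_hyphens is a hypothetical helper name.

```python
import re

# Same alternation as above: a ZIP+4-like code (captured) or a lone hyphen
POSTAL_OR_HYPHEN = re.compile(r'\b(\d{5}-\d{4})\b|-')


def replace_hyphens(text):
    """Replace hyphens with spaces, leaving NNNNN-NNNN codes untouched."""
    # group(1) is None when the bare-hyphen branch matched
    return POSTAL_OR_HYPHEN.sub(lambda m: m.group(1) or ' ', text)


test_str = "12345-4567 hello-you 45-year N-45"
print(replace_hyphens(test_str))
```

With a callable, the captured code is put back unchanged and only the lone hyphens become spaces.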
Remove character from the middle of a string
I've solved this problem using pysam which is faster, safer and requires less disk space as a sam file is not required. It's not perfect, I'm still learning python and have used pysam for half a day.
import pysam
import sys
from re import sub

# Provide a bam file
if len(sys.argv) == 2:
    assert sys.argv[1].endswith('.bam')

# Makes output filehandle
inbamfn = sys.argv[1]
# Escape the dot so that only a literal '.bam' suffix is replaced
outbamfn = sub(r'\.bam$', '.fixRX.bam', inbamfn)
inbam = pysam.Samfile(inbamfn, 'rb')
outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)

# Counters for reads processed and written
n = 0
w = 0

# .get_tag() retrieves RX tag from each read
for read in inbam.fetch(until_eof=True):
    n += 1
    umi = read.get_tag('RX')
    assert umi is not None
    # Drop the character at index 6 (the 7th character) of the UMI
    umifix = umi[:6] + umi[7:]
    read.set_tag('RX', umifix, value_type='Z')
    if '-' in umifix:
        print('Hyphen found in UMI:', umifix, read)
        break
    else:
        w += 1
        outbam.write(read)

inbam.close()
outbam.close()

print('Processed', n, 'reads:\n',
      w, 'UMIs written.\n',
      str(int((w / n) * 100)) + '% of UMIs fixed')
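The string operation at the heart of the script can be tried without pysam or a BAM file. Below is a minimal sketch of the slicing step on its own; the UMI value is invented, and drop_char_at is a hypothetical helper name.

```python
def drop_char_at(text, index):
    """Return text with the character at position index removed."""
    return text[:index] + text[index + 1:]


# Hypothetical RX tag value with a hyphen at index 6
umi = "ACGTAC-TTGCAA"
fixed = drop_char_at(umi, 6)
print(fixed)
```

Slicing by position removes whatever character sits at that index, so the later check for a remaining '-' in the script is a useful safeguard against tags with a different layout.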
Remove the dash based on the number of characters
Use a lookbehind and lookahead:
(?<=\S\S)-(?=\S\S)
This matches a dash (hyphen) that is preceded and followed by at least 2 non-whitespace characters.
RegEx Demo
Code:
>>> import re
>>> reg = re.compile(r'(?<=\S\S)-(?=\S\S)')
>>> reg.sub(' ', 'test 10 MF-MT this FOR test')
'test 10 MF MT this FOR test'
>>> reg.sub(' ', 'test 10 M-M this FOR test')
'test 10 M-M this FOR test'
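Since the lookarounds only require two non-whitespace characters on each side, longer runs match too, while a single character on either side does not. A short sketch with invented inputs:

```python
import re

reg = re.compile(r'(?<=\S\S)-(?=\S\S)')

# Three or more characters on each side still satisfy both lookarounds.
longer = reg.sub(' ', 'ABC-DEFG')
print(longer)

# Only one character after the hyphen: the lookahead fails.
short_after = reg.sub(' ', 'AB-C')
print(short_after)

# Only one character before the hyphen: the lookbehind fails.
short_before = reg.sub(' ', 'A-BC')
print(short_before)
```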
Remove words with spaces or "-" in them Python
The problem with your regex is grouping. Using (-)?|( )? as a separator does not do what you think it does.
Consider what happens when the list of words is a,b:
>>> regex = "(-)?|( )?".join(["a", "b"])
>>> regex
'a(-)?|( )?b'
You'd like this regex to match ab or a b or a-b, but clearly it does not do that. It matches a, a-, b or <space>b instead!
>>> import re
>>> re.match(regex, 'a')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, 'a-')
<_sre.SRE_Match object at 0x7f68c9f3b718>
>>> re.match(regex, 'b')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, ' b')
<_sre.SRE_Match object at 0x7f68c9f3b718>
To fix this you can enclose the separator in its own group: ([- ])?.
If you also want to match words like wonder - land (i.e. where there are spaces before/after the hyphen) you should use the following: (\s*-?\s*)?.
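As a rough sketch of the fix, the separator can be written as an optional character set so that the alternation no longer splits the whole pattern. The helper name build_pattern and the word list are assumptions for illustration; fullmatch is used so that partial matches do not count.

```python
import re


def build_pattern(words):
    """Join words with an optional single hyphen or space between them."""
    separator = r'[- ]?'
    # re.escape guards against regex metacharacters inside the words
    return re.compile(separator.join(re.escape(word) for word in words))


pattern = build_pattern(['a', 'b'])
for candidate in ['ab', 'a b', 'a-b', 'a', 'b']:
    print(repr(candidate), bool(pattern.fullmatch(candidate)))
```

To allow spaces around the hyphen as well, the separator could be swapped for \s*-?\s* as described above.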
How to remove everything before certain character in Python
Assuming I understand this correctly, two ways to do this come to mind.
I'm including both for completeness, and in case I have misunderstood the question. I think the split/parts solution is cleaner, particularly when the 'certain character' is a dot.
>>> import re
>>> msg = r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1'
>>> re.search(r'.*(..\..*)', msg).group(1)
'S8.1'
>>> parts = msg.split('.')
>>> ".".join((parts[-2][-2:], parts[-1]))
'S8.1'