Product Code Looks Like Abcd2343, How to Split by Letters and Numbers

A product code looks like abcd2343. How can it be split into its letters and numbers?

import re
s = 'abcd2343 abw34324 abc3243-23A'
re.split(r'(\d+)', s)

> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A']

Or, if you want to break the string only where a run of digits begins, so that each number stays with the text that follows it:

re.findall(r'\d*\D+', s)
> ['abcd', '2343 abw', '34324 abc', '3243-', '23A']


  • \d+ matches 1-or-more digits.
  • \d*\D+ matches 0-or-more digits followed by 1-or-more non-digits.
  • \d+|\D+ matches 1-or-more digits or 1-or-more non-digits.

Consult the docs for more about Python's regex syntax.
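
For a single product code such as abcd2343, where the letters come first and the digits follow, a stricter pattern can return the two parts directly. The sketch below is only an illustration: the helper name split_code and the assumption that a code is one run of ASCII letters followed by one run of digits are not part of the original answer.

```python
import re

# Hypothetical helper: assumes a code is one run of ASCII letters
# followed by one run of digits, e.g. 'abcd2343'.
CODE_RE = re.compile(r'([A-Za-z]+)(\d+)')

def split_code(code):
    """Return (letters, digits), or None if the code has another shape."""
    match = CODE_RE.fullmatch(code)
    return match.groups() if match else None

print(split_code('abcd2343'))     # ('abcd', '2343')
print(split_code('abc3243-23A'))  # None
```

Returning None for codes that do not fit the assumed shape makes unexpected input visible instead of silently producing odd fragments.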


re.split(pat, s) splits the string s using pat as the delimiter. If pat is wrapped in parentheses (making it a "capturing group"), re.split also returns the substrings matched by pat. For instance, compare:

re.split(r'\d+', s)
> ['abcd', ' abw', ' abc', '-', 'A'] # <-- just the non-matching parts

re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A'] # <-- both the non-matching parts and the captured groups

In contrast, re.findall(pat, s) returns only the parts of s that match pat:

re.findall(r'\d+', s)
> ['2343', '34324', '3243', '23']

Thus, if s ends with a digit, you could avoid ending with an empty string by using re.findall('\d+|\D+', s) instead of re.split('(\d+)', s):

s='abcd2343 abw34324 abc3243-23A 123'

re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123', '']

re.findall(r'\d+|\D+', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123']

Splitting letters from numbers within a string

Use itertools.groupby together with the str.isalpha method:

Docstring for itertools.groupby:

groupby(iterable[, keyfunc]) -> create an iterator which returns
(key, sub-iterator) grouped by each value of key(value).


Docstring for str.isalpha:

S.isalpha() -> bool

Return True if all characters in S are alphabetic
and there is at least one character in S, False otherwise.


In [1]: from itertools import groupby

In [2]: s = "125A12C15"

In [3]: [''.join(g) for _, g in groupby(s, str.isalpha)]
Out[3]: ['125', 'A', '12', 'C', '15']
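
Note that str.isalpha only separates letters from everything else, so punctuation ends up in the same group as neighbouring digits. Grouping by str.isdigit behaves differently. A small sketch of the difference (the function name runs and the sample string are made up for illustration):

```python
from itertools import groupby

def runs(text, key):
    """Join consecutive characters for which key() returns the same value."""
    return [''.join(group) for _, group in groupby(text, key)]

sample = '12-34AB'
print(runs(sample, str.isalpha))  # ['12-34', 'AB']
print(runs(sample, str.isdigit))  # ['12', '-', '34', 'AB']
```

For input made up only of letters and digits, such as "125A12C15", both key functions give the same result.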

Or possibly re.findall or re.split from the regular expressions module:

In [4]: import re

In [5]: re.findall(r'\d+|\D+', s)
Out[5]: ['125', 'A', '12', 'C', '15']

In [6]: re.split(r'(\d+)', s)  # note that you may have to filter out the empty
                               # strings at the start/end if using re.split
Out[6]: ['', '125', 'A', '12', 'C', '15', '']

In [7]: re.split(r'(\D+)', s)
Out[7]: ['125', 'A', '12', 'C', '15']

As for performance, the regex approaches appear to be faster on this input:

In [8]: %timeit re.findall(r'\d+|\D+', s*1000)
100 loops, best of 3: 2.15 ms per loop

In [9]: %timeit [''.join(g) for _, g in groupby(s*1000, str.isalpha)]
100 loops, best of 3: 8.5 ms per loop

In [10]: %timeit re.split(r'(\d+)', s*1000)
1000 loops, best of 3: 1.43 ms per loop
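
%timeit is an IPython magic, so the timings above cannot be rerun in a plain script as written. A rough equivalent with the standard timeit module might look like the following. The function names are made up for this sketch, and the numbers will vary by machine and Python version.

```python
import re
import timeit
from itertools import groupby

s = '125A12C15' * 1000

def with_findall():
    return re.findall(r'\d+|\D+', s)

def with_groupby():
    return [''.join(g) for _, g in groupby(s, str.isalpha)]

def with_split():
    return re.split(r'(\d+)', s)

RUNS = 20  # calls per measurement; kept small so the script finishes quickly
for func in (with_findall, with_groupby, with_split):
    # Take the best of three measurements, as %timeit reports.
    seconds = min(timeit.repeat(func, number=RUNS, repeat=3))
    print(f'{func.__name__}: {seconds / RUNS * 1e3:.3f} ms per call')
```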

Any way to split strings in Python at the place where an integer appears?

What about a regex, i.e. Python's re module and its split function? Something like this could work:

import re
string = 'string01string02string23string4string500string'

strlist = re.split(r'(\d+)', string)
print(strlist)
['string', '01', 'string', '02', 'string', '23', 'string', '4', 'string', '500', 'string']

In your case, I think you would then need to join each text element with the number that follows it, something like this:

cmb = [i+j for i,j in zip(strlist[::2], strlist[1::2])]
print(cmb)

['string01', 'string02', 'string23', 'string4', 'string500']
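
Note that zip stops at the shorter of its inputs, so the trailing 'string' with no number after it is dropped. If that is the desired result, a single findall with a combined pattern may be simpler. A sketch reusing the example string from above:

```python
import re

string = 'string01string02string23string4string500string'

# One or more non-digits immediately followed by one or more digits.
# A trailing piece with no digits after it does not match and is skipped.
pairs = re.findall(r'\D+\d+', string)
print(pairs)
# ['string01', 'string02', 'string23', 'string4', 'string500']
```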

Converting string of letters and numbers into array

import re

s = '2A3M4D8'
s = re.split(r'(\d+)', s)
s = list(filter(None, s))  # drop the empty strings produced by re.split
print(s)

stack = []
res = 0
letter = ''

for x in s:
    if x.isnumeric():
        stack.append(int(x))
        if letter != '':
            # two operands are on the stack: apply the pending operation
            print(stack)
            if letter == 'M':
                res = stack[0] * stack[1]
            elif letter == 'A':
                res = stack[0] + stack[1]
            elif letter == 'D':
                res = stack[0] / stack[1]
            stack = []
            print(res)
            stack.append(res)
            print(stack)
            res = 0
            letter = ''
    else:
        letter = x  # remember the operation until the next number arrives
print(stack[0])
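
The same left-to-right evaluation can be written more compactly with a lookup table of operations. This is only a sketch: the function name evaluate is hypothetical, the meaning of the letters (A = add, M = multiply, D = divide) is inferred from the loop above, and the input is assumed to be well formed, starting with a number and alternating letters and numbers.

```python
import operator
import re

# Assumed meaning of the letters, inferred from the loop above.
OPS = {'A': operator.add, 'M': operator.mul, 'D': operator.truediv}

def evaluate(expr):
    """Evaluate e.g. '2A3M4D8' strictly left to right."""
    tokens = re.findall(r'\d+|[AMD]', expr)
    result = int(tokens[0])
    # The remaining tokens come in (letter, number) pairs.
    for letter, number in zip(tokens[1::2], tokens[2::2]):
        result = OPS[letter](result, int(number))
    return result

print(evaluate('2A3M4D8'))  # 2.5
```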

Split alphanumeric strings by space and keep separator for just the first occurrence

Here is one way. We can use re.findall with the pattern [A-Za-z]+|[0-9]+, which matches each run of letters or each run of digits in turn. Then join the resulting list with spaces to get your output:

import re

inp = "Brijesh Tiwari810663 A14082014RGUBWA"
output = ' '.join(re.findall(r'[A-Za-z]+|[0-9]+', inp))
print(output) # Brijesh Tiwari 810663 A 14082014 RGUBWA

Edit: For your updated requirement, use re.sub with just one replacement:

inp = "Johnson12 is at club39"
output = re.sub(r'\b([A-Za-z]+)([0-9]+)\b', r'\1 \2', inp, count=1)
print(output) # Johnson 12 is at club39
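
One difference worth knowing: findall followed by join discards every character that is neither a letter nor a digit (punctuation, for example). If those characters should be kept, a space can instead be inserted at each letter/digit boundary with zero-width lookarounds. A sketch, where the name LETTER_DIGIT_BOUNDARY is made up for illustration:

```python
import re

# Zero-width position between an ASCII letter and a digit, in either order.
LETTER_DIGIT_BOUNDARY = re.compile(
    r'(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])'
)

inp = "Brijesh Tiwari810663 A14082014RGUBWA"
print(LETTER_DIGIT_BOUNDARY.sub(' ', inp))
# Brijesh Tiwari 810663 A 14082014 RGUBWA

# count=1 limits the substitution to the first boundary only.
print(LETTER_DIGIT_BOUNDARY.sub(' ', "Johnson12 is at club39", count=1))
# Johnson 12 is at club39
```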

Split string at the first occurrence of a number

Try splitting on the first occurrence of [ ](?=\d), that is, a space followed by a digit:

import re

text = "MARIA APARECIDA 99223-2000 / 98450-8026"
parts = re.split(r' (?=\d)', text, maxsplit=1)
print(parts)

This prints:

['MARIA APARECIDA', '99223-2000 / 98450-8026']

Note that the regex pattern used splits and consumes a single space, but does not consume the digit that follows (lookaheads do not advance the position in the input).
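
If the digit is not guaranteed to be preceded by a space, one alternative is to locate the first digit with re.search and slice the string there. This is a sketch under that assumption, and split_at_first_digit is a hypothetical helper name:

```python
import re

def split_at_first_digit(text):
    """Split text just before its first digit, trimming the space in between."""
    match = re.search(r'\d', text)
    if match is None:
        return [text]  # no digit: nothing to split
    head = text[:match.start()].rstrip()
    return [head, text[match.start():]]

print(split_at_first_digit("MARIA APARECIDA 99223-2000 / 98450-8026"))
# ['MARIA APARECIDA', '99223-2000 / 98450-8026']
```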

Split string on numeric/non-numeric boundary

You can use a combination of positive lookahead and lookbehind in a regex to find the boundaries (delimiters) at which to split the given string. Splitting on such zero-width matches requires Python 3.7 or later. Use:

import re

matches = re.split(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)', string)

For the given strings, the results are:

['abc', '0', 'foo!bar'] # 'abc0foo!bar'
['100', '.', '200', '.', '300'] # '100.200.300'
['123'] # '123'
['foo'] # 'foo'

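Explanation:

The first alternative matches the position between a non-digit and a digit; the second matches the position between a digit and a non-digit.

  1. Positive lookbehind (?<=\D)
    • \D matches any character that's not a digit.
  2. Positive lookahead (?=\d)
    • \d matches a digit (for ASCII text, equivalent to [0-9]).
  3. Positive lookbehind (?<=\d)
    • \d matches a digit (for ASCII text, equivalent to [0-9]).
  4. Positive lookahead (?=\D)
    • \D matches any character that's not a digit.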
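
As a quick check, the lookaround split can be compared against re.findall with \d+|\D+, which produces the same pieces for non-empty input and does not depend on splitting at empty matches. The sketch below assumes Python 3.7 or later for the split; the name DIGIT_BOUNDARY is made up for illustration.

```python
import re

DIGIT_BOUNDARY = re.compile(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)')

for string in ['abc0foo!bar', '100.200.300', '123', 'foo']:
    by_split = DIGIT_BOUNDARY.split(string)      # zero-width split, Python 3.7+
    by_findall = re.findall(r'\d+|\D+', string)  # also works on older versions
    print(by_split, by_split == by_findall)
```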



