Standalone Numbers Regex

Standalone numbers Regex?

Using lookaround, you can restrict your capturing to only digits which are not surrounded by other digits or decimal points:

(?<![0-9.])(\d+)(?![0-9.])

Alternatively, if you want to only match stand-alone numbers (e.g. if you don't want to match the 123 in abc123def):

(?<!\S)\d+(?!\S)
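To see how the two patterns differ, here is a small Python sketch. The sample string and the constant names are made up for illustration; only the two regexes come from the answer above.

```python
import re

# Digits not touching other digits or decimal points (may sit inside a word).
NOT_NEAR_DIGITS = re.compile(r"(?<![0-9.])(\d+)(?![0-9.])")
# Digits delimited by whitespace or by the start/end of the string.
WHITESPACE_DELIMITED = re.compile(r"(?<!\S)\d+(?!\S)")

sample = "abc123def 42 3.14 7"  # hypothetical input

print(NOT_NEAR_DIGITS.findall(sample))       # ['123', '42', '7']
print(WHITESPACE_DELIMITED.findall(sample))  # ['42', '7']
```

The first pattern still accepts the 123 inside abc123def, while the second one rejects it; both skip 3.14.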

Regex for the first standalone number

Update #2

Your regex has redundant parts that you can remove, e.g. s|^([^.]+).*$|\1|, which replaces a line with itself. If you are sure the string contains only one such number, the regex below is enough; otherwise, use one of the other solutions to capture the first one:

sed -r "s/^.* ([0-9]+) .*/\1/"

Simulating lazy version (preferred way):

  1. POSIX ERE (using -r option)

This works like the greedy version, but it is required if your string may contain more than one such number.

Regex:

 ([0-9]+) .*|.

Usage:

$ sed -r "s/ ([0-9]+) .*|./\1/g" <<< " 54 foo 43 "
54

  2. POSIX BRE

If you want to go with the oldest regex flavor still in use (POSIX BRE), then this is your choice. It works the same as the regex above but is written in BRE.

Regex:

\(\( \([0-9]*\) .*\)*.\)*

Usage:

$ sed "s/\(\( \([0-9]*\) .*\)*.\)*/\3/g" <<< " 54 foo 43 "
54

In the lazy versions, the global g modifier must be set.
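For comparison, a rough Python sketch of the same idea. The function names are hypothetical. Python's re.search already returns the first match, so the alternation trick is only needed when you want to stay with a substitution; the second function mirrors the ERE command and relies on Python 3.5+ replacing an unmatched group with an empty string.

```python
import re

def first_standalone_number(line):
    """Return the first space-delimited number in line, or None."""
    match = re.search(r" ([0-9]+) ", line)
    return match.group(1) if match else None

def first_standalone_number_sub(line):
    """Mirror of the sed ERE command: drop everything except the first number."""
    # Either the number plus the rest of the line is consumed (group 1 kept),
    # or a single other character is consumed and replaced by nothing.
    return re.sub(r" ([0-9]+) .*|.", r"\1", line)

print(first_standalone_number(" 54 foo 43 "))       # 54
print(first_standalone_number_sub("x 54 foo 43 "))  # 54
```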

Getting standalone numbers and not numeric-related codes

The obvious way to do it is this: (?<!AC)\d+ - a bunch of digits that is not preceded by AC. However, that fails on AC00001234, because it matches 0001234, which is preceded by 0 and not by AC. The missing piece is that you also have to assert that it is not preceded by a digit:

(?<!AC)(?<!\d)\d+

Depending on the possible input strings, a word boundary assertion can also do a similar job:

(?<!AC)\b\d+

Your code ((?<!AC\d{8})\d+) fails because it means "a bunch of digits not preceded by ACXXXXXXXX" (where X is a digit). The 00001234 in AC00001234 is not preceded by AC and eight more digits, so it is a match. You could kind of fix it by asserting it after the match: \d+(?<!AC\d{8}), but that fails for a similar reason. It will disqualify 00001234, but it does not disqualify 0000123, because there is no AC and eight digits in front of its end, only seven. So you still need a boundary assertion:

\d+(?<!AC\d{8})\b

However, this is less clear than the first two solutions (and also requires you to know the length of the ACXXXXXXXX string).
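The patterns discussed above can be compared side by side in Python. This is only a sketch: the sample string and variable names are invented, and it relies on Python's re accepting AC\d{8} in a lookbehind because that expression has a fixed width.

```python
import re

sample = "AC00001234 5678 AC00005678 42"  # hypothetical input

naive = re.findall(r"(?<!AC)\d+", sample)
not_after_digit = re.findall(r"(?<!AC)(?<!\d)\d+", sample)
word_boundary = re.findall(r"(?<!AC)\b\d+", sample)
assert_after = re.findall(r"\d+(?<!AC\d{8})\b", sample)

print(naive)            # ['0001234', '5678', '0005678', '42']
print(not_after_digit)  # ['5678', '42']
print(word_boundary)    # ['5678', '42']
print(assert_after)     # ['5678', '42']
```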

Regex to identify standalone numbers

Use the Replace method of the RegExp object:

Set RE = CreateObject("VBScript.RegExp")
RE.Global = True ' Replace every match, not just the first
RE.Pattern = "\b\d+(\s|$)"
result = RE.Replace(addr, "") ' Remove all matches from string
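For readers without VBScript at hand, roughly the same replacement can be tried in Python. The address below is a made-up example and the final strip() is an addition to tidy the leftover trailing space; the pattern itself is the one from the answer.

```python
import re

addr = "12 Main Street Apt 4B 90210"  # hypothetical address

# Remove every run of digits that starts at a word boundary and is
# followed by whitespace or the end of the string.
result = re.sub(r"\b\d+(\s|$)", "", addr).strip()

print(result)  # Main Street Apt 4B
```

The 4 in 4B survives because it is followed by a letter rather than whitespace.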

Stata Regex for 'standalone' numbers in string

Following up on the loop suggestion from the comments, you could do something like the following:

clear 
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings

split string, gen(part) parse(" ") // split string at space (p.s. space is the default)

gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}

drop part* N_words

where the result would be

. list

     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+

Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
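The same filtering logic, sketched in Python for comparison (the function name is hypothetical, and the rows are the ones from the Stata example): split on whitespace and keep only the words that contain at least one letter.

```python
import re

def keep_words_with_letters(text):
    """Drop whitespace-separated words that contain no ASCII letter."""
    return " ".join(w for w in text.split() if re.search(r"[a-zA-Z]", w))

rows = ["9884 7-test 58 - 489", "67-tty 783 444", "j3782 3hty"]
cleaned = [keep_words_with_letters(row) for row in rows]

print(cleaned)  # ['7-test', '67-tty', 'j3782 3hty']
```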

Regex Python Extract number

Without regex:

text = ['C1412DRE, New York 2695','Direction 12','Main Street 6254 C13D']
joined = ' '.join(text)  # named "joined" to avoid shadowing the built-in str
[int(s) for s in joined.split() if s.isdigit()]
[2695, 12, 6254]

With regex:

import re
re.findall(r'\b\d+\b', joined)
['2695', '12', '6254']

and convert the matches to integers:

[int(s) for s in re.findall(r'\b\d+\b', joined)]
[2695, 12, 6254]

https://docs.python.org/3/library/re.html

A great playground where you can try your regex, with code generation: https://regex101.com/r/4kUHhq/1

Regular expression to include numbers but not others

You can try something like this: \b(?<!-)\d+(?!-)\b. It uses a negative lookbehind and a negative lookahead to match numbers that are neither preceded nor followed by a -.


Note: The \b is there to ensure that given 12-34, the expression does not match 1 (since it is not followed by a -) and 4 (since it is not preceded by a -).
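A quick Python check of what the note describes, using a made-up sample string: without the \b anchors, backtracking lets the pattern pick up the 1 and the 4 from 12-34.

```python
import re

sample = "12-34 56 7"  # hypothetical input

with_boundaries = re.findall(r"\b(?<!-)\d+(?!-)\b", sample)
without_boundaries = re.findall(r"(?<!-)\d+(?!-)", sample)

print(with_boundaries)     # ['56', '7']
print(without_boundaries)  # ['1', '4', '56', '7']
```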

Python regex match only if standalone

Looks like a perfect job for Negative Lookbehind and Negative Lookahead:

re.sub(r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''', 
'@percent@', string, flags=re.VERBOSE)

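(?<![^\s]) means "no non-whitespace character is allowed immediately before the current position", so the match must start at the beginning of the string or right after whitespace (add more characters to the class if you need to allow them as well).

(?![^\s.,;!?'"]) means "no character other than whitespace, a period, etc. is allowed immediately after the current position".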

Demo: https://regex101.com/r/khV7MZ/1.
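A self-contained way to try the substitution locally. The sample sentence and the names PERCENT, sample and result are assumptions for illustration; the pattern is the one from the answer, compiled with re.VERBOSE so the spaces inside it are ignored.

```python
import re

PERCENT = re.compile(
    r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
    flags=re.VERBOSE,
)

sample = "Up 12.5% today, +3%. But abc5% and 7%x stay."  # hypothetical input
result = PERCENT.sub("@percent@", sample)

print(result)  # Up @percent@ today, @percent@. But abc5% and 7%x stay.
```

abc5% is skipped because a letter precedes the digits, and 7%x is skipped because a letter follows the percent sign.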


