Using Regex to Find All Phrases That Are Completely Capitalized

Regex to match only uppercase words with some exceptions

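To some extent, this is going to vary by the "flavour" of RegEx you're using. The following is based on .NET RegEx, which uses \b for word boundaries. In the last example, it also uses negative lookarounds (?<!...) and (?!...) as well as non-capturing parentheses (?:...). The lookbehind in that example is variable-length, which .NET supports but some other flavours (Python's re module, for instance) do not.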

Basically, though, if the terms always contain at least one uppercase letter followed by at least one number, you can use

\b[A-Z]+[0-9]+\b

For all-uppercase and numbers (total must be 2 or more):

\b[A-Z0-9]{2,}\b

For all-uppercase and numbers, but starting with at least one letter:

\b[A-Z][A-Z0-9]+\b
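To see how the three patterns above differ, here is a minimal Python sketch. It assumes the patterns behave the same in Python's re module as in .NET for plain ASCII input, since they use only \b, character classes, and quantifiers. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen to contain each kind of token.
text = "Plug P1 into USB3, then check 42 and AB near X9Y."

# Letters followed by digits: rejects 42 (no letter), AB (no digit), X9Y (ends in a letter).
letters_then_digits = re.findall(r"\b[A-Z]+[0-9]+\b", text)

# Any mix of uppercase letters and digits, at least two characters long.
upper_or_digits = re.findall(r"\b[A-Z0-9]{2,}\b", text)

# Same mix, but the first character must be a letter, so 42 drops out.
letter_first = re.findall(r"\b[A-Z][A-Z0-9]+\b", text)

print(letters_then_digits)  # ['P1', 'USB3']
print(upper_or_digits)      # ['P1', 'USB3', '42', 'AB', 'X9Y']
print(letter_first)         # ['P1', 'USB3', 'AB', 'X9Y']
```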

The granddaddy, to return items that have any combination of uppercase letters and numbers, but which are not single letters at the beginning of a line and which are not part of a line that is all uppercase:

(?:(?<!^)[A-Z]\b|(?<!^[A-Z0-9 ]*)\b[A-Z0-9]+\b(?![A-Z0-9 ]$))

breakdown:

The regex starts with (?:. The ?: signifies that, although what follows is in parentheses, I'm not interested in capturing the result. This is called "non-capturing parentheses." Here, I'm using the parentheses because I'm using alternation (see below).

Inside the non-capturing parens, I have two separate clauses separated by the pipe symbol |. This is alternation -- like an "or". The regex can match the first expression or the second. The two cases here are "is this the first word of the line" or "everything else," because we have the special requirement of excluding one-letter words at the beginning of the line.

Now, let's look at each expression in the alternation.

The first expression is: (?<!^)[A-Z]\b. The main clause here is [A-Z]\b, which is any one capital letter followed by a word boundary, which could be punctuation, whitespace, linebreak, etc. The part before that is (?<!^), which is a "negative lookbehind." This is a zero-width assertion, which means it doesn't "consume" characters as part of a match -- not really important to understand that here. The syntax for negative lookbehind in .NET is (?<!x), where x is the expression that must not exist before our main clause. Here that expression is simply ^, or start-of-line, so this side of the alternation translates as "any word consisting of a single, uppercase letter that is not at the beginning of the line."

Okay, so we're matching one-letter, uppercase words that are not at the beginning of the line. We still need to match words consisting of all numbers and uppercase letters.

That is handled by a relatively small portion of the second expression in the alternation: \b[A-Z0-9]+\b. The \bs represent word boundaries, and the [A-Z0-9]+ matches one or more numbers and capital letters together.

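The rest of the expression consists of other lookarounds. (?<!^[A-Z0-9 ]*) is another negative lookbehind, where the expression is ^[A-Z0-9 ]*. This means that the text between the start of the line and the match must not consist solely of capital letters, numbers, and spaces. Because * also matches zero characters, a match at the very start of a line is rejected as well.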

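The second lookaround is (?![A-Z0-9 ]$), which is a negative lookahead. It is meant to ensure that what follows is not all capital letters and numbers. As written, though, the character class has no quantifier, so it only rejects a match that is followed by exactly one capital letter, number, or space and then the end of the line.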

So, altogether, we are capturing words of all capital letters and numbers, and excluding one-letter, uppercase characters from the start of the line and everything from lines that are all uppercase.

There is at least one weakness here in that the lookarounds in the second alternation expression act independently, so a sentence like "A P1 should connect to the J9" will match J9, but not P1, because everything before P1 is capitalized.

It is possible to get around this issue, but it would almost triple the length of the regex. Trying to do so much in a single regex is seldom, if ever, justified. You'll be better off breaking up the work either into multiple regexes or a combination of regex and standard string processing commands in your programming language of choice.
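As a rough illustration of splitting the work up, here is a Python sketch that combines two small regexes with ordinary string handling. It follows the requirements as stated in prose (skip all-uppercase lines, skip a single letter that opens a line) rather than reproducing the big regex exactly. The function name uppercase_terms and the sample text are assumptions made for this example.

```python
import re

TOKEN = re.compile(r"\b[A-Z0-9]+\b")        # a word of uppercase letters and digits
ALL_UPPER_LINE = re.compile(r"[A-Z0-9 ]*")  # a line with nothing but those and spaces

def uppercase_terms(text):
    """Collect uppercase/digit tokens, applying the exceptions line by line."""
    found = []
    for line in text.splitlines():
        # Skip lines that are entirely uppercase letters, digits, and spaces.
        if ALL_UPPER_LINE.fullmatch(line.strip()):
            continue
        for m in TOKEN.finditer(line):
            token = m.group()
            at_line_start = line[:m.start()].strip() == ""
            # Skip a single letter that opens the line, such as the article "A".
            if len(token) == 1 and token.isalpha() and at_line_start:
                continue
            found.append(token)
    return found

sample = "A P1 should connect to the J9\nTHIS LINE IS ALL CAPS 42\nSet R to 5 on X"
print(uppercase_terms(sample))  # ['P1', 'J9', 'R', '5', 'X']
```

Unlike the single regex, this version keeps P1 in the first line, because each exception is checked separately instead of through interacting lookarounds.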

Regex: find capitalized words

To extract the words into an array:

var allCapWords = str.match(/\b[A-Z]+\b/g);
-> ["STRING", "WORDS"]


To pull the last word:

var lastCapWord = allCapWords[allCapWords.length - 1];
-> "WORDS"
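For comparison, a rough Python equivalent of the two steps above might look like this. The input string is hypothetical, since the original question's string is not shown here. One difference worth noting: JavaScript's match returns null when nothing matches, while re.findall returns an empty list, so the sketch guards before taking the last item.

```python
import re

# Hypothetical input; the original question's string is not shown here.
text = "my STRING has some capital WORDS"

all_cap_words = re.findall(r"\b[A-Z]+\b", text)

# re.findall returns an empty list when nothing matches, so guard before indexing.
last_cap_word = all_cap_words[-1] if all_cap_words else None

print(all_cap_words)  # ['STRING', 'WORDS']
print(last_cap_word)  # WORDS
```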

How to find all words with first letter as upper case using Python Regex

You can use a word boundary instead of the anchors ^ and $:

\b[A-Z]\w*


Note that re.findall returns a list, so if you use matches.append, you add that whole list as a single item and end up with a list of lists. Use += (or extend) to add the individual matches instead.

import re

matches = []
regex = r"\b[A-Z]\w*"
filename = r'C:\Users\Documents\romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        matches += re.findall(regex, line)
print(matches)

Output

['Hi', 'How', 'You']
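The append pitfall is easy to reproduce without a file. In this small sketch, the list named lines is a made-up stand-in for the lines read from the file, and nested and flat are illustrative names.

```python
import re

regex = r"\b[A-Z]\w*"
lines = ["Hi there", "How are You"]  # stand-in for the lines of a file

nested = []
flat = []
for line in lines:
    nested.append(re.findall(regex, line))  # adds each result list as one item
    flat.extend(re.findall(regex, line))    # adds the matches themselves (same as +=)

print(nested)  # [['Hi'], ['How', 'You']]
print(flat)    # ['Hi', 'How', 'You']
```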

If there should be a whitespace boundary to the left, you could also use

(?<!\S)[A-Z]\w*
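A quick way to see the difference between the word boundary and the whitespace boundary is to run both on text with punctuation in front of a word. The sample string here is invented for the comparison.

```python
import re

text = "Go to-Market (Fast) Now"  # hypothetical sample

# \b only needs a non-word char (or the string start) to the left.
with_word_boundary = re.findall(r"\b[A-Z]\w*", text)

# (?<!\S) needs whitespace (or the string start) to the left,
# so words right after "-" or "(" are skipped.
with_space_boundary = re.findall(r"(?<!\S)[A-Z]\w*", text)

print(with_word_boundary)   # ['Go', 'Market', 'Fast', 'Now']
print(with_space_boundary)  # ['Go', 'Now']
```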



If you don't want to match words that consist only of uppercase chars, you could for example use a negative lookahead to assert that what follows is not just uppercase chars up to a word boundary:

\b[A-Z](?![A-Z]*\b)\w*
  • \b A word boundary to prevent a partial match
  • [A-Z] Match an uppercase char A-Z
  • (?![A-Z]*\b) Negative lookahead, assert that what follows is not only uppercase chars up to a word boundary (so a single uppercase letter such as I is also rejected)
  • \w* Match optional word chars
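Here is a small check of that pattern in Python. The sample sentence is made up to cover the interesting cases: an all-uppercase word, a one-letter word, mixed case, and an uppercase word that contains a digit.

```python
import re

text = "Hello WORLD, I met McDonald at NASA HQ2"  # hypothetical sample

# WORLD, I, and NASA are rejected by the lookahead: after the first letter,
# only uppercase letters (possibly none) remain before the word boundary.
# HQ2 still matches, because the digit 2 is not an uppercase letter.
capitalized = re.findall(r"\b[A-Z](?![A-Z]*\b)\w*", text)

print(capitalized)  # ['Hello', 'McDonald', 'HQ2']
```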



To match a word that starts with an uppercase char, and does not contain any more uppercase chars:

\b[A-Z][^\WA-Z]*\b
  • \b A word boundary
  • [A-Z] Match an uppercase char A-Z
  • [^\WA-Z]* Optionally match a word char without chars A-Z
  • \b A word boundary
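The same kind of check works for this stricter pattern. Again the sample string is invented. Note that digits and underscores still count as word chars, and that a lone uppercase letter matches because the second part is optional.

```python
import re

text = "Hello WORLD McDonald Go_2 A b"  # hypothetical sample

# [^\WA-Z] means "a word char that is not A-Z", so any second uppercase
# letter inside the word (WORLD, McDonald) makes the whole word fail.
single_capital = re.findall(r"\b[A-Z][^\WA-Z]*\b", text)

print(single_capital)  # ['Hello', 'Go_2', 'A']
```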


Regular expression to find a series of uppercase words in a string

You may use this regex:

\b[A-Z]+(?:\s+[A-Z]+)*\b


RegEx Details:

  • \b: Word boundary
  • [A-Z]+: Match a word comprising only uppercase letters
  • (?:\s+[A-Z]+)*: Match 1+ whitespace followed by another word with uppercase letters. Match this group 0 or more times
  • \b: Word boundary

Code:

>>> import re
>>> s = 'This is a TEXT CONTAINING UPPER CASE WORDS and lower case words. This is a SECOND SENTENCE'
>>> print(re.findall(r'\b[A-Z]+(?:\s+[A-Z]+)*\b', s))
['TEXT CONTAINING UPPER CASE WORDS', 'SECOND SENTENCE']

Regex to match uppercase Expressions and Words

You can simply use

\b[A-Z]+(?:\s+[A-Z]+)*\b


I added (?:\s+[A-Z]+)* to the regex to match 0 or more sequences of:

  • \s+ - 1 or more whitespace
  • [A-Z]+ - 1 or more characters from A-Z range.


Note that in case you need to match Unicode uppercase letters, use \p{Lu} instead of [A-Z] (it will also match accented letters):

\b\p{Lu}+(?:\s+\p{Lu}+)*\b
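If you are working in Python, be aware that, as far as I know, the standard re module does not accept \p{Lu} (the third-party regex package does). One possible workaround using only the standard library is to build the character class yourself from unicodedata. This is a sketch of a swapped-in technique, not an exact equivalent of \p{Lu}, and the names upper, upper_class, and the sample text are assumptions for illustration.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category is Lu.
# None of these characters is special inside a character class.
upper = "".join(
    ch for ch in map(chr, range(sys.maxunicode + 1))
    if unicodedata.category(ch) == "Lu"
)
upper_class = "[" + upper + "]"

# Same shape as the pattern above, with the hand-built class in place of \p{Lu}.
pattern = re.compile(rf"\b{upper_class}+(?:\s+{upper_class}+)*\b")

text = "Das ist ÜBER GROSS und dann ÉCOLE"  # hypothetical sample
print(pattern.findall(text))  # ['ÜBER GROSS', 'ÉCOLE']
```

Building the class scans the whole Unicode range once, so it is best done a single time at import and reused.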

How do I get all words that begin with a capital letter following a specific string?

How do you like /Name ((?:[A-Z]\w+[ -]?)+)/?

Regex101: https://regex101.com/r/BFJBpZ/1
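In case it helps, here is how that pattern might be used from Python. The input string is hypothetical. Two things to watch for: the optional [ -] after the last word ends up inside the capture, and [A-Z]\w+ needs at least two characters, so a lone initial will not match.

```python
import re

text = "Name John Smith-Jones lives here"  # hypothetical input
match = re.search(r"Name ((?:[A-Z]\w+[ -]?)+)", text)

raw = match.group(1) if match else ""

# The trailing separator consumed by [ -]? is part of the capture, so trim it.
name = raw.rstrip(" -")

print(repr(raw))  # 'John Smith-Jones '
print(name)       # John Smith-Jones
```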

Regex to find words starting with capital letters not at beginning of sentence

You may use the following expression:

(?<!^)(?<!\. )[A-Z][a-z]+



import re
mystr="This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence."

print(re.findall(r'(?<!^)(?<!\. )[A-Z][a-z]+',mystr))

Prints:

['Test', 'Supposed', 'Ignore', 'Words', 'Sentence']
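The expression above only treats a period plus a space as a sentence break. If sentences may also end in ? or !, one possible variation is to put a character class in the lookbehind. It stays two characters wide, which matters because Python's re requires fixed-width lookbehinds. The sample text is made up for this sketch.

```python
import re

text = "Is this a Test? Yes it Is! Really Nice."  # hypothetical sample

# (?<!^) skips the first word; (?<![.!?] ) skips words right after ". ", "! " or "? ".
pattern = r"(?<!^)(?<![.!?] )[A-Z][a-z]+"
mid_sentence = re.findall(pattern, text)

print(mid_sentence)  # ['Test', 'Is', 'Nice']
```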

REGEX to find the first one or two capitalized words in a string

You may use

^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?


Basically, use a character class [-a-zA-Z]* instead of a dot matching pattern to only match letters and a hyphen.

Details

  • ^ - start of string
  • [A-Z] - an uppercase ASCII letter
  • [-a-zA-Z]* - zero or more ASCII letters / hyphens
  • (?:\s+[A-Z][-a-zA-Z]*)? - an optional (1 or 0 due to ? quantifier) sequence of:

    • \s+ - 1+ whitespace
    • [A-Z] - an uppercase ASCII letter
    • [-a-zA-Z]* - zero or more ASCII letters / hyphens

A Unicode aware equivalent (for the regex flavors supporting Unicode property classes):

^\p{Lu}[-\p{L}]*(?:\s+\p{Lu}[-\p{L}]*)?

where \p{L} matches any letter and \p{Lu} matches any uppercase letter.
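A short Python sketch of the ASCII version follows. The helper name leading_capitalized and the sample inputs are assumptions for illustration. Since the optional group can apply only once, a third capitalized word is left out.

```python
import re

pattern = re.compile(r"^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?")

def leading_capitalized(text):
    """Return the first one or two capitalized words, or None if there are none."""
    m = pattern.match(text)
    return m.group() if m else None

print(leading_capitalized("New York City is big"))  # New York
print(leading_capitalized("Jean-Luc went home"))    # Jean-Luc
print(leading_capitalized("hello World"))           # None
```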


