Regex for Existence of Some Words Whose Order Doesn't Matter

Regex for existence of some words whose order doesn't matter

See this regex:

/^(?=.*Tim)(?=.*stupid).+/

Regex explanation:

  • ^ Asserts position at start of string.
  • (?=.*Tim) Asserts that "Tim" is present in the string.
  • (?=.*stupid) Asserts that "stupid" is present in the string.
  • .+ Now that both phrases are known to be present, the string is valid. Use .+ (or the possessive .++, where the engine supports it) to match the entire string.

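To require additional words, add another (?=.*<to_assert>) group for each one. With only two words, the regex can be simplified to /^(?=.*Tim).*stupid/, although the match then ends at the last "stupid" instead of covering the whole string.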


>>> import re
>>> text = """
... Tim is so stupid.
... stupid Tim!
... Tim foobar barfoo.
... Where is Tim?"""
>>> m = re.findall(r'^(?=.*Tim)(?=.*stupid).+$', text, re.MULTILINE)
>>> m
['Tim is so stupid.', 'stupid Tim!']
>>> m = re.findall(r'^(?=.*Tim).*stupid', text, re.MULTILINE)
>>> m
['Tim is so stupid', 'stupid']
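
The same lookahead pattern can be generated from a list of words instead of being typed by hand. A minimal sketch, assuming the words should be matched literally (hence re.escape); the helper name contains_all is made up for this example:

```python
import re

def contains_all(words):
    """Compile a regex matching any line that contains every word, in any order."""
    # One lookahead per word; re.escape keeps regex metacharacters literal.
    lookaheads = "".join(f"(?=.*{re.escape(w)})" for w in words)
    return re.compile(f"^{lookaheads}.+$", re.MULTILINE)

text = """
Tim is so stupid.
stupid Tim!
Tim foobar barfoo.
Where is Tim?"""

pattern = contains_all(["Tim", "stupid"])
matches = pattern.findall(text)
print(matches)
```

Adding a third required word is then just one more entry in the list.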


Regex for matching adjacent words where order doesn't matter

There is no strictly better solution, but there's an alternative.

Now, if you have two normal words like "fat" and "cat", then (fat cat|cat fat) is undoubtedly the best solution. But what if you have 5 words? Or if you have more complex patterns than just fat and cat that you don't want to type twice?

Say instead of fat and cat you have 3 regex patterns A, B and C, and instead of the space between fat and cat you have the regex pattern S. In that case, you could use this recipe:

(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C))){3}

If you don't have an S, this can be simplified to

(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)){3}

(Note: (?:X) can be simplified to X if X doesn't contain an alternation |.)

Example

If we set A = fat, B = cat and S = space, we get:

(?:(?:(?!\1)()|\1 )(?:(?!\2)()fat|(?!\3)()cat)){2}



Explanation

In essence, we're using capture groups to "remember" which patterns have already matched. To do so, we use this little pattern here:

(?!\1)()some_pattern

What does this do? It's a regex that matches exactly once. Once it has matched, it won't ever match again. If you try to add a quantifier around that pattern like (?:(?!\1)()some_pattern)* it'll match either once or won't match at all.

The trick there is the usage of a backreference to a capture group before that group has even been defined. Because capture groups are initialized with a "failed to match" state, the negative lookahead (?!\1) will match successfully - but only the first time. Because right afterwards, the capture group () matches and captures the empty string. From this point forward, the negative lookahead (?!\1) will never match again.

With this as a building block, we can create a regex that matches fatcat and catfat while only containing the words fat and cat once:

(?:(?!\1)()fat|(?!\2)()cat){2}

Because of the negative lookaheads, each word can only match at most once. Adding a {2} quantifier at the end guarantees that each of the two words matches exactly once, or the entire match fails.

Now we just need to find a way to match a space between fat and cat. Well, that's just a slight variation of the same pattern:

(?:(?!\1)()|\1 )

This pattern will match the empty string on its first match, and on each subsequent match it'll match a space.

Put it all together, and voilà:

(?:(?:(?!\1)()|\1 )(?:(?!\2)()fat|(?!\3)()cat)){2}

Templates (for the lazy)

2 patterns A and B, with separator S:

(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B))){2}

3 patterns A, B and C, with separator S:

(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C))){3}

4 patterns A, B, C and D, with separator S:

(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C)|(?!\5)()(?:D))){4}

2 patterns A and B, without S:

(?:(?!\1)()(?:A)|(?!\2)()(?:B)){2}

3 patterns A, B and C, without S:

(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)){3}

4 patterns A, B, C and D, without S:

(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)|(?!\4)()(?:D)){4}
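
For more than four patterns, the templates above can be generated instead of typed. A minimal sketch that only builds the pattern string; the function name unordered_template is hypothetical. As far as I know, Python's built-in re module rejects a backreference to a group that has not been defined yet, so the result is meant to be pasted into an engine that accepts the recipe, not compiled with re.

```python
def unordered_template(patterns, separator=None):
    """Build the backreference recipe as a string for the given sub-patterns."""
    # With a separator, group 1 is reserved for the separator switch,
    # so the per-pattern "already matched" groups start at 2.
    first_group = 1 if separator is None else 2
    alternatives = "|".join(
        "(?!\\%d)()(?:%s)" % (first_group + i, p)
        for i, p in enumerate(patterns)
    )
    count = "{%d}" % len(patterns)
    if separator is None:
        return "(?:" + alternatives + ")" + count
    # Matches the empty string on the first iteration, the separator afterwards.
    switch = "(?:(?!\\1)()|\\1(?:%s))" % separator
    return "(?:" + switch + "(?:" + alternatives + "))" + count

print(unordered_template(["fat", "cat"], " "))
print(unordered_template(["A", "B", "C"]))
```

The sub-patterns are inserted verbatim, so they must already be valid regex fragments.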

Match a string between two or more words regardless of order

You may use a backreference + a subroutine:

\b(longword1|longword2)\b.*?\b(?!\1\b)(?1)\b

Expanding it for three alternatives:

\b(longword1|longword2|longword3)\b.*?\b(?!\1\b)((?1))\b.*?\b(?!(?:\1|\2)\b)(?1)\b

The list of words lives in Group 1, so for each additional word you only need to add a backreference check before the subsequent subroutine call.

Details

  • \b(longword1|longword2)\b - a whole word, either longword1 or longword2
  • .*? - any 0 or more chars other than line break chars, as few as possible
  • \b - a word boundary
  • (?!\1\b) - there cannot be the same text as matched in Group 1 followed with a word boundary
  • (?1) - a subroutine that matches the same pattern as in Group 1
  • \b - a word boundary
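
Subroutine calls such as (?1) are not available in every engine; Python's built-in re module, for one, does not support them. A rough equivalent, assuming it is acceptable to repeat the alternation in place of the subroutine call (the word list here is just the placeholder names from above):

```python
import re

words = ["longword1", "longword2", "longword3"]
alternation = "|".join(map(re.escape, words))

# Group 1 captures the first word. The second word is matched by a repeated
# copy of the alternation (standing in for the (?1) call) and must differ
# from whatever group 1 captured.
pattern = re.compile(
    rf"\b({alternation})\b.*?\b(?!\1\b)(?:{alternation})\b"
)

for sample in ("longword1 foo longword2", "longword1 foo longword1"):
    print(sample, "->", bool(pattern.search(sample)))
```

Building the alternation once and interpolating it keeps the word list in a single place, which is the same benefit the subroutine gives.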

JavaScript - regex order doesn't matter but existence required

Just use two sequential RegExps, like this:

var body = '<link rel="stylesheet" href="my.css"/> <link href="https://support.google.com/recaptcha/?hl=en" rel="canonical"/> <a href="https://www.google.com/accounts/TOS"/>'
var linkRegexp = /(<link[^>]*rel=['"]canonical['"][^>]*>)/;
var hrefRegexp = /href=['"](.*?)['"]/;

var linkBody = linkRegexp.exec(body)[1];
console.log(hrefRegexp.exec(linkBody)[1]);

  • linkRegexp - matches the link tag that has rel='canonical'
  • hrefRegexp - extracts the href from that tag

If you want just one regexp, you can try to use the alternative groups, and choose the non-empty match, like this:

var regexp = /<link[^>]*(?=href=['"]([^'"]*)['"][^>]*?rel=['"]canonical['"]|rel=['"]canonical['"][^>]*?href=['"]([^'"]*)['"])[^>]*>/;
console.log(regexp.exec(body).slice(1).join(""));

(but IMHO this is much less readable)

Find unordered words with RegEx

I think this task is best done with ordinary programming logic; a regex for it is neither simple nor efficient. Still, here is a regex that seems to do the job, whether or not any of the words (hello, my, world) is repeated in the text:

\b(hello|my|world)\b.*?((?!\1)\b(?:hello|my|world)\b).*?(?:(?!\1)(?!\2)\b(?:hello|my|world)\b)

The idea is:

  1. Match one of the words with the alternation \b(hello|my|world)\b and capture it in group 1.
  2. Allow zero or more arbitrary characters after it (.*?).
  3. Require one of the remaining two words, but not the one captured in group 1. That is what ((?!\1)\b(?:hello|my|world)\b) does; this second word is captured in group 2.
  4. Again allow zero or more arbitrary characters (.*?).
  5. Apply the same logic once more: the third word must be the one that was captured in neither group 1 nor group 2, hence (?:(?!\1)(?!\2)\b(?:hello|my|world)\b).
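
The regex can be tried out in Python, whose re module accepts it unchanged. A small sketch; the sample strings are made up for illustration:

```python
import re

pattern = re.compile(
    r"\b(hello|my|world)\b.*?"                 # first word, captured in group 1
    r"((?!\1)\b(?:hello|my|world)\b).*?"       # a different word, group 2
    r"(?:(?!\1)(?!\2)\b(?:hello|my|world)\b)"  # the one word not seen yet
)

samples = [
    "hello my world",
    "world, hello to my friends",
    "hello my hello world",   # a repeated word is tolerated
    "hello hello my",         # "world" is missing
]
results = {s: bool(pattern.search(s)) for s in samples}
for sample, found in results.items():
    print(f"{sample!r}: {found}")
```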


Regex unordered matches

You don't really want to use a regex for this unless the text is very small, which from your description I doubt.

A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.
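
In Python the counterpart of a HashSet is the built-in set. A minimal sketch, assuming words are separated by whitespace and punctuation does not need special handling; the function name is made up for this example:

```python
def contains_all_words(text, required):
    """Return True if every required word occurs in text, in any order."""
    words = set(text.split())      # split on whitespace, then deduplicate
    return set(required) <= words  # subset test: each lookup is O(1) on average

print(contains_all_words("the fat cat sat", ["cat", "fat"]))
print(contains_all_words("the fat cat sat", ["cat", "dog"]))
```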

Regex: I want this AND that AND that... in any order

You can use (?=…) positive lookahead; it asserts that a given pattern can be matched. You'd anchor at the beginning of the string, and one by one, in any order, look for a match of each of your patterns.

It'll look something like this:

^(?=.*one)(?=.*two)(?=.*three).*$

This will match a string that contains "one", "two", "three", in any order (as seen on rubular.com).

Depending on the context, you may want to anchor on \A and \Z, and use single-line mode so the dot matches everything.
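
To illustrate the difference in Python: re.DOTALL is the flag that lets the dot match newlines, and \Z there matches only at the very end of the string. A short sketch with a made-up multi-line input:

```python
import re

text = "three\ntwo\none"

# Anchored on \A and \Z, with DOTALL so that .* can cross line breaks.
whole_string = re.compile(r"\A(?=.*one)(?=.*two)(?=.*three).*\Z", re.DOTALL)

# The original form: without DOTALL the lookaheads cannot see past line one.
single_line = re.compile(r"^(?=.*one)(?=.*two)(?=.*three).*$")

print(bool(whole_string.match(text)))
print(bool(single_line.search(text)))
```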

This is not the most efficient solution to the problem. A better solution would be to parse the words out of your input and put them into an efficient set representation.


More practical example: password validation

Let's say that we want our password to:

  • Contain between 8 and 15 characters
  • Contain an uppercase letter
  • Contain a lowercase letter
  • Contain a digit
  • Contain one of the special symbols

Then we can write a regex like this:

^(?=.{8,15}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$%^&*]).*$
 \__________/\_________/\_________/\_________/\______________/
    length      upper      lower      digit        symbol
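
The same regex, split into commented pieces so that each lookahead can be read on its own. A minimal sketch in Python; the helper name is_valid_password and the sample passwords are assumptions for illustration:

```python
import re

PASSWORD_RE = re.compile(
    r"^(?=.{8,15}$)"      # length: 8 to 15 characters
    r"(?=.*[A-Z])"        # at least one uppercase letter
    r"(?=.*[a-z])"        # at least one lowercase letter
    r"(?=.*[0-9])"        # at least one digit
    r"(?=.*[!@#$%^&*])"   # at least one special symbol
    r".*$"
)

def is_valid_password(candidate):
    """Return True if the candidate satisfies all five rules."""
    return PASSWORD_RE.match(candidate) is not None

for candidate in ("Abcdef1!", "abcdef1!", "Ab1!"):
    print(candidate, is_valid_password(candidate))
```

Since every rule is an independent lookahead, rules can be added or removed without touching the others.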

Building a RegEx, 12 letters without order, fixed number of individual letters

One possibility, although I still think regex is inappropriate for this: check that every letter appears the desired number of times and that there are 12 letters in total (so there's no room left for any additional letters):

import re

for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch('(?=.*W)(?=.*S)(?=.*I)(?=.*O)'
                       '(?=.*B)(?=.*A)(?=.*R)(?=.*H)'
                       '(?=.*T.*T)(?=.*E.*E).{12}', s))

Another approach: check that no letter other than T and E appears twice, that no letter appears three times, and that the string consists of only the desired letters, 12 in total:

import re

for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch(r'(?!.*([^TE]).*\1)'
                       r'(?!.*(.).*\1.*\1)'
                       r'[WSIOBARHTE]{12}', s))

A simpler way:

for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(sorted(s) == sorted('WSIOBARHTTEE'))

