Regular Expression (regex). How to ignore or exclude everything in between?
For your example data, you might use an alternation | to match either one of the regexes in your question and then concatenate the matches.
Note that in your regex you could write (?:[a-z][a-z0-9_]*) as [a-z][a-z0-9_]*, and you don't have to escape the dot in a character class.
For example:
[0-9]{5,7}[a-z][a-z0-9_]*|-?\d*\.\d+(?![-+0-9.])
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String regex = "[0-9]{5,7}[a-z][a-z0-9_]*|-?\\d*\\.\\d+(?![-+0-9.])";
        String string = "142d 000781fe0000326f BPD false 65535 FSK_75 FSK_75 -51.984 -48";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(string);
        // Concatenate every match in order of appearance.
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            result.append(matcher.group(0));
        }
        System.out.println(result); // 000781fe0000326f-51.984
    }
}
How to let regex ignore everything between brackets?
Try this
[^a-zA-Z {}]+(?![^{]*})
This matches any run of characters allowed by the negated character class, but only if there is no closing curly bracket ahead without an opening one before it. That condition is enforced by the negative lookahead (?![^{]*}).
$string = preg_replace('/[^a-zA-Z {}]+(?![^{]*})/', '', $string);
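For readers who are not using PHP, here is a rough Python sketch of the same substitution. The sample string is made up for illustration (the question's data is not shown here), so treat it as an assumption rather than the asker's actual input.

```python
import re

# Hypothetical sample input, invented for this sketch.
text = "abc123{keep456}def789"

# Remove runs of characters other than letters, spaces and curly brackets,
# unless a closing } lies ahead with no opening { before it.
pattern = re.compile(r"[^a-zA-Z {}]+(?![^{]*})")
cleaned = pattern.sub("", text)

print(cleaned)  # abc{keep456}def
```

The digits outside the curly brackets are removed, while 456 survives because a } follows it before any {.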
Regex to ignore data between brackets
If you want to remove the {, }, , and " characters that are not inside square brackets, you can use
re.sub(r'(\[[^][]*])|[{}",]', r'\1', s)
See the regex demo. Note you can add more chars to the character set, [{}"]. If you need to add a hyphen, make sure it is the last char in the character set. Escape \, ] (if it is not the first char, right after [) and ^ (if it comes first, right after [).
Details:
(\[[^][]*]) - Capturing group 1: a [...] substring
| - or
[{}",] - a {, }, , or " char.
See a Python demo using your sample input:
import re
s = "\":{},[test1, test2]"
print( re.sub(r'(\[[^][]*])|[{}",]', r'\1', s) )
## => :[test1, test2]
Regex Pattern to Match, Excluding when... / Except between
Hans, I'll take the bait and flesh out my earlier answer. You said you want "something more complete" so I hope you won't mind the long answer—just trying to please. Let's start with some background.
First off, this is an excellent question. There are often questions about matching certain patterns except in certain contexts (for instance, within a code block or inside parentheses). These questions often give rise to fairly awkward solutions. So your question about multiple contexts is a special challenge.
Surprise
Surprisingly, there is at least one efficient solution that is general, easy to implement and a pleasure to maintain. It works with all regex flavors that allow you to inspect capture groups in your code. And it happens to answer a number of common questions that may at first sound different from yours: "match everything except Donuts", "replace all but...", "match all words except those on my mom's black list", "ignore tags", "match temperature unless italicized"...
Sadly, the technique is not well known: I estimate that in twenty SO questions that could use it, only one has one answer that mentions it—which means maybe one in fifty or sixty answers. See my exchange with Kobi in the comments. The technique is described in some depth in this article which calls it (optimistically) the "best regex trick ever". Without going into as much detail, I'll try to give you a firm grasp of how the technique works. For more detail and code samples in various languages I encourage you to consult that resource.
A Better-Known Variation
There is a variation using syntax specific to Perl and PHP that accomplishes the same. You'll see it on SO in the hands of regex masters such as CasimiretHippolyte and HamZa. I'll tell you more about this below, but my focus here is on the general solution that works with all regex flavors (as long as you can inspect capture groups in your code).
Thanks for all the background, zx81... But what's the recipe?
Key Fact
The method returns the match in Group 1 capture. It does not care at all about the overall match.
In fact, the trick is to match the various contexts we don't want (chaining these contexts using the | OR / alternation) so as to "neutralize them". After matching all the unwanted contexts, the final part of the alternation matches what we do want and captures it to Group 1.
The general recipe is
Not_this_context|Not_this_either|StayAway|(WhatYouWant)
This will match Not_this_context, but in a sense that match goes into a garbage bin, because we won't look at the overall matches: we only look at Group 1 captures.
In your case, with your digits and your three contexts to ignore, we can do:
s1|s2|s3|(\b\d+\b)
Note that because we actually match s1, s2 and s3 instead of trying to avoid them with lookarounds, the individual expressions for s1, s2 and s3 can remain clear as day. (They are the subexpressions on each side of a |.)
The whole expression can be written like this:
(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)
See this demo (but focus on the capture groups in the lower right pane.)
If you mentally try to split this regex at each | delimiter, it is actually only a series of four very simple expressions.
For flavors that support free-spacing, this reads particularly well.
(?mx)
### s1: Match line that ends with a period ###
^.*\.$
| ### OR s2: Match anything between parentheses ###
\([^\)]*\)
| ### OR s3: Match any if(...//endif block ###
if\(.*?//endif
| ### OR capture digits to Group 1 ###
(\b\d+\b)
This is exceptionally easy to read and maintain.
Extending the regex
When you want to ignore more situations s4 and s5, you add them in more alternations on the left:
s4|s5|s1|s2|s3|(\b\d+\b)
How does this work?
The contexts you don't want are added to a list of alternations on the left: they will match, but these overall matches are never examined, so matching them is a way to put them in a "garbage bin".
The content you do want, however, is captured to Group 1. You then have to check programmatically that Group 1 is set and not empty. This is a trivial programming task (and we'll later talk about how it's done), especially considering that it leaves you with a simple regex that you can understand at a glance and revise or extend as required.
I'm not always a fan of visualizations, but this one does a good job of showing how simple the method is. Each "line" corresponds to a potential match, but only the bottom line is captured into Group 1.
Debuggex Demo
Perl/PCRE Variation
In contrast to the general solution above, there exists a variation for Perl and PCRE that is often seen on SO, at least in the hands of regex Gods such as @CasimiretHippolyte and @HamZa. It is:
(?:s1|s2|s3)(*SKIP)(*F)|whatYouWant
In your case:
(?m)(?:^.*\.$|\([^()]*\)|if\(.*?//endif)(*SKIP)(*F)|\b\d+\b
This variation is a bit easier to use because the content matched in contexts s1, s2 and s3 is simply skipped, so you don't need to inspect Group 1 captures (notice the parentheses are gone). The matches only contain whatYouWant.
Note that (*F), (*FAIL) and (?!) are all the same thing. If you wanted to be more obscure, you could use (*SKIP)(?!)
Applications
Here are some common problems that this technique can often easily solve. You'll notice that the word choice can make some of these problems sound different while in fact they are virtually identical.
- How can I match foo except anywhere in a tag like <a stuff...>...</a>?
- How can I match foo except in an <i> tag or a javascript snippet (more conditions)?
- How can I match all words that are not on this black list?
- How can I ignore anything inside a SUB... END SUB block?
- How can I match everything except... s1 s2 s3?
How to Program the Group 1 Captures
You didn't ask for code, but, for completeness... The code to inspect Group 1 will obviously depend on your language of choice. At any rate it shouldn't add more than a couple of lines to the code you would use to inspect matches.
If in doubt, I recommend you look at the code samples section of the article mentioned earlier, which presents code for quite a few languages.
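As a rough illustration of how little code that is, here is a minimal Python sketch of the s1|s2|s3|(\b\d+\b) recipe. The regex is the one from this answer; the sample text is invented for the example and is not the asker's data.

```python
import re

# Three contexts to throw away, then the part we want in Group 1.
pattern = re.compile(r"(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)")

# Hypothetical sample text, one context per line.
text = "\n".join([
    "keep 11 and (skip 22) then 33",
    "this line ends with a period 44.",
    "if(55 inside) //endif 66",
])

# Ignore the overall matches; keep only matches where Group 1 is set.
numbers = [m.group(1) for m in pattern.finditer(text) if m.group(1)]

print(numbers)  # ['11', '33', '66']
```

The matches for the parenthesized part, the line ending with a period and the if(...//endif block all happen, but since Group 1 is empty for them they are simply dropped by the filter.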
Alternatives
Depending on the complexity of the question, and on the regex engine used, there are several alternatives. Here are the two that can apply to most situations, including multiple conditions. In my view, neither is nearly as attractive as the s1|s2|s3|(whatYouWant) recipe, if only because clarity always wins out.
1. Replace then Match.
A good solution that sounds hacky but works well in many environments is to work in two steps. A first regex neutralizes the context you want to ignore by replacing potentially conflicting strings. If you only want to match, then you can replace with an empty string, then run your match in the second step. If you want to replace, you can first replace the strings to be ignored with something distinctive, for instance surrounding your digits with a fixed-width chain of @@@. After this replacement, you are free to replace what you really wanted, then you'll have to revert your distinctive @@@ strings.
2. Lookarounds.
Your original post showed that you understand how to exclude a single condition using lookarounds. You said that C# is great for this, and you are right, but it is not the only option. The .NET regex flavors found in C#, VB.NET and Visual C++ for example, as well as the still-experimental regex module to replace re in Python, are the only two engines I know that support infinite-width lookbehind. With these tools, one condition in one lookbehind can take care of looking not only behind but also at the match and beyond the match, avoiding the need to coordinate with a lookahead. More conditions? More lookarounds.
Recycling the regex you had for s3 in C#, the whole pattern would look like this.
(?!.*\.)(?<!\([^()]*(?=\d+[^)]*\)))(?<!if\(\D*(?=\d+.*?//endif))\b\d+\b
But by now you know I'm not recommending this, right?
Deletions
@HamZa and @Jerry have suggested I mention an additional trick for cases when you seek to just delete WhatYouWant. You remember that the recipe to match WhatYouWant (capturing it into Group 1) was s1|s2|s3|(WhatYouWant), right? To delete all instances of WhatYouWant, you change the regex to
(s1|s2|s3)|WhatYouWant
For the replacement string, you use $1. What happens here is that for each instance of s1|s2|s3 that is matched, the replacement $1 replaces that instance with itself (referenced by $1). On the other hand, when WhatYouWant is matched, it is replaced by an empty group and nothing else, and is therefore deleted. See this demo, thank you @HamZa and @Jerry for suggesting this wonderful addition.
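Here is a small Python sketch of the deletion trick, under two assumptions: the sample string is invented, and Python 3.5 or later is used (from that version on, a group that did not take part in the match is replaced with an empty string). Python writes the $1 replacement as \1.

```python
import re

# The contexts to keep are now Group 1; the digits to delete are not captured.
pattern = re.compile(r"(?m)(^.*\.$|\([^\)]*\)|if\(.*?//endif)|\b\d+\b")

text = "x=11; f(22); y=33"  # hypothetical sample input

# Contexts are replaced with themselves, bare digits with an empty Group 1.
deleted = pattern.sub(r"\1", text)

print(deleted)  # x=; f(22); y=
```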
Replacements
This brings us to replacements, on which I'll touch briefly.
- When replacing with nothing, see the "Deletions" trick above.
- When replacing, if using Perl or PCRE, use the (*SKIP)(*F) variation mentioned above to match exactly what you want, and do a straight replacement.
- In other flavors, within the replacement function call, inspect the match using a callback or lambda, and replace if Group 1 is set. If you need help with this, the article already referenced will give you code in various languages.
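The callback approach in the last bullet could look roughly like this in Python. The function name double_if_wanted and the choice to double each number are made up for the example; only the regex comes from this answer.

```python
import re

pattern = re.compile(r"(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)")

def double_if_wanted(m):
    # Group 1 is set only when the last alternative (the digits) matched.
    if m.group(1):
        return str(int(m.group(1)) * 2)
    # Otherwise an unwanted context matched: put it back unchanged.
    return m.group(0)

text = "x=11; f(22); y=33"  # hypothetical sample input
replaced = pattern.sub(double_if_wanted, text)

print(replaced)  # x=22; f(22); y=66
```

The 22 inside the parentheses is untouched because the whole (22) is consumed by the second alternative and returned as-is.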
Have fun!
No, wait, there's more!
Ah, nah, I'll save that for my memoirs in twenty volumes, to be released next Spring.
Regex ignore everything between 2 words
How about:
apple.*(?<!\snot)\s+inc(\.|luded)
Javascript REGEX: Ignore anything in between?
You just want to add a wildcard in between your words by using
.* // 0 or more of any character
Quantifiers are greedy by default, so .* first consumes as much as it can and then backtracks until the are that follows it can match.
Note, however, that you need to take out your square brackets, as they currently allow things to match that shouldn't. For example, they would allow hooooowareyou to match, because a character class in square brackets accepts any one of the listed characters, and the + indicates 1 or more.
Try something like:
how.*are.*you
It's unclear if you want all your test cases to pass, if you do then here's an example of this answer in action https://regex101.com/r/aTTU5b/2
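A quick way to sanity-check the pattern outside the browser is a few lines of Python. The test strings here are invented, not taken from the question, so adjust them to your own cases.

```python
import re

pattern = re.compile(r"how.*are.*you")

# Hypothetical test strings mapped to the result of searching each one.
samples = [
    "how are you",
    "how the heck are all of you",
    "hooooowareyou",
    "how you are",
]
results = {s: bool(pattern.search(s)) for s in samples}

for s, ok in results.items():
    print(s, "->", ok)
```

The first two strings match; hooooowareyou does not, because it never contains the literal how, and how you are fails because no you follows are. Keep in mind that . does not match a newline by default.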
A regular expression to exclude a word/string
Here's yet another way (using a negative look-ahead):
^/(?!ignoreme|ignoreme2|ignoremeN)([a-z0-9]+)$
Note: There's only one capturing expression: ([a-z0-9]+)
.
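One thing worth checking with this pattern: the lookahead is not anchored at the end, so it rejects anything that merely starts with one of the listed words. The sketch below shows that behavior next to a stricter variant, which is my own assumption about what may be wanted and not part of the answer. The paths are invented.

```python
import re

# Pattern from the answer: rejects any path that starts with a listed word.
prefix_pattern = re.compile(r"^/(?!ignoreme|ignoreme2|ignoremeN)([a-z0-9]+)$")

# Assumed stricter variant: rejects only exact matches of a listed word.
exact_pattern = re.compile(r"^/(?!(?:ignoreme|ignoreme2|ignoremeN)$)([a-z0-9]+)$")

def captured(pattern, path):
    m = pattern.match(path)
    return m.group(1) if m else None

for path in ["/hello", "/ignoreme", "/ignoreme2", "/ignoremenot"]:
    print(path, captured(prefix_pattern, path), captured(exact_pattern, path))
```

With the original pattern, /ignoremenot is rejected along with /ignoreme; with the stricter variant only the exact words are rejected.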
Match within string, but ignore matches between brackets - Regex and JavaScript
Assuming [...] are balanced and unescaped, you can use a negative lookahead based search:
/out(?![^\[\]]*\])/
(?![^\[\]]*\]) is a negative lookahead that asserts that we don't have a ] following non-[ and non-] characters ahead, thus making sure we're not matching out inside a [...].
Javascript code to build your regex:
var search = "out";
var regex = new RegExp(search + "(?![^\\[\\]]*\\])", "g");
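The same lookahead can be tried in Python. In this sketch the sample text is made up, and re.escape is an addition of mine so that a search term containing regex metacharacters is treated literally.

```python
import re

search = "out"
# Build the pattern from the search term, escaping it first (an addition).
pattern = re.compile(re.escape(search) + r"(?![^\[\]]*\])")

text = "out [out inside] out"  # hypothetical sample input
positions = [m.start() for m in pattern.finditer(text)]

print(positions)  # [0, 17]
```

Only the first and last out are reported; the one inside [...] is skipped because a ] follows it with no [ or ] in between.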
RegExp to ignore everything between <code> and <pre> tags
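// parse() and pattern1 come from the question; testString is the input text.
var result = '', start = 0, co, ce;
while ((co = testString.indexOf('<code', start)) !== -1) {
    // Apply the pattern only to the text before the <code> block.
    result += parse(testString.substring(start, co), pattern1);
    ce = testString.indexOf('</code>', co + 5);
    if (ce === -1) {
        // Unclosed <code>: copy the rest through untouched and stop.
        result += testString.substring(co);
        start = testString.length;
        break;
    }
    // Copy the whole <code>...</code> block through untouched.
    start = ce + 7;
    result += testString.substring(co, start);
}
result += parse(testString.substring(start), pattern1);
console.log(result);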
Regex to ignore data between different types of brackets
You can use a negated character class, [^\)].
We look for an opening square bracket, then match any characters that are not a ), and add \) at the end to match the closing ):
(\[[^\)]+\))
This finds: [search](search)
Using capture groups you can remove these from the result programmatically.
Otherwise I believe you will have to parse the string instead of using one regex.
([^\[]+)(\[[^\)]+\))(.+)
$1$3 will be your new string.
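In Python the replacement is written \1\3 instead of $1$3. A minimal sketch with an invented input string:

```python
import re

# Group 1: text before "[", Group 2: the [..](..) part, Group 3: the rest.
pattern = re.compile(r"([^\[]+)(\[[^\)]+\))(.+)")

text = "start:[search](search):end"  # hypothetical sample input
stripped = pattern.sub(r"\1\3", text)

print(stripped)  # start::end
```

Note that the pattern needs at least one character before the [ and one after the ), so a string that begins or ends with the bracketed part will not match as written.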