Count Overlapping Regex Matches in Perl or Ruby

Matching two overlapping patterns with Perl

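The following uses a zero-width lookahead assertion, (?=...). The lookahead captures each match without consuming any characters, so the next attempt can start inside the previous match.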

#!/usr/bin/perl
use strict;
use warnings;

$_ = "betalphabetabeta";

while (/(?=(alpha|beta))/g) {
    print $1, "\n";
}

Prints:

C:\Old_Data\perlp>perl t9.pl
beta
alpha
beta
beta
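
Since the title also asks about Ruby, here is a minimal sketch of the same lookahead idea in Ruby. The method name overlapping_matches is made up for illustration. String#scan returns one array per match when the pattern contains a capture group, hence the flatten.

```ruby
# Collect every alpha/beta occurrence, including overlapping ones.
# The lookahead (?=...) matches without consuming characters, so the next
# search attempt begins one character later instead of after the match.
def overlapping_matches(text)
  text.scan(/(?=(alpha|beta))/).flatten
end

matches = overlapping_matches("betalphabetabeta")
puts matches          # beta, alpha, beta, beta (one per line)
puts matches.length   # 4
```

The count is simply the length of the returned array.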

How do I count regex matches in perl when using multiple possible match targets separated by |?

The problem is that the trailing , is consumed in the ,9, match, so when it starts looking for the next match it starts at 11,12,. There's no leading , before the 11, so it can't match that. I'd recommend using a lookahead like this:

,(4|9|11)(?=,)

This way, the trailing , will not be consumed as part of the match.

For example:

my $string = ",4,8,9,11,12,";

my $test = ",(4|9|11)(?=,)";

my @c = $string =~ m/$test/g;
my $count = @c;
print "count: $count\n";
print "\@c:", join(" ", @c), "\n";

Outputs:

count: 3
@c:4 9 11
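
For comparison, the difference between consuming the trailing comma and only asserting it can be sketched in Ruby. The helper name captured_numbers is hypothetical; the two patterns are the ones discussed above.

```ruby
# Hypothetical helper: collect the captured numbers for a given pattern.
def captured_numbers(text, pattern)
  text.scan(pattern).flatten
end

text = ",4,8,9,11,12,"

# Trailing comma consumed: the search resumes at "11,12," and misses 11.
consuming = captured_numbers(text, /,(4|9|11),/)
# Trailing comma only asserted: every wanted number is found.
lookahead = captured_numbers(text, /,(4|9|11)(?=,)/)

puts "consuming: #{consuming.length} (#{consuming.join(' ')})"  # consuming: 2 (4 9)
puts "lookahead: #{lookahead.length} (#{lookahead.join(' ')})"  # lookahead: 3 (4 9 11)
```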

How to count matches for a named capture group in Perl

EDIT after question update

  • -0777 : reads the whole file at once (the input record separator is undefined)
  • -i : edits the file in place (like sed -i); remove it to avoid modifying the file
  • -p : loops over the input and prints each record after the code has run

The following command should just print the number of matches:

perl -0777 -ne '$cnt=@a=m{('$PASSTHROUGH'(*SKIP)(?!)|'$REPLACE')}pg;print "$cnt\n"'

It works differently:

  • the principle of the alternation is to first match what should be rejected, so that only what we want is kept
  • (*SKIP) is a backtracking control verb: when the engine backtracks into it after a failure, the next match attempt starts at the position where (*SKIP) was reached, instead of one character after the previous start, which is what happens normally
  • (?!) is the same as (*FAIL)
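
Ruby's regex engine does not, as far as I know, support (*SKIP), so the sketch below swaps in a different technique built on the same principle: match what should be ignored first, and capture only what should be counted. The passthrough pattern (double-quoted strings) and the target word foo are invented for illustration; they stand in for $PASSTHROUGH and $REPLACE.

```ruby
# Count occurrences of the target that are NOT inside a passthrough region.
# The passthrough alternative comes first and has no capture group, so
# those matches show up as nil in scan's results and are dropped by compact.
# Assumption: neither pattern contains capture groups of its own.
def count_outside(text, passthrough, target)
  pattern = /#{passthrough}|(#{target})/
  text.scan(pattern).flatten.compact.length
end

sample = 'foo "foo bar" foo "x" foo'
puts count_outside(sample, /"[^"]*"/, /foo/)  # 3
```

The foo inside the quoted string is swallowed by the first alternative and never reaches the capturing one.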

Perl - Extract all regular expression match

This is almost the same question as Count overlapping regex matches in Perl OR Ruby.

This code is nearly unchanged from perldoc perlre, under the section titled "Special Backtracking Control Verbs":

use strict;
use warnings;

my $regex = qr/M?[VI]?A?R?G?D?[LM]?G?[IVMAL]?E?/;
my $text = 'VMVARGDLGVE';

my $count = 0;
$text =~ /$regex(?{print "$&\n"; $count++})(*FAIL)/g;
print "Got $count matches\n";

The script also counts empty-string matches, which gives a total of 97 matches.
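
Ruby has no direct counterpart to the (?{...})(*FAIL) idiom that I know of, so the following is a brute-force sketch of a related idea rather than a port. It tests every non-empty substring against an anchored copy of the pattern. The method name all_substring_matches is hypothetical. Unlike the Perl script, it skips empty matches and lists each (start, substring) pair only once, so its totals are not expected to agree with the count above. Patterns that rely on lookarounds or anchors may also behave differently, because each substring is tested in isolation.

```ruby
# List every non-empty substring of text that the pattern matches in full,
# together with its start offset. Quadratic in the text length.
def all_substring_matches(text, pattern)
  # Anchor a copy of the pattern so it must cover the whole substring.
  anchored = Regexp.new("\\A(?:#{pattern.source})\\z", pattern.options)
  found = []
  (0...text.length).each do |start|
    ((start + 1)..text.length).each do |stop|
      piece = text[start...stop]
      found << [start, piece] if anchored.match?(piece)
    end
  end
  found
end

results = all_substring_matches("aab", /a+b?/)
results.each { |start, piece| puts "#{start}: #{piece}" }
puts "Got #{results.length} matches"  # Got 5 matches
```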

Why is Perl lazy when regex matching with * against a group?

This isn't a matter of greedy or lazy repetition. (?:fj)* is greedily matching as many repetitions of "fj" as it can, but it will successfully match zero repetitions. When you try to match it against the string "f fjfj ff", it will first attempt to match at position zero (before the first "f"). The maximum number of times you can successfully match "fj" at position zero is zero, so the pattern successfully matches the empty string. Since the pattern successfully matched at position zero, we're done, and the engine has no reason to try a match at a later position.

The moral of the story is: don't write a pattern that can match nothing, unless you want it to match nothing.
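
The same behaviour can be reproduced in Ruby, whose engine follows the same leftmost-match rule. This is a small sketch; the helper name first_match is made up for illustration.

```ruby
# Return the offset and text of the first match, or nil if there is none.
def first_match(text, pattern)
  m = pattern.match(text)
  m && [m.begin(0), m[0]]
end

text = "f fjfj ff"
p first_match(text, /(?:fj)*/)  # [0, ""]      zero repetitions succeed at offset 0
p first_match(text, /(?:fj)+/)  # [2, "fjfj"]  requiring one repetition moves the match
```

Switching from * to + forbids the empty match, so the engine keeps scanning until it reaches the first "fj".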

Overlapping matches in R

The standard regmatches does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches<- replacement function that may illustrate this. Observe:

x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"

Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.

I've created a regcapturedmatches() function that I often use for such tasks. For example

x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

The gregexpr call grabs all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
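
For readers coming from the Ruby side of this topic, both R steps can be sketched with String#gsub and String#scan. The method names below are invented for illustration.

```ruby
# Mark each zero-length lookahead match with "~" so the positions are visible.
def mark_lookahead_positions(text)
  text.gsub(/(?=[AC]C)/, "~")
end

# Collect the two-letter capture at each of those positions.
def overlapping_pairs(text)
  text.scan(/(?=([AC]C))/).flatten
end

x = "ACCACCACCAC"
puts mark_lookahead_positions(x)  # ~A~CC~A~CC~A~CC~AC
p overlapping_pairs(x)            # ["AC", "CC", "AC", "CC", "AC", "CC", "AC"]
```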

ruby regexp Skipping Zero Length Matches and nil matches

Here's a regex that captures in group #1 everything after postcode/ if it's present, or else everything after .co.uk/:

\.co\.uk\/(?:postcode\/)?([^\/\n]+(?:\/[^\/\n]+)?)

Note that this will give unexpected results if there are unwanted path elements at the end of a postcode link, such as:

http://www.adresses.co.uk/postcode/rm107jj/oops

UPDATE: Based on the comments, it looks like you want to match just the last path element. But we can't simply capture the second element, because there might be only one:

http://www.adresses.co.uk/west-midlands

We can, however, make the first element optional:

\.co\.uk\/(?:[^\/\n]+\/)?([^\/\n]+)

Notice how I used a non-capturing group for the optional portion, so the part you want is still captured in group #1.
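
A minimal Ruby sketch of applying that pattern to the two example URLs follows. The constant and method names are made up for illustration, and %r{...} is used so the slashes need no escaping.

```ruby
# Optional first path element (non-capturing), then the element we want
# captured as group 1.
LAST_ELEMENT = %r{\.co\.uk/(?:[^/\n]+/)?([^/\n]+)}

# Return the captured path element, or nil when the URL does not match.
def last_path_element(url)
  url[LAST_ELEMENT, 1]
end

puts last_path_element("http://www.adresses.co.uk/postcode/rm107jj")  # rm107jj
puts last_path_element("http://www.adresses.co.uk/west-midlands")     # west-midlands
```

String#[] with a regex and a group index returns nil on a failed match, so callers can check for that directly.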

...


