Regexp Finding Longest Common Prefix of Two Strings

Regexp finding longest common prefix of two strings

If there's some character that neither string contains —, say, \0 — you could write

"$first\0$second" =~ m/^(.*).*\0\1/s;

and the longest common prefix would be saved as $1.


Edited to add: This is obviously very inefficient. I think that if efficiency is a concern, then this simply isn't the approach we should be using; but we can at least improve it by changing .* to [^\0]* to prevent useless greediness that will just have to be backtracked again, and wrapping the second [^\0]* in (?>…) to prevent backtracking that can't help. This:

"$first\0$second" =~ m/^([^\0]*)(?>[^\0]*)\0\1/s;

This will yield the same result, but much more efficiently. (But still not nearly as efficiently as a straightforward non–regex-based approach. If the strings both have length n, I'd expect its worst case to take at least O(n2) time, whereas the straightforward non–regex-based approach would take O(n) time in its worst case.)

Perl - longest common prefix of 2 or more strings?

One way would be to store the information in a hash. In this example, I set the hash key to the length of each prefix, and the value being the actual prefix found.

Note that this method overwrites a key and value if a same-length prefix exists, so you'll always get the last prefix found of the longest length (sort() takes care of finding the longest one).

The regex says "find the first character in the string and capture it, and use that char found in a second capture, and capture as many as there are". This string is then join()ed into a scalar and put into the hash.

use warnings;
use strict;

my %prefixes;

while (<DATA>){
my $prefix = join '', /^(.)(\1+)/;
$prefixes{length $prefix} = $prefix;
}

my $longest = (sort {$b <=> $a} keys %prefixes)[0];
print "$prefixes{$longest}\n";

__DATA__
aaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG

Output:

aaa

Regex for longest matching sequence between two strings

You need to

  • Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)
  • Make sure the delimiters are matches at the line start positions only
  • Make sure the . matches the line break chars by using re.DOTALL or equivalent options (see Python regex, matching pattern over multiple lines)
  • Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)
  • Find all matches in the text (see How can I find all matches to a regular expression in Python?)
  • Get the longest one (see Python's most efficient way to choose longest string in list?).

Python demo:

import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))

Output:

Item 1a ....
....
....
....
....
Item 2

The regex looks like

(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))

With word boundaries

(?sm)(?=^((?:Item 1|Item 1a|Item 1b)\b.*?^(?:Item 2|Item 2a|Item 2b)\b))

See the regex demo.

Details

  • (?sm) - re.S and re.M flags
  • (?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location that is immediately followed with a sequence of patterns:

    • ^ - start of a line
    • ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this value is returned with re.findall)
    • (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (probably, it makes sense to add \b word boundary after ) here)
    • .*? - any 0+ chars, as few as possible
    • ^ - start of a line
    • (?:Item 2|Item 2a|Item 2b) - any alternative from the list (probably, it also makes sense to add \b word boundary after ) here).

Longest common prefix of two strings in bash

In sed, assuming the strings don't contain any newline characters:

string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'

Finding length of common prefix in two strings

Like this, perhaps?

It's written in Perl

use strict;
use warnings 'all';

my $prev = "";

while ( my $line = <DATA> ) {

chomp $line;

my $max = 0;
++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);

printf "%-2d %s\n", $max-1, $line;

$prev = $line;
}

__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

output

0   #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/[Finished in 0.1s]

How to find the longest common suffix prefix between two strings in python in a pythonic way possibly using library functions?

The logic in your algorithm is about as straightforward as you can get, but you can definitely compactify the notation. For example, checking a prefix of size n against a suffix of size n is simply:

s1[-n:] == s2[:n]

The ternary operator you use to check string lengths is

min(len(s1), len(s2))

A range can go backwards by itself. The reverse of range(x) is

range(x - 1, -1, -1)

You can create an iterator that checks this for each decreasing value of n and return the first non-zero result. Luckily, next accepts a second argument that represents the default if the iterator is empty:

common = next((s2[:n] for n in range(min(len(s1), len(s2)) - 1, -1, -1) if s1[-n:] == s2[:n]), '')

That's the obligatory one-liner. A more legible solution might be:

def common_fix(s1, s2):
steps = range(min(len(s1), len(s2)) - 1, -1, -1)
return next((s2[:n] for n in steps if s1[-n:] == s2[:n]), '')

As a rule, keep your functionally and printing separate. Get a value, then process it (whether by printing or something else)



Related Topics



Leave a reply



Submit