Regexp finding longest common prefix of two strings
If there's some character that neither string contains —, say, \0
— you could write
"$first\0$second" =~ m/^(.*).*\0\1/s;
and the longest common prefix would be saved as $1
.
Edited to add: This is obviously very inefficient. I think that if efficiency is a concern, then this simply isn't the approach we should be using; but we can at least improve it by changing .*
to [^\0]*
to prevent useless greediness that will just have to be backtracked again, and wrapping the second [^\0]*
in (?>…)
to prevent backtracking that can't help. This:
"$first\0$second" =~ m/^([^\0]*)(?>[^\0]*)\0\1/s;
This will yield the same result, but much more efficiently. (But still not nearly as efficiently as a straightforward non–regex-based approach. If the strings both have length n, I'd expect its worst case to take at least O(n2) time, whereas the straightforward non–regex-based approach would take O(n) time in its worst case.)
Perl - longest common prefix of 2 or more strings?
One way would be to store the information in a hash. In this example, I set the hash key to the length of each prefix, and the value being the actual prefix found.
Note that this method overwrites a key and value if a same-length prefix exists, so you'll always get the last prefix found of the longest length (sort()
takes care of finding the longest one).
The regex says "find the first character in the string and capture it, and use that char found in a second capture, and capture as many as there are". This string is then join()
ed into a scalar and put into the hash.
use warnings;
use strict;
my %prefixes;
while (<DATA>){
my $prefix = join '', /^(.)(\1+)/;
$prefixes{length $prefix} = $prefix;
}
my $longest = (sort {$b <=> $a} keys %prefixes)[0];
print "$prefixes{$longest}\n";
__DATA__
aaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
Output:
aaa
Regex for longest matching sequence between two strings
You need to
- Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)
- Make sure the delimiters are matches at the line start positions only
- Make sure the
.
matches the line break chars by usingre.DOTALL
or equivalent options (see Python regex, matching pattern over multiple lines) - Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)
- Find all matches in the text (see How can I find all matches to a regular expression in Python?)
- Get the longest one (see Python's most efficient way to choose longest string in list?).
Python demo:
import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))
Output:
Item 1a ....
....
....
....
....
Item 2
The regex looks like
(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))
With word boundaries
(?sm)(?=^((?:Item 1|Item 1a|Item 1b)\b.*?^(?:Item 2|Item 2a|Item 2b)\b))
See the regex demo.
Details
(?sm)
-re.S
andre.M
flags(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))
- a positive lookahead that matches at any location that is immediately followed with a sequence of patterns:^
- start of a line((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))
- Group 1 (this value is returned withre.findall
)(?:Item 1|Item 1a|Item 1b)
- any of the items in the alternation (probably, it makes sense to add\b
word boundary after)
here).*?
- any 0+ chars, as few as possible^
- start of a line(?:Item 2|Item 2a|Item 2b)
- any alternative from the list (probably, it also makes sense to add\b
word boundary after)
here).
Longest common prefix of two strings in bash
In sed, assuming the strings don't contain any newline characters:
string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
Finding length of common prefix in two strings
Like this, perhaps?
It's written in Perl
use strict;
use warnings 'all';
my $prev = "";
while ( my $line = <DATA> ) {
chomp $line;
my $max = 0;
++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);
printf "%-2d %s\n", $max-1, $line;
$prev = $line;
}
__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/
output
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/[Finished in 0.1s]
How to find the longest common suffix prefix between two strings in python in a pythonic way possibly using library functions?
The logic in your algorithm is about as straightforward as you can get, but you can definitely compactify the notation. For example, checking a prefix of size n
against a suffix of size n
is simply:
s1[-n:] == s2[:n]
The ternary operator you use to check string lengths is
min(len(s1), len(s2))
A range can go backwards by itself. The reverse of range(x)
is
range(x - 1, -1, -1)
You can create an iterator that checks this for each decreasing value of n
and return the first non-zero result. Luckily, next
accepts a second argument that represents the default if the iterator is empty:
common = next((s2[:n] for n in range(min(len(s1), len(s2)) - 1, -1, -1) if s1[-n:] == s2[:n]), '')
That's the obligatory one-liner. A more legible solution might be:
def common_fix(s1, s2):
steps = range(min(len(s1), len(s2)) - 1, -1, -1)
return next((s2[:n] for n in steps if s1[-n:] == s2[:n]), '')
As a rule, keep your functionally and printing separate. Get a value, then process it (whether by printing or something else)
Related Topics
How to Embed HTML into Ipython Output
Checking If Element Exists With Python Selenium
What's the Easiest Way to Escape HTML in Python
Application Not Picking Up .Css File (Flask/Python)
How to Terminate Process from Python Using Pid
How to Get Pid by Process Name
Python Ftp Get the Most Recent File by Date
Python Subprocess.Popen "Oserror: [Errno 12] Cannot Allocate Memory"
Fatal Error: Python.H: No Such File or Directory
How to Sort a List/Tuple of Lists/Tuples by the Element At a Given Index
Check If Multiple Strings Exist in Another String
Passing HTML to Template Using Flask/Jinja2
Selenium Using Python - Geckodriver Executable Needs to Be in Path
Tkinter Understanding Mainloop
How to Match a Whole Word With a Regular Expression
What Does "Syntaxerror: Missing Parentheses in Call to 'Print'" Mean in Python
Flask Raises Templatenotfound Error Even Though Template File Exists