Fast Way to Find Difference Between Two Strings of Equal Length in Perl

Fast Way to Find Difference between Two Strings of Equal Length in Perl

Stringwise ^ is your friend:

use strict;
use warnings;
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";

my $mask = $s1 ^ $s2;
while ($mask =~ /[^\0]/g) {
print substr($s1,$-[0],1), ' ', substr($s2,$-[0],1), ' ', $-[0], "\n";
}

EXPLANATION:

The ^ (exclusive or) operator, when used on strings, returns a string composed of the result of an exclusive or on each bit of the numeric value of each character. Breaking down an example into equivalent code:

"AB" ^ "ab"
( "A" ^ "a" ) . ( "B" ^ "b" )
chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
chr( 65 ^ 97 ) . chr( 66 ^ 98 )
chr(32) . chr(32)
" " . " "
" "

The useful feature of this here is that a nul character ("\0") occurs when and only when the two strings have the same character at a given position. So ^ can be used to efficiently compare every character of the two strings in one quick operation, and the result can be searched for non-nul characters (indicating a difference). The search can be repeated using the /g regex flag in scalar context, and the position of each character difference found using $-[0], which gives the offset of the beginning of the last successful match.

How do I compare two strings in Perl?

See perldoc perlop. Use lt, gt, eq, ne, and cmp as appropriate for string comparisons:

Binary eq returns true if the left argument is stringwise equal to the right argument.

Binary ne returns true if the left argument is stringwise not equal to the right argument.

Binary cmp returns -1, 0, or 1 depending on whether the left argument is stringwise less than, equal to, or greater than the right argument.

Binary ~~ does a smartmatch between its arguments. ...

lt, le, ge, gt and cmp use the collation (sort) order specified by the current locale if a legacy use locale (but not use locale ':not_characters') is in effect. See perllocale. Do not mix these with Unicode, only with legacy binary encodings. The standard Unicode::Collate and Unicode::Collate::Locale modules offer much more powerful solutions to collation issues.

Fastest Way To Find Mismatch Positions Between Two Strings of the Same Length


Inline::C



The computation is easy, do it with Inline::C
(read perldoc Inline::C-Cookbook and perldoc Inline::C for documentation):

use Inline C => << '...';                                                       
void find_diffs(char* x, char* y) {
int i;
Inline_Stack_Vars;
Inline_Stack_Reset;
for(i=0; x[i] && y[i]; ++i) {
if(x[i] != y[i]) {
Inline_Stack_Push(sv_2mortal(newSViv(i)));
}
}
Inline_Stack_Done;
}
...

@diffs= find_diffs("ATTCCGGG","ATTGCGGG"); print "@diffs\n";
@diffs= find_diffs("ATTCCGGG","ATACCGGC"); print "@diffs\n";

Here is the output of this script:

> script.pl 
3
2 7

PDL

If you want to process a lot of data fast in Perl, learn PDL (Documentation):

use PDL; 
use PDL::Char;
$PDL::SHARE=$PDL::SHARE; # keep stray warning quiet

my $source=PDL::Char->new("ATTCCGGG");
for my $str ( "ATTGCGGG", "ATACCGGC") {
my $match =PDL::Char->new($str);
my @diff=which($match!=$source)->list;
print "@diff\n";
}

(Same output as first script.)

Notes: I used PDL very happily in genomic data processing. Together with memory mapped access to data stored on the disk, huge amounts of data can be processed quickly: all processing is done in highly optimized C loops. Also, you can easily access the same data through Inline::C for any features missing in PDL.

Note however, that the creation of one PDL vector is quite slow (constant time, it's acceptable for large data structures). So, you rather want to create one large PDL object with all your input data in one go than looping over individual data elements.

Fast Way to Find Difference between Two Unicode Strings in Perl vol 2

Encode the string using a fixed-width encoding.

my $s1 = encode('UTF-32', $original_string_1);
my $s2 = encode('UTF-32', $original_string_2);

my $mask = $s1 ^ $s2;
while ($mask =~ /\G(?:\0{4})*+(.{4})/sg) {
my $pos = $-[1] / 4;
printf "%d %s %s\n",
$pos,
substr($original_string_1, $pos, 1),
substr($original_string_2, $pos, 1);
}

Of course, this still has the problems of using XOR (has problems with insertions and deletions, as opposed to replacements). Algorithm::Diff provides something that "resynchronizes".

Also, be wary that not all code points are suitable for printing on their own. Control characters, continuation marks, and other code points should probably not printed out directly.

Compare two strings and highlight mismatch characters in Perl

Use:

use strict;
use warnings;

my $string1 = 'AAABBBBBCCCCCDDDDD';
my $string2 = 'AEABBBBBCCECCDDDDD';
my $result = '';
for(0 .. length($string1)) {
my $char = substr($string2, $_, 1);
if($char ne substr($string1, $_, 1)) {
$result .= "**$char**";
} else {
$result .= $char;
}
}
print $result;

It prints A**E**ABBBBBCC**E**CCDDDDD and was tested somewhat, but it may contain errors.

perl count mismatch between two strings

Just XOR the two strings together. Each NUL character in the result represents a position where the characters are the same in both strings.

my ($s1, $s2) = qw( ATCG ATTG );

my $count = ( $s1 ^ $s2 ) =~ tr/\0//c;

print "$count\n"; # Prints "1"

Note: If you're going to repeatedly compare a string, pass it and the one to which you are comparing it to utf8::downgrade to makes sure the ^ is as fast as it can be.

utf8::downgrade($s1);  # Change the internal format in which
utf8::downgrade($s2); # the strings are stored to speed up $s1^$s2.

This is useless/wasteful if either string contains UNICODE chars above U+00FF.

Compare two strings regardless of case size in perl

yes - use uc() (upper-case function; see http://perldoc.perl.org/functions/uc.html )

$ perl -e 'print uc("steve") eq uc("STevE"); print "\n";'
1
$ perl -e 'print uc("SHOE") eq uc("shoe"); print "\n";'
1
$ perl5.8 -e 'print uc("SHOE") eq uc("shoe1"); print "\n";'

$

You can obviously use lc() as well.

If you want the actual "eq" operator to be case insensitive, it might be possible using overloads but I don't think that's what you are asking for - please clarify your question if that's the case. Nor is it a great idea if you do want that, IMHO - too fragile and leads to major possible hard to trace and debug bugs.

Also, it's an overkill in your specific case where you just want equality, but Perl regular expressions also have case-independent modifyer "i"

Finding length of common prefix in two strings

Like this, perhaps?

It's written in Perl

use strict;
use warnings 'all';

my $prev = "";

while ( my $line = <DATA> ) {

chomp $line;

my $max = 0;
++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);

printf "%-2d %s\n", $max-1, $line;

$prev = $line;
}

__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

output

0   #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/[Finished in 0.1s]

Character match count between strings in Perl

Exclusive or, then count the null characters (where the strings were the same):

my $string1 = "stranger";
my $string2 = "strangem";
my $count = ( lc $string1 ^ lc $string2 ) =~ tr/\0//;

print "$count\n";

I missed the "case in-sensitive" bit.



Related Topics



Leave a reply



Submit