Fast Way to Find Difference between Two Strings of Equal Length in Perl
Stringwise ^ is your friend:
use strict;
use warnings;
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";
my $mask = $s1 ^ $s2;
while ($mask =~ /[^\0]/g) {
print substr($s1,$-[0],1), ' ', substr($s2,$-[0],1), ' ', $-[0], "\n";
}
EXPLANATION:
The ^
(exclusive or) operator, when used on strings, returns a string composed of the result of an exclusive or on each bit of the numeric value of each character. Breaking down an example into equivalent code:
"AB" ^ "ab"
( "A" ^ "a" ) . ( "B" ^ "b" )
chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
chr( 65 ^ 97 ) . chr( 66 ^ 98 )
chr(32) . chr(32)
" " . " "
" "
The useful feature of this here is that a nul character ("\0"
) occurs when and only when the two strings have the same character at a given position. So ^
can be used to efficiently compare every character of the two strings in one quick operation, and the result can be searched for non-nul characters (indicating a difference). The search can be repeated using the /g regex flag in scalar context, and the position of each character difference found using $-[0]
, which gives the offset of the beginning of the last successful match.
How do I compare two strings in Perl?
See perldoc perlop. Use lt
, gt
, eq
, ne
, and cmp
as appropriate for string comparisons:
Binary
eq
returns true if the left argument is stringwise equal to the right argument.Binary
ne
returns true if the left argument is stringwise not equal to the right argument.Binary
cmp
returns -1, 0, or 1 depending on whether the left argument is stringwise less than, equal to, or greater than the right argument.Binary
~~
does a smartmatch between its arguments. ...
lt
,le
,ge
,gt
andcmp
use the collation (sort) order specified by the current locale if a legacy use locale (but notuse locale ':not_characters'
) is in effect. See perllocale. Do not mix these with Unicode, only with legacy binary encodings. The standard Unicode::Collate and Unicode::Collate::Locale modules offer much more powerful solutions to collation issues.
Fastest Way To Find Mismatch Positions Between Two Strings of the Same Length
Inline::C
The computation is easy, do it with Inline::C
(read perldoc Inline::C-Cookbook and perldoc Inline::C for documentation):
use Inline C => << '...';
void find_diffs(char* x, char* y) {
int i;
Inline_Stack_Vars;
Inline_Stack_Reset;
for(i=0; x[i] && y[i]; ++i) {
if(x[i] != y[i]) {
Inline_Stack_Push(sv_2mortal(newSViv(i)));
}
}
Inline_Stack_Done;
}
...
@diffs= find_diffs("ATTCCGGG","ATTGCGGG"); print "@diffs\n";
@diffs= find_diffs("ATTCCGGG","ATACCGGC"); print "@diffs\n";
Here is the output of this script:
> script.pl
3
2 7
PDL
If you want to process a lot of data fast in Perl, learn PDL (Documentation):
use PDL;
use PDL::Char;
$PDL::SHARE=$PDL::SHARE; # keep stray warning quiet
my $source=PDL::Char->new("ATTCCGGG");
for my $str ( "ATTGCGGG", "ATACCGGC") {
my $match =PDL::Char->new($str);
my @diff=which($match!=$source)->list;
print "@diff\n";
}
(Same output as first script.)
Notes: I used PDL very happily in genomic data processing. Together with memory mapped access to data stored on the disk, huge amounts of data can be processed quickly: all processing is done in highly optimized C loops. Also, you can easily access the same data through Inline::C for any features missing in PDL.
Note however, that the creation of one PDL vector is quite slow (constant time, it's acceptable for large data structures). So, you rather want to create one large PDL object with all your input data in one go than looping over individual data elements.
Fast Way to Find Difference between Two Unicode Strings in Perl vol 2
Encode the string using a fixed-width encoding.
my $s1 = encode('UTF-32', $original_string_1);
my $s2 = encode('UTF-32', $original_string_2);
my $mask = $s1 ^ $s2;
while ($mask =~ /\G(?:\0{4})*+(.{4})/sg) {
my $pos = $-[1] / 4;
printf "%d %s %s\n",
$pos,
substr($original_string_1, $pos, 1),
substr($original_string_2, $pos, 1);
}
Of course, this still has the problems of using XOR (has problems with insertions and deletions, as opposed to replacements). Algorithm::Diff provides something that "resynchronizes".
Also, be wary that not all code points are suitable for printing on their own. Control characters, continuation marks, and other code points should probably not printed out directly.
Compare two strings and highlight mismatch characters in Perl
Use:
use strict;
use warnings;
my $string1 = 'AAABBBBBCCCCCDDDDD';
my $string2 = 'AEABBBBBCCECCDDDDD';
my $result = '';
for(0 .. length($string1)) {
my $char = substr($string2, $_, 1);
if($char ne substr($string1, $_, 1)) {
$result .= "**$char**";
} else {
$result .= $char;
}
}
print $result;
It prints A**E**ABBBBBCC**E**CCDDDDD
and was tested somewhat, but it may contain errors.
perl count mismatch between two strings
Just XOR the two strings together. Each NUL character in the result represents a position where the characters are the same in both strings.
my ($s1, $s2) = qw( ATCG ATTG );
my $count = ( $s1 ^ $s2 ) =~ tr/\0//c;
print "$count\n"; # Prints "1"
Note: If you're going to repeatedly compare a string, pass it and the one to which you are comparing it to utf8::downgrade
to makes sure the ^
is as fast as it can be.
utf8::downgrade($s1); # Change the internal format in which
utf8::downgrade($s2); # the strings are stored to speed up $s1^$s2.
This is useless/wasteful if either string contains UNICODE chars above U+00FF.
Compare two strings regardless of case size in perl
yes - use uc()
(upper-case function; see http://perldoc.perl.org/functions/uc.html )
$ perl -e 'print uc("steve") eq uc("STevE"); print "\n";'
1
$ perl -e 'print uc("SHOE") eq uc("shoe"); print "\n";'
1
$ perl5.8 -e 'print uc("SHOE") eq uc("shoe1"); print "\n";'
$
You can obviously use lc()
as well.
If you want the actual "eq" operator to be case insensitive, it might be possible using overloads but I don't think that's what you are asking for - please clarify your question if that's the case. Nor is it a great idea if you do want that, IMHO - too fragile and leads to major possible hard to trace and debug bugs.
Also, it's an overkill in your specific case where you just want equality, but Perl regular expressions also have case-independent modifyer "i"
Finding length of common prefix in two strings
Like this, perhaps?
It's written in Perl
use strict;
use warnings 'all';
my $prev = "";
while ( my $line = <DATA> ) {
chomp $line;
my $max = 0;
++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);
printf "%-2d %s\n", $max-1, $line;
$prev = $line;
}
__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/
output
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/[Finished in 0.1s]
Character match count between strings in Perl
Exclusive or, then count the null characters (where the strings were the same):
my $string1 = "stranger";
my $string2 = "strangem";
my $count = ( lc $string1 ^ lc $string2 ) =~ tr/\0//;
print "$count\n";
I missed the "case in-sensitive" bit.
Related Topics
Can't Build 32Bit Wine on 64Bit Linux
Linux: Create Random Directory/File Hierarchy
Bash Script to Get All Ip Addresses
What Is This $Path in Linux and How to Modify It
Redirecting Man Page Output to File Results in Double Letters in Words
Difference Between Completion Variables and Semaphores
Elf Program Header Virtual Address and File Offset
Using Perf to Monitor Raw Event Counters
How to Pass All Arguments with Xargs in Middle of Command in Linux
Determine the Os Version, Linux and Windows from Powershell
Can an Interrupt Handler Be Preempted
How to Catch the L3-Cache Hits and Misses by Perf Tool in Linux
Can Malloc_Trim() Release Memory from the Middle of the Heap
Let Non-Root User Write to Linux Host in Docker
Errors in Make File:*** Missing Separator. Stop
How to Flush Cache of Hard-Disk and Flash-Disk (Or Filesystem) from Command Line