What Is the Fastest Way to Find the Occurrence of a String in Another String

What is the fastest way to find the occurrence of a string in another string?

strpos seems to be in the lead, I've tested it with finding some strings in 'The quick brown fox jumps over the lazy dog':

  • strstr used 0.48487210273743 seconds for 1000000 iterations finding 'quick'
  • strpos used 0.40836095809937 seconds for 1000000 iterations finding 'quick'
  • strstr used 0.45261287689209 seconds for 1000000 iterations finding 'dog'
  • strpos used 0.39890813827515 seconds for 1000000 iterations finding 'dog'
<?php

$haystack = 'The quick brown fox jumps over the lazy dog';

$needle = 'quick';

$iter = 1000000;

$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strstr($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strstr used $duration microseconds for $iter iterations finding 'quick' in 'The quick brown fox jumps over the lazy dog'";

$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strpos($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strpos used $duration microseconds for $iter iterations finding 'quick' in 'The quick brown fox jumps over the lazy dog'";

$needle = 'dog';

$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strstr($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strstr used $duration microseconds for $iter iterations finding 'dog' in 'The quick brown fox jumps over the lazy dog'";

$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strpos($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strpos used $duration microseconds for $iter iterations finding 'dog' in 'The quick brown fox jumps over the lazy dog'";

?>

Fastest way to check a string contain another substring in JavaScript?

You have three possibilites:

  1. Regular expression:

     (new RegExp('word')).test(str)
    // or
    /word/.test(str)
  2. indexOf:

     str.indexOf('word') !== -1
  3. includes:

     str.includes('word')

Regular expressions seem to be faster (at least in Chrome 10).

Performance test - short haystack

Performance test - long haystack


**Update 2011:**

It cannot be said with certainty which method is faster. The differences between the browsers is enormous. While in Chrome 10 indexOf seems to be faster, in Safari 5, indexOf is clearly slower than any other method.

You have to see and try for your self. It depends on your needs. For example a case-insensitive search is way faster with regular expressions.


Update 2018:

Just to save people from running the tests themselves, here are the current results for most common browsers, the percentages indicate performance increase over the next fastest result (which varies between browsers):

Chrome: indexOf (~98% faster) <-- wow
Firefox: cached RegExp (~18% faster)

IE11: cached RegExp(~10% faster)

Edge: indexOf (~18% faster)

Safari: cached RegExp(~0.4% faster)

Note that cached RegExp is: var r = new RegExp('simple'); var c = r.test(str); as opposed to: /simple/.test(str)

Find the Number of Occurrences of a Substring in a String

The last line was creating a problem. lastIndex would never be at -1, so there would be an infinite loop. This can be fixed by moving the last line of code into the if block.

String str = "helloslkhellodjladfjhello";
String findStr = "hello";
int lastIndex = 0;
int count = 0;

while(lastIndex != -1){

lastIndex = str.indexOf(findStr,lastIndex);

if(lastIndex != -1){
count ++;
lastIndex += findStr.length();
}
}
System.out.println(count);

How to count string occurrence in string?

The g in the regular expression (short for global) says to search the whole string rather than just find the first occurrence. This matches is twice:

var temp = "This is a string.";var count = (temp.match(/is/g) || []).length;console.log(count);

What's the fastest way to find all occurences of the substring in large string in python

To be honest, I think you've been doing the right things.

There are a few more tweaks you can make to your code, though. In general, when performance is key, only do the bare minimum in your innermost loops. Looking through your code, there are still some quick optimizations left on this front:

  1. Use if...elif instead of if...if.
  2. Don't use lists where you don't have to - e.g. just a single string is sufficient for nts and the reverse.
  3. Don't evaluate the same result multiple times - e.g. the reverse lookup.

I'm guessing that your problem with multi-processing is down to the serialization of these very large strings, offsetting any performance gain you might have from running in parallel. However, there may be another way to do this - see Parallelizing a Numpy vector operation. I can't verify as I am having difficulty installing numexpr.

Putting them together and trying out some of the other suggestions in this trail, I get the following results:

$ python test.py
Original serial took 0.08330821593602498 minutes...
Using sets took 0.09072601397832235 minutes...
Using built-ins took 0.061421032746632895 minutes...
Using regex took 0.11649663050969442 minutes...
Optimized serial took 0.05909080108006795 minutes...
Original numpy took 0.04050511916478475 minutes...
Optimized numpy took 0.03438538312911987 minutes...

My code is as follows.

import time
import numpy as np
from random import choice
import re

# Create single large chromosome for the test...
seq = ""
for i in range(10000000):
seq += choice("ATGCN")
seq_d = {"Chromosome1": seq}

rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
nts = 'T'

# Original serial implementation
def serial():
test_d = { '+' : {}, '-' : {}}
for chrom in seq_d:
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
for pos, nt in enumerate(chrom_seq):
if nt in nts:
plus_pos.append(pos)
if rev_comps[nt] in nts:
minus_pos.append(pos)

test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos

# Optimized for single character tests
def serial2():
test_d = {'+': {}, '-': {}}
rev_nts = rev_comps[nts]
for chrom in seq_d:
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
for pos, nt in enumerate(chrom_seq):
if nt == nts:
plus_pos.append(pos)
elif nt == rev_nts:
minus_pos.append(pos)

test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos

# Use sets instead of lists
def set_style():
test_d = { '+' : {}, '-' : {}}
for chrom in seq_d:
plus_pos, minus_pos = set(), set()
chrom_seq = seq_d[chrom]
for pos, nt in enumerate(chrom_seq):
if nt in nts:
plus_pos.add(pos)
if rev_comps[nt] in nts:
minus_pos.add(pos)

test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos

# Use regex to find either nucleotide...
def regex_it():
test_d = { '+' : {}, '-' : {}}
search = re.compile("(T|A)")
for chrom in seq_d:
pos = 0
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
for sub_seq in search.split(chrom_seq):
if len(sub_seq) == 0:
continue
if sub_seq[0] == 'T':
plus_pos.append(pos)
elif sub_seq[0] == 'A':
minus_pos.append(pos)
pos += len(sub_seq)

test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos

# Use str.find instead of iteration
def use_builtins():
test_d = { '+' : {}, '-' : {}}
for chrom in seq_d:
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
start = 0
while True:
pos = chrom_seq.find("T", start)
if pos == -1:
break
plus_pos.append(pos)
start = pos + 1

start = 0
while True:
pos = chrom_seq.find("A", start)
if pos == -1:
break
minus_pos.append(pos)
start = pos + 1

test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos

# Original numpy implementation
def numpy1():
test_d = { '+' : {}, '-' : {}}
for chrom in seq_d:
chrom_seq = np.array(list(seq_d[chrom]))
for_nts = list(nts)
rev_nts = [rev_comps[nt] for nt in nts]
plus_pos = list(np.where(np.in1d(chrom_seq, for_nts) == True)[0])
minus_pos = list(np.where(np.in1d(chrom_seq, rev_nts) == True)[0])

test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos

# Optimized for single character look-ups
def numpy2():
test_d = {'+': {}, '-': {}}
rev_nts = rev_comps[nts]
for chrom in seq_d:
chrom_seq = np.array(list(seq_d[chrom]))
plus_pos = np.where(chrom_seq == nts)
minus_pos = np.where(chrom_seq == rev_nts)

test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos

for fn, name in [
(serial, "Original serial"),
(set_style, "Using sets"),
(use_builtins, "Using built-ins"),
(regex_it, "Using regex"),
(serial2, "Optimized serial"),
(numpy1, "Original numpy"),
(numpy2, "Optimized numpy")]:
s = time.time()
fn()
e = time.time()
print('{} took {} minutes...'.format(name, (e-s)/60))

Check if multiple strings exist in another string

You can use any:

a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]

if any(x in a_string for x in matches):

Similarly to check if all the strings from the list are found, use all instead of any.



Related Topics



Leave a reply



Submit