What is the fastest way to find the occurrence of a string in another string?
strpos
seems to be in the lead, I've tested it with finding some strings in 'The quick brown fox jumps over the lazy dog'
:
strstr
used 0.48487210273743 seconds for 1000000 iterations finding'quick'
strpos
used 0.40836095809937 seconds for 1000000 iterations finding'quick'
strstr
used 0.45261287689209 seconds for 1000000 iterations finding'dog'
strpos
used 0.39890813827515 seconds for 1000000 iterations finding'dog'
<?php
$haystack = 'The quick brown fox jumps over the lazy dog';
$needle = 'quick';
$iter = 1000000;
$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strstr($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strstr used $duration microseconds for $iter iterations finding 'quick' in 'The quick brown fox jumps over the lazy dog'";
$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strpos($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strpos used $duration microseconds for $iter iterations finding 'quick' in 'The quick brown fox jumps over the lazy dog'";
$needle = 'dog';
$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strstr($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strstr used $duration microseconds for $iter iterations finding 'dog' in 'The quick brown fox jumps over the lazy dog'";
$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strpos($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strpos used $duration microseconds for $iter iterations finding 'dog' in 'The quick brown fox jumps over the lazy dog'";
?>
Fastest way to check a string contain another substring in JavaScript?
You have three possibilites:
Regular expression:
(new RegExp('word')).test(str)
// or
/word/.test(str)indexOf
:str.indexOf('word') !== -1
includes
:str.includes('word')
Regular expressions seem to be faster (at least in Chrome 10).
Performance test - short haystack
Performance test - long haystack
**Update 2011:**
It cannot be said with certainty which method is faster. The differences between the browsers is enormous. While in Chrome 10 indexOf
seems to be faster, in Safari 5, indexOf
is clearly slower than any other method.
You have to see and try for your self. It depends on your needs. For example a case-insensitive search is way faster with regular expressions.
Update 2018:
Just to save people from running the tests themselves, here are the current results for most common browsers, the percentages indicate performance increase over the next fastest result (which varies between browsers):
Chrome: indexOf (~98% faster) <-- wow
Firefox: cached RegExp (~18% faster)
IE11: cached RegExp(~10% faster)
Edge: indexOf (~18% faster)
Safari: cached RegExp(~0.4% faster)
Note that cached RegExp is: var r = new RegExp('simple'); var c = r.test(str);
as opposed to: /simple/.test(str)
Find the Number of Occurrences of a Substring in a String
The last line was creating a problem. lastIndex
would never be at -1, so there would be an infinite loop. This can be fixed by moving the last line of code into the if block.
String str = "helloslkhellodjladfjhello";
String findStr = "hello";
int lastIndex = 0;
int count = 0;
while(lastIndex != -1){
lastIndex = str.indexOf(findStr,lastIndex);
if(lastIndex != -1){
count ++;
lastIndex += findStr.length();
}
}
System.out.println(count);
How to count string occurrence in string?
The g
in the regular expression (short for global) says to search the whole string rather than just find the first occurrence. This matches is
twice:
var temp = "This is a string.";var count = (temp.match(/is/g) || []).length;console.log(count);
What's the fastest way to find all occurences of the substring in large string in python
To be honest, I think you've been doing the right things.
There are a few more tweaks you can make to your code, though. In general, when performance is key, only do the bare minimum in your innermost loops. Looking through your code, there are still some quick optimizations left on this front:
- Use if...elif instead of if...if.
- Don't use lists where you don't have to - e.g. just a single string is sufficient for nts and the reverse.
- Don't evaluate the same result multiple times - e.g. the reverse lookup.
I'm guessing that your problem with multi-processing is down to the serialization of these very large strings, offsetting any performance gain you might have from running in parallel. However, there may be another way to do this - see Parallelizing a Numpy vector operation. I can't verify as I am having difficulty installing numexpr
.
Putting them together and trying out some of the other suggestions in this trail, I get the following results:
$ python test.py
Original serial took 0.08330821593602498 minutes...
Using sets took 0.09072601397832235 minutes...
Using built-ins took 0.061421032746632895 minutes...
Using regex took 0.11649663050969442 minutes...
Optimized serial took 0.05909080108006795 minutes...
Original numpy took 0.04050511916478475 minutes...
Optimized numpy took 0.03438538312911987 minutes...
My code is as follows.
import time
import numpy as np
from random import choice
import re
# Create single large chromosome for the test...
seq = ""
for i in range(10000000):
seq += choice("ATGCN")
seq_d = {"Chromosome1": seq}
rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
nts = 'T'
# Original serial implementation
def serial():
test_d = { '+' : {}, '-' : {}}
for chrom in seq_d:
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
for pos, nt in enumerate(chrom_seq):
if nt in nts:
plus_pos.append(pos)
if rev_comps[nt] in nts:
minus_pos.append(pos)
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
# Optimized for single character tests
def serial2():
test_d = {'+': {}, '-': {}}
rev_nts = rev_comps[nts]
for chrom in seq_d:
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
for pos, nt in enumerate(chrom_seq):
if nt == nts:
plus_pos.append(pos)
elif nt == rev_nts:
minus_pos.append(pos)
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
# Use sets instead of lists
def set_style():
test_d = { '+' : {}, '-' : {}}
for chrom in seq_d:
plus_pos, minus_pos = set(), set()
chrom_seq = seq_d[chrom]
for pos, nt in enumerate(chrom_seq):
if nt in nts:
plus_pos.add(pos)
if rev_comps[nt] in nts:
minus_pos.add(pos)
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
# Use regex to find either nucleotide...
def regex_it():
test_d = { '+' : {}, '-' : {}}
search = re.compile("(T|A)")
for chrom in seq_d:
pos = 0
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
for sub_seq in search.split(chrom_seq):
if len(sub_seq) == 0:
continue
if sub_seq[0] == 'T':
plus_pos.append(pos)
elif sub_seq[0] == 'A':
minus_pos.append(pos)
pos += len(sub_seq)
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
# Use str.find instead of iteration
def use_builtins():
test_d = { '+' : {}, '-' : {}}
for chrom in seq_d:
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
start = 0
while True:
pos = chrom_seq.find("T", start)
if pos == -1:
break
plus_pos.append(pos)
start = pos + 1
start = 0
while True:
pos = chrom_seq.find("A", start)
if pos == -1:
break
minus_pos.append(pos)
start = pos + 1
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
# Original numpy implementation
def numpy1():
test_d = { '+' : {}, '-' : {}}
for chrom in seq_d:
chrom_seq = np.array(list(seq_d[chrom]))
for_nts = list(nts)
rev_nts = [rev_comps[nt] for nt in nts]
plus_pos = list(np.where(np.in1d(chrom_seq, for_nts) == True)[0])
minus_pos = list(np.where(np.in1d(chrom_seq, rev_nts) == True)[0])
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
# Optimized for single character look-ups
def numpy2():
test_d = {'+': {}, '-': {}}
rev_nts = rev_comps[nts]
for chrom in seq_d:
chrom_seq = np.array(list(seq_d[chrom]))
plus_pos = np.where(chrom_seq == nts)
minus_pos = np.where(chrom_seq == rev_nts)
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
for fn, name in [
(serial, "Original serial"),
(set_style, "Using sets"),
(use_builtins, "Using built-ins"),
(regex_it, "Using regex"),
(serial2, "Optimized serial"),
(numpy1, "Original numpy"),
(numpy2, "Optimized numpy")]:
s = time.time()
fn()
e = time.time()
print('{} took {} minutes...'.format(name, (e-s)/60))
Check if multiple strings exist in another string
You can use any
:
a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]
if any(x in a_string for x in matches):
Similarly to check if all the strings from the list are found, use all
instead of any
.
Related Topics
How to Get Results from the Wikipedia API with PHP
How to Create PHP Two Column Table with Values from the Database
Strip HTML Tags and Its Contents
How to Pass Parameters from Bash to PHP Script
Comparing Floats - Same Number, But Does Not Equal
PHP Upload Progress in PHP 5.4 Is Not Working. Session Variables Not Set
Simple PHP Mail Function Not Working on Amazon Server Ec2
Using Pdo Prepared Statement and Incrementing a Column Value
HTML Purifier: Removing an Element Conditionally Based on Its Attributes
Regex for Dropping Http:// and Www. from Urls
Do I Need a Trailing Semicolon Here
Retrieving Get and Post Data Inside Laravel Controller
How to Merge Subarray Values and Remove Duplicates
Confirmation That PHP Static Variables Do Not Persist Across Requests
How to Have Multiple $_Get with the Same Key, Different Values