What's the Fastest Way to Check If a Word from One String Is in Another String

Fastest way to check if a string contains a string from a list

For this I'd suggest firstly tokenize the string with RegexpTokenizer to remove all special characters and then use sets to find the intersection:

from nltk.tokenize import RegexpTokenizer
test_string = "Hello! This is a test. I love to eat apples."

tokenizer = RegexpTokenizer(r'\w+')
test_set = set(tokenizer.tokenize(test_string))
# {'Hello', 'I', 'This', 'a', 'apples', 'eat', 'is', 'love', 'test', 'to'}

Having tokenized the string and constructed a set find the set.intersection:

set(['apples', 'oranges', 'bananas']) & test_set
# {'apples'}

Fastest way to check a string contain another substring in JavaScript?

You have three possibilites:

  1. Regular expression:

     (new RegExp('word')).test(str)
    // or
    /word/.test(str)
  2. indexOf:

     str.indexOf('word') !== -1
  3. includes:

     str.includes('word')

Regular expressions seem to be faster (at least in Chrome 10).

Performance test - short haystack

Performance test - long haystack


**Update 2011:**

It cannot be said with certainty which method is faster. The differences between the browsers is enormous. While in Chrome 10 indexOf seems to be faster, in Safari 5, indexOf is clearly slower than any other method.

You have to see and try for your self. It depends on your needs. For example a case-insensitive search is way faster with regular expressions.


Update 2018:

Just to save people from running the tests themselves, here are the current results for most common browsers, the percentages indicate performance increase over the next fastest result (which varies between browsers):

Chrome: indexOf (~98% faster) <-- wow
Firefox: cached RegExp (~18% faster)

IE11: cached RegExp(~10% faster)

Edge: indexOf (~18% faster)

Safari: cached RegExp(~0.4% faster)

Note that cached RegExp is: var r = new RegExp('simple'); var c = r.test(str); as opposed to: /simple/.test(str)

What is efficient way to check if current word is close to a word in string?

There are a lot of ways to approach this. This one solves all of your examples. I added a minimum similarity filter to return only the higher quality matches. This is what allows the 'ly' to be dropped in the last sample, as it is not all that close any any of the words.

Documentation

You can install levenshtein with pip install python-Levenshtein

import Levenshtein

def find_match(str1,str2):
min_similarity = .75
output = []
results = [[Levenshtein.jaro_winkler(x,y) for x in str1.split()] for y in str2.split()]
for x in results:
if max(x) >= min_similarity:
output.append(str1.split()[x.index(max(x))])
return output

Each sample you proposed.

find_match("is looking good", "looks goo")

['looking','good']

find_match("you are really looking good", "lok goo")

['looking','good']

find_match("Stu is actually SEVERLY sunburnt....it hurts!!!", "hurts!!")

['hurts!!!']

find_match("you guys were absolutely amazing tonight, a...", "ly amazin")

['amazing']

What's the fastest way to check if a word from one string is in another string?

If the list of bad words gets huge, a hash is a lot faster:

    require 'benchmark'

bad = ('aaa'..'zzz').to_a # 17576 words
str= "What's the fasted way to check if any word from the bad string is within my "
str += "comparison string, and what's the fastest way to remove said word if it's "
str += "found"
str *= 10

badex = /\b(#{bad.join('|')})\b/i

bad_hash = {}
bad.each{|w| bad_hash[w] = true}

n = 10
Benchmark.bm(10) do |x|

x.report('regex:') {n.times do
str.gsub(badex,'').squeeze(' ')
end}

x.report('hash:') {n.times do
str.gsub(/\b\w+\b/){|word| bad_hash[word] ? '': word}.squeeze(' ')
end}

end
user system total real
regex: 10.485000 0.000000 10.485000 ( 13.312500)
hash: 0.000000 0.000000 0.000000 ( 0.000000)

fastest way to check if all elements of a list of strings is in a string

Thanks to BlackBear for pointing out that my timings were skewed
because of the re-computation of loop invariants. On moving them out, things change, drastically.

There are two ways of doing this. The sane way, and the regex way. First, the setup.

string = "My name is Andrew, I am pretty awesome"
choices = [['andrew', 'name', 'awesome'], ['andrew', 'designation', 'awesome']]

Option 1
This one performs an in substring check inside a list comprehension. The in check runs on a modified implementation of the Boyer-Moore algorithm in C, and is very fast.

>>> [c for c in choices if all(y in string.lower() for y in c)]
[['andrew', 'name', 'awesome']]

And now, for the timings. But first, a minor performance nitpick; you can cache the value of string.lower() outside the loop, it's an invariant and doesn't need to be re-computed each time -

v = string.lower()
%timeit [c for c in choices if all(y in v for y in c)]
1000000 loops, best of 3: 2.05 µs per loop

Option 2
This one uses re.split + set.issuperset;

>>> import re
>>> [c for c in choices if set(re.split('\W', string.lower())).issuperset(c)]
[['andrew', 'name', 'awesome']]

The use of re.split cannot be avoided, if you want to perform set checks, because of punctuation in your sentences.

Again, the set computation is a loop invariant, and can be moved out. This is how it does -

v = set(re.split('\W', string.lower()))
%timeit [c for c in choices if v.issuperset(c)]
1000000 loops, best of 3: 1.13 µs per loop

This is an exceptional case where I find regular expressions performing marginally faster. However, these timings are not conclusive, because they vastly differ by the data's size and structure. I'd recommend trying things out with your own data before drawing any conclusions, although my gut feeling is that the regex solution would scale poorly.

What is the fastest way to find the occurrence of a string in another string?

strpos seems to be in the lead, I've tested it with finding some strings in 'The quick brown fox jumps over the lazy dog':

  • strstr used 0.48487210273743 seconds for 1000000 iterations finding 'quick'
  • strpos used 0.40836095809937 seconds for 1000000 iterations finding 'quick'
  • strstr used 0.45261287689209 seconds for 1000000 iterations finding 'dog'
  • strpos used 0.39890813827515 seconds for 1000000 iterations finding 'dog'
<?php

$haystack = 'The quick brown fox jumps over the lazy dog';

$needle = 'quick';

$iter = 1000000;

$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strstr($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strstr used $duration microseconds for $iter iterations finding 'quick' in 'The quick brown fox jumps over the lazy dog'";

$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strpos($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strpos used $duration microseconds for $iter iterations finding 'quick' in 'The quick brown fox jumps over the lazy dog'";

$needle = 'dog';

$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strstr($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strstr used $duration microseconds for $iter iterations finding 'dog' in 'The quick brown fox jumps over the lazy dog'";

$start = microtime(true);
for ($i = 0; $i < $iter; $i++) {
strpos($haystack, $needle);
}
$duration = microtime(true) - $start;
echo "<br/>strpos used $duration microseconds for $iter iterations finding 'dog' in 'The quick brown fox jumps over the lazy dog'";

?>

Fastest way to check whether a string is a substring in a list of strings

You could look into turning your list of names into one regular expression. Take for example this tiny list of names:

names = ['AARON',
'ABDUL',
'ABE',
'ABEL',
'ABRAHAM',
'ABRAM',
'ADALBERTO',
'ADAM',
'ADAN',
'ADOLFO',
'ADOLPH',
'ADRIAN',
]

That could be represented with the following regular expression:

\b(?:AARON|ABDUL|ABE|ABEL|ABRAHAM|ABRAM|ADALBERTO|ADAM|ADAN|ADOLFO|ADOLPH|ADRIAN)\b

But that would not be very efficient. A regular expression that is built like a tree will work better:

\b(?:A(?:B(?:E(?:|L)|RA(?:M|HAM)|DUL)|D(?:A(?:M|N|LBERTO)|OL(?:FO|PH)|RIAN)|ARON))\b

You could then automate the production of this regular expression -- possibly by first creating a dict-tree structure from the list of names, and then translating that tree into a regular expression. For the above example, that intermediate tree would look like this:

{
'A': {
'A': {
'R': {
'O': {
'N': {
'': {}
}
}
}
},
'B': {
'D': {
'U': {
'L': {
'': {}
}
}
},
'E': {
'': {},
'L': {
'': {}
}
},
... etc

... which could optionally be simplified to this:

{
'A': {
'ARON': {
'': {}
}
'B': {
'DUL': {
'': {}
},
'E': {
'': {},
'L': {
'': {}
}
},
'RA': {
'HAM': {
'': {}
},
'M': {
'': {}
}
}
},

... etc

Here is the suggested code to do this:

import re 

def addToTree(tree, name):
if len(name) == 0:
return
if name[0] in tree.keys():
addToTree(tree[name[0]], name[1:])
else:
for letter in name:
tree[letter] = {}
tree = tree[letter]
tree[''] = {}

# Optional improvement of the tree: it combines several consecutive letters into
# one key if there are no alternatives possible
def simplifyTree(tree):
repeat = True
while repeat:
repeat = False
for key, subtree in list(tree.items()):
if key != '' and len(subtree) == 1 and '' not in subtree.keys():
for letter, subsubtree in subtree.items():
tree[key + letter] = subsubtree
del tree[key]
repeat = True
for key, subtree in tree.items():
if key != '':
simplifyTree(subtree)

def treeToRegExp(tree):
regexp = [re.escape(key) + treeToRegExp(subtree) for key, subtree in tree.items()]
regexp = '|'.join(regexp)
return '' if regexp == '' else '(?:' + regexp + ')'

def listToRegExp(names):
tree = {}
for name in names:
addToTree(tree, name[:])
simplifyTree(tree)
return re.compile(r'\b' + treeToRegExp(tree) + r'\b', re.I)

# Demo
names = ['AARON',
'ABDUL',
'ABE',
'ABEL',
'ABRAHAM',
'ABRAM',
'ADALBERTO',
'ADAM',
'ADAN',
'ADOLFO',
'ADOLPH',
'ADRIAN',
]

fields = [
'This is Aaron speaking',
'Is Abex a name?',
'Where did Abraham get the mustard from?'
]

regexp = listToRegExp(names)
# get the search result for each field, and link it with the index of the field
results = [[i, regexp.search(field)] for i, field in enumerate(fields)]
# remove non-matches from the results
results = [[i, match.group(0)] for [i, match] in results if match]
# print results
print(results)

See it run on repl.it

Fastest way to check does string contain any word from list

There is the any built-in function specially for that:

>>> message = "sometext"
>>> lista = ["a","b","c"]
>>> any(a in message for a in lista)
False
>>> lista = ["a","b","e"]
>>> any(a in message for a in lista)
True

Alternatively you could check the intersection of the sets:

>>> lista = ["a","b","c"]
>>> set(message) & set(lista)
set([])
>>> lista = ["a","b","e"]
>>> set(message) & set(lista)
set(['e'])
>>> set(['test','sentence'])&set(['this','is','my','sentence'])
set(['sentence'])

But you won't be able to check for subwords:

>>> set(['test','sentence'])&set(['this is my sentence'])

How do I check if a string contains a specific word?

Now with PHP 8 you can do this using str_contains:

if (str_contains('How are you', 'are')) { 
echo 'true';
}

RFC

Before PHP 8

You can use the strpos() function which is used to find the occurrence of one string inside another one:

$haystack = 'How are you?';
$needle = 'are';

if (strpos($haystack, $needle) !== false) {
echo 'true';
}

Note that the use of !== false is deliberate (neither != false nor === true will return the desired result); strpos() returns either the offset at which the needle string begins in the haystack string, or the boolean false if the needle isn't found. Since 0 is a valid offset and 0 is "falsey", we can't use simpler constructs like !strpos($a, 'are').

Check list of words in another string

if any(word in 'some one long two phrase three' for word in list_):


Related Topics



Leave a reply



Submit