Find Out Which Words in a Large List Occur in a Small String

Find out which words in a large list occur in a small string

Here's my shot at it:

def match_freq(exprs, strings)
rs, ss, f = exprs.split.map{|x|Regexp.new(x)}, strings.split, {}
rs.each{|r| ss.each{|s| f[r] = f[r] ? f[r]+1 : 1 if s=~r}}
[f.values.inject(0){|a,x|a+x}, f, f.size]
end

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
x = match_freq(list1, str)
x # => [4, {/sam/=>1, /fred/=>2, /jack/=>1}, 3]

The output of "match_freq" is an array of your output items (a,b,c). The algorithm itself is O(n*m) where n is the number of items in list1 and m is the size of the input string, I don't think you can do better than that (in terms of big-oh). But there are smaller optimizations that might pay off like keeping a separate counter for the total number of matches instead of computing it afterwards. This was just my quick hack at it.

You can extract just the matching words from the output as follows:

matches = x[1].keys.map{|x|x.source}.join(" ") # => "sam fred jack"

Note that the order won't be preserved necessarily, if that's important you'll have to keep a separate list of the order they were found.

Most Efficient Way to Find Whether a Large List Contains a Specific String (Python)

The python Set is what you should try.

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.

Faster way to see if a huge list of strings is contained within another string

A benchmark between the Trie gem and Triez gem in this particular use case:

word count: 228982
user system total real
trie 13.410000 0.050000 13.460000 ( 13.463473)
triez 11.080000 0.010000 11.090000 ( 11.102195)
trie tail 39.920000 0.140000 40.060000 ( 40.102285)
triez tail 28.960000 0.030000 28.990000 ( 29.022630)

Generally, Triez is faster for the Op's use case.

require 'triez'
require 'trie'
require 'benchmark'

DICT = '/usr/share/dict/web2'

triez = Triez.new value_type: :object, default: nil
trie = Trie.new

count = 0
File.foreach(DICT) do |word|
word.chomp!
if word.size > 4
triez[word] = word
trie.add word
count += 1
end
end

puts "word count: #{count}"

def in_trie?(str, trie)
0.upto(str.length - 1) do |i|
node = trie.root
i.upto(str.length - 1) do |j|
break unless node.walk! str[j]
if node.terminal?
return str[i..j]
end
end
end
nil
end

def in_triez?(str, triez)
triez.change_all(:substring, str) do |v|
return v if v
end
nil
end

Benchmark.bm(12) do |b|
b.report('trie') do
1_000_000.times { in_trie?('ifdxawesome45someword3', trie) }
end
b.report('triez') do
1_000_000.times { in_triez?('ifdxawesome45someword3', triez) }
end
b.report('trie tail') do
1_000_000.times { in_trie?('ifdx45someword3awesome', trie) }
end
b.report('triez tail') do
1_000_000.times { in_triez?('ifdx45someword3awesome', triez) }
end
end

UPDATE benchmark for rambling-trie, where the lines with c prefix is the compressed version. (NOTE: the ROUND has been reduced to 100K rather than 1M in the prefix benchmark)

Word count: 228982, ROUND: 100000
user system total real
trie 1.510000 0.000000 1.510000 ( 1.511772)
triez 1.170000 0.000000 1.170000 ( 1.176075)
rambling 4.800000 0.010000 4.810000 ( 4.847021)
c rambling 25.060000 0.050000 25.110000 ( 25.172771)
trie tail 4.540000 0.010000 4.550000 ( 4.566233)
triez tail 3.080000 0.010000 3.090000 ( 3.092655)
rambling tail 4.780000 0.010000 4.790000 ( 4.803114)
c rambling tail 23.470000 0.020000 23.490000 ( 23.525066)

It seems rambling-trie is implemented purely in Ruby, and it doesn't offer direct methods to do prefix matching. The following monkey patches need to be added first. There may be better implementation, but I didn't dig further.

class Rambling::Trie::Container
def match_prefix?(str)
root.match_prefix?(str.chars)
end
end

class Rambling::Trie::RawNode
def match_prefix?(chars, i = 0)
if children_tree.empty?
true
elsif i >= chars.size
false
else
letter = chars[i].to_sym
child = children_tree[letter]
!!child && child.match_prefix?(chars, i + 1)
end
end
end

class Rambling::Trie::CompressedNode
def match_prefix?(chars)
if children_tree.empty?
true
if chars.empty?
false
else
!!(recursive_get :match_prefix?, chars)
end
end
end

def in_r_trie?(str, r_trie)
0.upto(str.length - 1) do |i|
if r_trie.match_prefix? str[i..-1]
return true
end
end
false
end

Check if a list of words is in a string (Small Chatbot)

You can use Array.prototype.includes().

To match the whole string:

var helloWords = ["hello", "salut", "hi", "yo", "hey"];

var HowWords = ["how are you", "what's up", "how is it going", "how do you do"];

if (helloWords.includes(yourString.toLowerCase())) {
// Reply something
}
if (HowWords.includes(yourString.toLowerCase())) {

// Reply something else
}

To match partial string, you'll need to do something like this using Array.prototype.some():

var helloWords = ["hello", "salut", "hi", "yo", "hey"];

var HowWords = ["how are you", "what's up", "how is it going", "how do you do"];

if (helloWords.some( i => yourString.toLowerCase().includes(i) )) {
// Reply something
}
if (HowWords.some( i => yourString.toLowerCase().includes(i) )) {
// Reply something else
}

Algorithm to search for a list of words in a text

There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.

The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.

You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:

"dog|cat|horse|skunk"

And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.

There is a difference, though, in the results from a regex and the results from the Aho-Corasick algorithm. For example if you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." The regex library search will report finding "dogma". The Aho-Corasick implementation will report finding "dog" and "dogma" at the same position.

If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.

Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:

"\b(cat|dog|horse|skunk)\b"

The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:

hashTable = Build hash table from target words
for each word in input text
if word in hashTable then
output word

Or, if you want a list of matching words that are in the input text:

hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
if word in hashTable then
add word to foundWords

How to check if a list (string) contains another list (string) considering order

I believe that this answer should work if you just don't remove things from the sublist that aren't in the test list. So for the case of the first method provided there

def contains(testList, subList):
shared = [x for x in testList if x in subList]
return shared == subList

You could also convert the sublist to work with non-list inputs.

def contains(testList, subList):
shared = [x for x in testList if x in subList]
return shared == list(subList)

Search a list of strings for any sub-string from another list

In your example, with so few items, it doesn't really matter. But if you have a list of several thousand items, this might help.

Since you don't care which element in the list contains the keyword, you can scan the whole list once (as one string) instead of one item at the time. For that you need a join character that you know won't occur in the keyword, in order to avoid false positives. I use the newline in this example.

def check_data(data):
s = "\n".join(data);
for k in keywords:
if k in s:
return True

return False

In my completely unscientific test, my version checked a list of 5000 items 100000 times in about 30 seconds. I stopped your version after 3 minutes -- got tired of waiting to post =)

Fastest way to check if a ListString contains a unique String

Your best bet is to use a HashSet and check if a string exists in the set via the contains() method. HashSets are built for fast access via the use of Object methods hashCode() and equals(). The Javadoc for HashSet states:

This class offers constant time performance for the basic operations (add, remove, contains and size),

HashSet stores objects in hash buckets which is to say that the value returned by the hashCode method will determine which bucket an object is stored in. This way, the amount of equality checks the HashSet has to perform via the equals() method is reduced to just the other Objects in the same hash bucket.

To use HashSets and HashMaps effectively, you must conform to the equals and hashCode contract outlined in the javadoc. In the case of java.lang.String these methods have already been implemented to do this.



Related Topics



Leave a reply



Submit