Find Out Which Words in a Large List Occur in a Small String

Here's my shot at it:

def match_freq(exprs, strings)
  rs, ss, f = exprs.split.map{|x|Regexp.new(x)}, strings.split, {}
  rs.each{|r| ss.each{|s| f[r] = f[r] ? f[r]+1 : 1 if s=~r}}
  [f.values.inject(0){|a,x|a+x}, f, f.size]
end

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
x = match_freq(list1, str)
x # => [4, {/sam/=>1, /fred/=>2, /jack/=>1}, 3]

match_freq returns an array holding your three output items (a, b, c): the total number of matches, the per-word match counts, and the number of distinct words that matched. The algorithm itself is O(n*m), where n is the number of items in list1 and m is the size of the input string; I don't think you can do better than that in big-O terms. There are smaller optimizations that might pay off, though, such as keeping a running counter for the total number of matches instead of computing it afterwards. This was just my quick hack at it.

You can extract just the matching words from the output as follows:

matches = x[1].keys.map{|x|x.source}.join(" ") # => "sam fred jack"

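Note that the order in which the words occur in the string isn't preserved: the keys come back in hash order (unspecified on Ruby 1.8; insertion order, which here follows the word list, on 1.9 and later). If that's important, you'll have to keep a separate list of the order in which they were found.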
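For comparison, here is a minimal Python sketch of the same counting idea that keeps a running total and records the words in the order they first match in the string. The name match_freq_ordered and its return shape are assumptions made for illustration, and plain substring tests stand in for the Ruby regexes (equivalent for literal words).

```python
def match_freq_ordered(words, text):
    """Count how many tokens of text contain each word as a substring."""
    counts = {}   # word -> number of tokens containing it
    order = []    # words in the order they first matched in the text
    total = 0     # running total, so no second pass over the counts is needed
    for token in text.split():
        for word in words:
            if word in token:
                if word not in counts:
                    counts[word] = 0
                    order.append(word)
                counts[word] += 1
                total += 1
    return total, counts, order


total, counts, order = match_freq_ordered(
    "fred sam sandy jack sue bill".split(),
    "and so sammy went with jack to see fred and freddie",
)
print(total, counts, order)
```

Looping over the tokens in the outer loop is what makes the order list follow the string rather than the word list.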

Most Efficient Way to Find Whether a Large List Contains a Specific String (Python)

Python's built-in set type is what you should try.

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
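A minimal sketch of membership testing with a set; the word list here is made up for illustration. Note that this finds exact matches only, not substrings.

```python
# Hypothetical word list; in practice it would be loaded from a file or database.
words = ["fred", "sam", "sandy", "jack", "sue", "bill"]

# Building the set is a one-time O(n) cost.
word_set = set(words)

# Each membership test is then O(1) on average, versus O(n) for `in` on a list.
print("jack" in word_set)     # True
print("freddie" in word_set)  # False
```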

Faster way to see if a huge list of strings is contained within another string

A benchmark between the Trie gem and Triez gem in this particular use case:

word count: 228982
                   user     system      total        real
trie          13.410000   0.050000  13.460000 ( 13.463473)
triez         11.080000   0.010000  11.090000 ( 11.102195)
trie tail     39.920000   0.140000  40.060000 ( 40.102285)
triez tail    28.960000   0.030000  28.990000 ( 29.022630)

In general, Triez is faster for the OP's use case.

require 'triez'
require 'trie'
require 'benchmark'

DICT = '/usr/share/dict/web2'

triez = Triez.new value_type: :object, default: nil
trie = Trie.new

count = 0
File.foreach(DICT) do |word|
  word.chomp!
  if word.size > 4
    triez[word] = word
    trie.add word
    count += 1
  end
end

puts "word count: #{count}"

def in_trie?(str, trie)
  0.upto(str.length - 1) do |i|
    node = trie.root
    i.upto(str.length - 1) do |j|
      break unless node.walk! str[j]
      if node.terminal?
        return str[i..j]
      end
    end
  end
  nil
end

def in_triez?(str, triez)
  triez.change_all(:substring, str) do |v|
    return v if v
  end
  nil
end

Benchmark.bm(12) do |b|
  b.report('trie') do
    1_000_000.times { in_trie?('ifdxawesome45someword3', trie) }
  end
  b.report('triez') do
    1_000_000.times { in_triez?('ifdxawesome45someword3', triez) }
  end
  b.report('trie tail') do
    1_000_000.times { in_trie?('ifdx45someword3awesome', trie) }
  end
  b.report('triez tail') do
    1_000_000.times { in_triez?('ifdx45someword3awesome', triez) }
  end
end

UPDATE: benchmark for rambling-trie, where the lines with the "c" prefix are the compressed version. (NOTE: ROUND has been reduced from 1M to 100K in this benchmark.)

Word count: 228982, ROUND: 100000
                      user     system      total        real
trie              1.510000   0.000000   1.510000 (  1.511772)
triez             1.170000   0.000000   1.170000 (  1.176075)
rambling          4.800000   0.010000   4.810000 (  4.847021)
c rambling       25.060000   0.050000  25.110000 ( 25.172771)
trie tail         4.540000   0.010000   4.550000 (  4.566233)
triez tail        3.080000   0.010000   3.090000 (  3.092655)
rambling tail     4.780000   0.010000   4.790000 (  4.803114)
c rambling tail  23.470000   0.020000  23.490000 ( 23.525066)

It seems rambling-trie is implemented purely in Ruby, and it doesn't offer direct methods for prefix matching, so the following monkey patches need to be added first. There may be a better implementation, but I didn't dig further.

class Rambling::Trie::Container
  def match_prefix?(str)
    root.match_prefix?(str.chars)
  end
end

class Rambling::Trie::RawNode
  def match_prefix?(chars, i = 0)
    if children_tree.empty?
      true
    elsif i >= chars.size
      false
    else
      letter = chars[i].to_sym
      child = children_tree[letter]
      !!child && child.match_prefix?(chars, i + 1)
    end
  end
end

class Rambling::Trie::CompressedNode
  def match_prefix?(chars)
    if children_tree.empty?
      true
    elsif chars.empty?
      false
    else
      !!(recursive_get :match_prefix?, chars)
    end
  end
end

def in_r_trie?(str, r_trie)
  0.upto(str.length - 1) do |i|
    if r_trie.match_prefix? str[i..-1]
      return true
    end
  end
  false
end
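If you don't have these gems at hand, the scan that in_trie? performs can be sketched with a plain nested-dict trie in Python. This is only an illustration of the idea, not a port of either gem; build_trie, first_word_in, and the END marker are names made up for this sketch.

```python
END = "<end>"  # multi-character key, so it can never collide with a single char


def build_trie(words):
    """Build a nested-dict trie; the END key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def first_word_in(text, trie):
    """Return the first dictionary word found as a substring of text, or None."""
    for i in range(len(text)):          # try every start position
        node = trie
        for j in range(i, len(text)):   # walk the trie as far as the text allows
            node = node.get(text[j])
            if node is None:
                break
            if END in node:
                return text[i:j + 1]
    return None


trie = build_trie(["awesome", "someword"])
print(first_word_in("ifdxawesome45someword3", trie))  # awesome
```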

Check if a list of words is in a string (Small Chatbot)

You can use Array.prototype.includes().

To match the whole string:

var helloWords = ["hello", "salut", "hi", "yo", "hey"];

var HowWords = ["how are you", "what's up", "how is it going", "how do you do"];

if (helloWords.includes(yourString.toLowerCase())) {
    // Reply something
}
if (HowWords.includes(yourString.toLowerCase())) {
    // Reply something else
}

To match a partial string, you'll need to do something like this using Array.prototype.some():

var helloWords = ["hello", "salut", "hi", "yo", "hey"];

var HowWords = ["how are you", "what's up", "how is it going", "how do you do"];

if (helloWords.some( i => yourString.toLowerCase().includes(i) )) {
    // Reply something
}
if (HowWords.some( i => yourString.toLowerCase().includes(i) )) {
    // Reply something else
}

Algorithm to search for a list of words in a text

There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.

The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
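To make the state machine concrete, here is a compact Python sketch of Aho-Corasick. It is a teaching version under simple assumptions (non-empty words, everything held in memory), and the names build_automaton and find_all are made up for this example; for production use you would more likely reach for an existing library.

```python
from collections import deque


def build_automaton(words):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> words that end at this state

    # Phase 1: insert every word into a trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass to compute the failure links.
    queue = deque(goto[0].values())   # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Yield (start_index, word) for every occurrence, overlaps included."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            yield i - len(word) + 1, word


automaton = build_automaton(["dog", "dogma"])
print(list(find_all("My karma ate your dogma.", automaton)))
```

The text is scanned once, character by character, regardless of how many words are in the list; that is where the advantage over checking each word separately comes from.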

You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:

"dog|cat|horse|skunk"

And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
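As a rough sketch of generating the regex from a word list, here is how it might look with Python's re module. The helper name build_word_regex is an assumption for this example. The words are sorted longest first because a backtracking engine such as Python's tries alternatives in the order listed.

```python
import re


def build_word_regex(words):
    """Compile one alternation pattern from a list of literal words."""
    # Longest first, so that "dogma" would be tried before "dog".
    ordered = sorted(set(words), key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" inside a word literal.
    return re.compile("|".join(re.escape(w) for w in ordered))


pattern = build_word_regex(["dog", "cat", "horse", "skunk"])
text = "The dog chased the cat past a skunk."
found = [m.group(0) for m in pattern.finditer(text)]
print(found)  # ['dog', 'cat', 'skunk']
```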

There is a difference, though, between the results from a regex and the results from the Aho-Corasick algorithm. For example, suppose you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." A regex search reports only one match at that position: "dogma" with a leftmost-longest engine, or whichever alternative is listed first with a backtracking engine (so list longer words first). The Aho-Corasick implementation will report finding "dog" and "dogma" at the same position.

If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.

Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to give whole words only. Typically, that's done with the \b word-boundary assertion, as in:

"\b(cat|dog|horse|skunk)\b"
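A quick check of the whole-word pattern in Python, as a sketch; the sample sentence is made up. The pattern is written as a raw string so that \b reaches the regex engine instead of being read as a backspace character.

```python
import re

# \b asserts a boundary between a word character and a non-word character.
whole_words = re.compile(r"\b(cat|dog|horse|skunk)\b")

text = "My karma ate your dogma, but the dog is fine."
matches = [(m.start(), m.group(1)) for m in whole_words.finditer(text)]
print(matches)  # [(33, 'dog')]
```

The "dog" inside "dogma" is skipped because the "g" is followed by another word character.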

The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:

hashTable = Build hash table from target words
for each word in input text
    if word in hashTable then
        output word

Or, if you want a list of matching words that are in the input text:

hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
    if word in hashTable then
        add word to foundWords
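The second version of the pseudocode might look like this in Python, where a set plays the role of the hash table. How the text is split into words is an assumption here (lowercased runs of letters, digits, underscores, or apostrophes), and found_words is a name made up for the sketch.

```python
import re


def found_words(targets, text):
    """Return the set of target words that occur as whole words in text."""
    table = set(targets)   # hash table of target words
    found = set()          # matching words seen so far
    # Assumption: a "word" is a run of word characters or apostrophes.
    for word in re.findall(r"[\w']+", text.lower()):
        if word in table:
            found.add(word)
    return found


result = found_words(
    ["dog", "cat", "horse", "skunk"],
    "The dog saw a Cat; the dogma was lost on both.",
)
print(sorted(result))  # ['cat', 'dog']
```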

How to check if a list (string) contains another list (string) considering order

I believe that this answer should work if you just don't remove things from the sublist that aren't in the test list. For the first method provided there, that gives:

def contains(testList, subList):
   shared = [x for x in testList if x in subList]
   return shared == subList

You could also convert the sublist to a list first, so that the function works with non-list inputs such as tuples or strings:

def contains(testList, subList):
   shared = [x for x in testList if x in subList]
   return shared == list(subList)
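One caveat: the comprehension above returns False when the test list repeats an item of the sublist (for example, a test list of [1, 2, 1, 3] with a sublist of [1, 2] gives shared == [1, 2, 1]). If that matters, a different technique, an iterator-based subsequence check, handles repeats. This is a sketch that assumes "considering order" means in order but not necessarily contiguous; contains_in_order is a made-up name.

```python
def contains_in_order(test_list, sub_list):
    """True if the items of sub_list appear in test_list in the same order."""
    it = iter(test_list)
    # `x in it` advances the iterator until it finds x, so each later item
    # must be found after the position of the previous one.
    return all(x in it for x in sub_list)


print(contains_in_order([1, 2, 1, 3], [1, 2]))  # True
print(contains_in_order([1, 2, 1, 3], [2, 1]))  # True
print(contains_in_order([1, 2, 3], [3, 1]))     # False
```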

Search a list of strings for any sub-string from another list

In your example, with so few items, it doesn't really matter. But if you have a list of several thousand items, this might help.

Since you don't care which element in the list contains the keyword, you can scan the whole list once (as one string) instead of one item at a time. For that you need a join character that you know won't occur in any keyword, in order to avoid false positives. I use the newline in this example.

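def check_data(data):
    # Join once so each keyword is searched in a single string; "\n" is
    # assumed not to occur in any keyword, which avoids false positives
    # that span two items.
    s = "\n".join(data)
    # `keywords` is the list of search terms defined elsewhere in your code.
    for k in keywords:
        if k in s:
            return True

    return False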

In my completely unscientific test, my version checked a list of 5000 items 100000 times in about 30 seconds. I stopped your version after 3 minutes -- got tired of waiting to post =)

Fastest way to check if a List<String> contains a unique String

Your best bet is to use a HashSet and check if a string exists in the set via the contains() method. HashSets are built for fast access via the use of Object methods hashCode() and equals(). The Javadoc for HashSet states:

This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.

HashSet stores objects in hash buckets, which is to say that the value returned by the hashCode() method determines which bucket an object is stored in. This way, the number of equality checks the HashSet has to perform via the equals() method is reduced to just the other objects in the same hash bucket.

To use HashSets and HashMaps effectively, you must conform to the equals and hashCode contract outlined in the Javadoc. In the case of java.lang.String, these methods have already been implemented to do this.