Ruby: How to Count the Number of Times a String Appears in Another String

Ruby: How to count the number of times a string appears in another string?

Here are a couple of ways to count the number of times a given substring appears in a string (the first being my preference). Note (as confirmed by the OP) that overlapping occurrences are counted: the substring 'aa' appears twice in the string 'aaa', and therefore five times in:

str = "aaabbccaaaaddbab"

Use String#scan with a regex that contains a positive lookahead that looks for the substring:

def count_em(str, substr)
  # Escape substr so any regex metacharacters in it are matched literally.
  str.scan(/(?=#{Regexp.escape(substr)})/).count
end

count_em(str,"aa")
  #=> 5
count_em(str,"ab")
  #=> 2
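
When a substring is interpolated into a regex, characters such as . or * are read as pattern syntax unless the substring is passed through Regexp.escape first. A small sketch of the difference (count_raw and count_escaped are names I made up for this comparison):

```ruby
# Interpolates substr as-is, so "." means "any character".
def count_raw(str, substr)
  str.scan(/(?=#{substr})/).count
end

# Escapes substr first, so "." only matches a literal dot.
def count_escaped(str, substr)
  str.scan(/(?=#{Regexp.escape(substr)})/).count
end

count_raw("a.b axb a.b", "a.b")
  #=> 3
count_escaped("a.b axb a.b", "a.b")
  #=> 2
```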

Note:

"aaabbccaaaaddbab".scan(/(?=aa)/)
  #=> ["", "", "", "", ""]

A positive lookbehind produces the same result:

"aaabbccaaaaddbab".scan(/(?<=aa)/)
  #=> ["", "", "", "", ""]

As well, String#scan could be replaced with the form of String#gsub that takes one argument (here the same regular expression) and no block, and returns an enumerator. That form of gsub is unusual in that it has nothing to do with character replacement; it simply generates matches of the regular expression.
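
A minimal sketch of that gsub variant, assuming the same lookahead pattern (count_with_gsub is a hypothetical name, and the Regexp.escape call is my addition):

```ruby
def count_with_gsub(str, substr)
  # With a pattern and no block, gsub returns an Enumerator over the matches;
  # nothing is replaced.
  str.gsub(/(?=#{Regexp.escape(substr)})/).count
end

count_with_gsub("aaabbccaaaaddbab", "aa")
  #=> 5
count_with_gsub("aaabbccaaaaddbab", "ab")
  #=> 2
```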

Convert the substring to an array of characters, then apply String#each_char, Enumerable#each_cons and Enumerable#count to the string:

def count_em(str, substr)
  subarr = substr.chars
  str.each_char
     .each_cons(substr.size)
     .count(subarr)
end

count_em(str,"aa")
  #=> 5
count_em(str,"ab")
  #=> 2

We have:

subarr = "aa".chars
  #=> ["a", "a"]
enum0 = "aaabbccaaaaddbab".each_char
  #=> #<Enumerator: "aaabbccaaaaddbab":each_char>

We can see the elements that will be generated by this enumerator by converting it to an array:

enum0.to_a
  #=> ["a", "a", "a", "b", "b", "c", "c", "a", "a", "a",
  #    "a", "d", "d", "b", "a", "b"]

enum1 = enum0.each_cons("aa".size)
  #=> #<Enumerator: #<Enumerator:
  #      "aaabbccaaaaddbab":each_char>:each_cons(2)>

Convert enum1 to an array to see what values the enumerator will pass on to count:

enum1.to_a
  #=> [["a", "a"], ["a", "a"], ["a", "b"], ["b", "b"], ["b", "c"],
  #    ["c", "c"], ["c", "a"], ["a", "a"], ["a", "a"], ["a", "a"], 
  #    ["a", "d"], ["d", "d"], ["d", "b"], ["b", "a"],
  #    ["a", "b"]]
 
enum1.count(subarr)
  #=> enum1.count(["a", "a"])
  #=> 5

How to count a string elements' occurrence in another string in ruby?

Code

def count_em(str, target)
  target.chars.uniq.map { |c| str.count(c)/target.count(c) }.min
end

Examples

count_em "I love donuts!", "donuts"                      #=> 1
count_em "Squirrels do love nuts", "donuts"              #=> 1
count_em "donuts do stun me", "donuts"                   #=> 2
count_em "donuts and nuts sound too delicious", "donuts" #=> 3
count_em "cats have nine lives", "donuts"                #=> 0
count_em "feeding force scout", "coffee"                 #=> 1
count_em "feeding or scout", "coffee"                    #=> 0

str = ("free mocha".chars*4).shuffle.join
  # => "hhrefemcfeaheomeccrmcre eef oa ofrmoaha "
count_em str, "free mocha"
  #=> 4

Explanation

For

str = "feeding force scout"
target = "coffee"

a = target.chars
  #=> ["c", "o", "f", "f", "e", "e"] 
b = a.uniq
  #=> ["c", "o", "f", "e"] 
c = b.map { |c| str.count(c)/target.count(c) }
  #=> [2, 2, 1, 1] 
c.min
  #=> 1

In calculating c, consider the first element of b passed to the block and assigned to the block variable c.

c = "c"

Then the block calculation is

d = str.count(c)
  #=> 2 
e = target.count(c)
  #=> 1
d/e
  #=> 2

This indicates that str contains enough "c"'s to match "coffee" twice.

The remaining calculations to obtain c are similar.
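
For completeness, here is a short sketch that prints the same calculation for every distinct character of target (the rows variable and its layout are only for illustration):

```ruby
str    = "feeding force scout"
target = "coffee"

# One row per distinct character of target:
# [character, count in str, count in target, integer quotient]
rows = target.chars.uniq.map do |ch|
  [ch, str.count(ch), target.count(ch), str.count(ch) / target.count(ch)]
end

rows.each { |row| p row }
# ["c", 2, 1, 2]
# ["o", 2, 1, 2]
# ["f", 2, 2, 1]
# ["e", 3, 2, 1]
```

The smallest quotient, 1, is the value returned by count_em.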

Addendum

If the characters of str that match the characters of target must appear in the same order as in target, the following regex could be used.

target = "coffee"

r = /#{ target.chars.join(".*?") }/i
  #=> /c.*?o.*?f.*?f.*?e.*?e/i

matches = "xcorr fzefe yecaof tfe erg eeffoc".scan(r)
  #=> ["corr fzefe ye", "caof tfe e"]
matches.size
  #=> 2

"feeding force scout".scan(r).size
  #=> 0

The question marks in the regex are needed to make the searches non-greedy.
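
To see why, here is a quick comparison of the non-greedy pattern with a greedy one built the same way (the names lazy and greedy are mine). The greedy version lets the first match run as far as it can, swallowing what would have been the second match:

```ruby
s      = "xcorr fzefe yecaof tfe erg eeffoc"
target = "coffee"

lazy   = /#{target.chars.join(".*?")}/i  # stops at the earliest possible end
greedy = /#{target.chars.join(".*")}/i   # stretches to the latest possible end

s.scan(lazy).size
  #=> 2
s.scan(greedy).size
  #=> 1
```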

Finding # occurrences of a character in a string in Ruby

I was able to solve this by passing a string through count as shown in another answer.

For example:

string = 'This is an example'
puts string.count('e')

Outputs:

2

I was also able to pull out the occurrences by using scan and passing a string instead of a regex, which varies slightly from another answer but was helpful for avoiding a regex.

string = 'This is an example'
p string.scan('e')

Outputs:

["e", "e"]
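
One thing to watch for: String#count treats its argument as a set of characters, not as a substring, so the two methods only agree for single characters. A small illustration (the example word is my own):

```ruby
word = "banana"

# count("an") counts every "a" and every "n": 3 + 2.
word.count("an")
  #=> 5

# scan("an") finds occurrences of the substring "an".
word.scan("an").size
  #=> 2
```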

I explored these methods further in a guide I created after I figured it out.

How can I get the number of times a substring appears in a text

'aa_bb_cc_dd_eeeee_ff'.scan(/(?=ee)/).length
# => 4

How to count occurrences of a substring within string fast with Ruby

I think you could approach this problem differently.

You do not need to scan the file this many times. You could create a database, in MongoDB or MySQL for example, and for each word you find, look it up in the database and increment a "counter" field.

You might object that you would then have to query the database many times, which could take even longer. It won't take more time, because databases are optimized for I/O; besides, you could always index it.
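
As a rough sketch of the single-pass idea, here is the same counting done with an in-memory Hash standing in for the database table. The Hash, the sample text and the \w+ word pattern are all my assumptions, not part of the original setup:

```ruby
text = "the cat and the hat and the bat"

# Hash.new(0) plays the role of the "counter" field: unseen words start at 0.
counts = Hash.new(0)

# One pass over the text; each word found bumps its own counter.
text.scan(/\w+/) { |word| counts[word] += 1 }

counts["the"]
  #=> 3
counts["and"]
  #=> 2
counts["dog"]
  #=> 0
```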

EDIT: There is no way to delimit at all? Let's say that where you have the Word.name string you really hold a (not simple) regex. Could the regex contain \n? Well, if the regex can match anything, you should estimate the maximum length of string the regex can match, double it, and scan the file in windows of that many characters, moving the cursor forward by the estimated maximum (half the window) each time.

Let's say your estimate of the maximum length your regex could match is 20 chars and your file has 30000 chars. You run each regex over chars 0 to 40, then 20 to 60, then 40 to 80, and so on.

You should also record the position of each match you find, so that a match falling in the overlap between two windows is not counted twice.
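
A minimal sketch of that windowed scan, assuming the text is already in a string and the pattern has no anchors or lookarounds (windowed_match_count and its parameters are hypothetical names). It counts distinct match start positions, so for patterns that can overlap themselves the result may differ from a plain scan:

```ruby
require "set"

# text    - the string to search
# regex   - the pattern to count
# max_len - estimated maximum length of a single match
def windowed_match_count(text, regex, max_len)
  seen   = Set.new       # absolute start offsets already counted
  window = max_len * 2   # each window is twice the estimated maximum
  offset = 0

  while offset < text.length
    chunk = text[offset, window]
    chunk.scan(regex) do
      # Translate the match position in the chunk to a position in text.
      seen << offset + Regexp.last_match.begin(0)
    end
    offset += max_len    # advance by half a window so windows overlap
  end

  seen.size
end

text = "foo bar foo baz foofoo"
windowed_match_count(text, /foo/, 3)
  #=> 4
```

The Set of start offsets is what keeps a match in the overlap from being counted twice.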

Finally, this solution may not be worth the effort; your problem may have a better solution depending on what those regexes are. But it will be faster than invoking scan Words.count times on your 300 MB string.
