How to Write a Method That Counts the Most Common Substring in a String in Ruby

How to write a method that counts the most common substring in a string in ruby?

Code

def most_frequent_substrings(str, k)
(0..str.size-k).each_with_object({}) do |i,h|
b = []
str[i..-1].scan(Regexp.new str[i,k]) { b << Regexp.last_match.begin(0) + i }
(h[b.size] ||= []) << b
end.max_by(&:first).last.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
end

Example

str = "ABBABABBABCATSABBABB"
most_frequent_substrings(str, 4)
#=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}

This shows that the most frequently-occurring 4-character substring of strappears 3 times. There are two such substrings: "ABBA" and "BBAB". "ABBA" begins at offsets (into str) 0, 5 and 14, "BBAB" substrings begin at offsets 1, 6 and 15.

Explanation

For the example above the steps are as follows.

k = 4
n = str.size - k
#=> 20 - 4 => 16
e = (0..n).each_with_object([])
#<Enumerator: 0..16:each_with_object([])>

We can see the values that will be generated by this enumerator by converting it to an array.

e.to_a
#=> [[0, []], [1, []], [2, []], [3, []], [4, []], [5, []], [6, []], [7, []], [8, []],
# [9, []], [10, []], [11, []], [12, []], [13, []], [14, []], [15, []], [16, []]]

Note the empty array contained in each element will be modified as the array is built. Continuing, the first element of e is passed to the block and the block variables are assigned using parallel assignment:

i,a = e.next
#=> [0, []]
i #=> 0
a #=> []

We are now considering the substring of size 4 that begins at str offset i #=> 0, which is seen to be "ABBA". Now the block calculation is performed.

b = []
r = Regexp.new str[i,k]
#=> Regexp.new str[0,4]
#=> Regexp.new "ABBA"
#=> /ABAB/
str[i..-1].scan(r) { b << Regexp.last_match.begin(0) + i }
#=> "ABBABABBABCATSABBABB".scan(r) { b << Regexp.last_match.begin(0) + i }
b #=> [0, 5, 14]

We next have

(h[b.size] ||= []) << b

which becomes

(h[b.size] = h[b.size] || []) << b
#=> (h[3] = h[3] || []) << [0, 5, 14]

Since h has no key 3, h[3] on the right side equals nil. Continuing,

  #=> (h[3] = nil || []) <<  [0, 5, 14]
#=> (h[3] = []) << [0, 5, 14]
h #=> { 3=>[[0, 5, 14]] }

Notice that we throw away scan's return value. All we need is b

This tells us the "ABBA" appears thrice in str, beginning at offsets 0, 5 and 14.

Now observe

e.to_a
#=> [[0, [[0, 5, 14]]], [1, [[0, 5, 14]]], [2, [[0, 5, 14]]],
# ...
# [16, [[0, 5, 14]]]]

After all elements of e have been passed to the block, the block returns

h #=> {3=>[[0, 5, 14], [1, 6, 15]],
# 1=>[[2], [3], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]],
# 2=>[[4, 16], [5, 14], [6, 15]]}

Consider substrings that appear just once: h[1]. One of those is [2]. This pertains to the 4-character substring beginning at str offset 2:

str[2,4]
#=> "BABA"

That is found to be the only instance of that substring. Similarly, among the substrings that appear twice is str[4,4] = str[16,4] #=> "BABB", given by h[2][0] #=> [4, 16].

Next we determine the greatest frequency of a substring of length 4:

c = h.max_by(&:first)
#=> [3, [[0, 5, 14], [1, 6, 15]]]

(which could also be written c = h.max_by { |k,_| k }).

d = c.last
#=> [[0, 5, 14], [1, 6, 15]]

For convenience, convert d to a hash:

d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
#=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}

and return that hash from the method.

There is one detail that deserves mention. It is possible that d will contain two or more arrays that reference the same substring, in which case the value of the associated key (the substring) will equal the last of those arrays. Here's a simple example.

str = "AAA"
k = 2

In this case the array d above will equal

d = [[0], [1]]

Both of these reference str[0,2] #=> str[1,2] #=> "AA". In building the hash the first is overwritten by the second:

d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
#=> {"AA"=>[1]}

How to write a method that finds the most common letter in a string?

I suggest you use a counting hash:

str = "The quick brown dog jumped over the lazy fox."

str.downcase.gsub(/[^a-z]/,'').
each_char.
with_object(Hash.new(0)) { |c,h| h[c] += 1 }.
max_by(&:last)
#=> ["e",4]

Hash::new with an argument of zero creates an empty hash whose default value is zero.

The steps:

s = str.downcase.gsub(/[^a-z]/,'')
#=> "thequickbrowndogjumpedoverthelazyfox"
enum0 = s.each_char
#=> #<Enumerator: "thequickbrowndogjumpedoverthelazyfox":each_char>
enum1 = enum0.with_object(Hash.new(0))
#=> #<Enumerator: #<Enumerator:
# "thequickbrowndogjumpedoverthelazyfox":each_char>:with_object({})>

You can think of enum1 as a "compound" enumerator. (Study the return value above.)

Let's see the elements of enum1:

enum1.to_a
#=> [["t", {}], ["h", {}], ["e", {}], ["q", {}],..., ["x", {}]]

The first element of enum1 (["t", {}]) is passed to the block by String#each_char and assigned to the block variables:

c,h = enum1.next
#=> ["t", {}]
c #=> "t"
h #=> {}

The block calculation is then performed:

h[c] += 1
#=> h[c] = h[c] + 1
#=> h["t"] = h["t"] + 1
#=> h["t"] = 0 + 1 #=> 1
h #=> {"t"=>1}

Ruby expands h[c] += 1 to h[c] = h[c] + 1, which is h["t"] = h["t"] + 1 As h #=> {}, h has no key "t", so h["t"] on the right side of the equal sign is replaced by the hash's default value, 0. The next time c #=> "t", h["t"] = h["t"] + 1 will reduce to h["t"] = 1 + 1 #=> 2 (i.e., the default value will not be used, as h now has a key "t").

The next value of enum1 is then passed into the block and the block calculation is performed:

c,h = enum1.next
#=> ["h", {"t"=>1}]
h[c] += 1
#=> 1
h #=> {"t"=>1, "h"=>1}

The remaining elements of enum1 are processed similarly.

Most common words in string

Your method of counting instances of the word is your problem. it is in with, so it's double counted.

[1] pry(main)> 'with some words in it'.scan('it')
=> ["it", "it"]

It can be done easier though, you can group an array's contents by the number of instances of the values using an each_with_object call, like so:

counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }

This goes through each entry in the array and adds 1 to the value for each word's entry in the hash.

So the following should work for you:

def most_common(string)
words = string.downcase.tr(",.?!",'').split(' ')
counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
max_quantity = counts.values.max
counts.select { |k, v| v == max_quantity }.keys
end

p most_common('a short list of words with some words') #['words']
p most_common('Words in a short, short words, lists of words!') #['words']
p most_common('a short list of words with some short words in it') #['words', 'short']

Find most common string in an array

Ruby < 2.2

#!/usr/bin/ruby1.8

def most_common_value(a)
a.group_by do |e|
e
end.values.max_by(&:size).first
end

x = ["1.111", "1.122", "1.250", "1.111"]
p most_common_value(x) # => "1.111"

Note: Enumberable.max_by is new with Ruby 1.9, but it has been backported to 1.8.7

Ruby >= 2.2

Ruby 2.2 introduces the Object#itself method, with which we can make the code more concise:

def most_common_value(a)
a.group_by(&:itself).values.max_by(&:size).first
end

As a monkey patch

Or as Enumerable#mode:

Enumerable.class_eval do
def mode
group_by do |e|
e
end.values.max_by(&:size).first
end
end

["1.111", "1.122", "1.250", "1.111"].mode
# => "1.111"

Find the longest common starting substring in a set of strings

It's a matter of taste, but this is a simple javascript version:
It sorts the array, and then looks just at the first and last items.

//longest common starting substring in an array

function sharedStart(array){
var A= array.concat().sort(),
a1= A[0], a2= A[A.length-1], L= a1.length, i= 0;
while(i<L && a1.charAt(i)=== a2.charAt(i)) i++;
return a1.substring(0, i);
}

DEMOS

sharedStart(['interspecies', 'interstelar', 'interstate'])  //=> 'inters'
sharedStart(['throne', 'throne']) //=> 'throne'
sharedStart(['throne', 'dungeon']) //=> ''
sharedStart(['cheese']) //=> 'cheese'
sharedStart([]) //=> ''
sharedStart(['prefix', 'suffix']) //=> ''

Any Framework functions helping to find the longest common starting substring of multiple strings?

If anyone is interested, here's what I came up with:

    public static string GetCommonStartingSubString(IList<string> strings)
{
if (strings.Count == 0)
return "";
if (strings.Count == 1)
return strings[0];
int charIdx = 0;
while (IsCommonChar(strings, charIdx))
++charIdx;
return strings[0].Substring(0, charIdx);
}
private static bool IsCommonChar(IList<string> strings, int charIdx)
{
if(strings[0].Length <= charIdx)
return false;
for (int strIdx = 1; strIdx < strings.Count; ++strIdx)
if (strings[strIdx].Length <= charIdx
|| strings[strIdx][charIdx] != strings[0][charIdx])
return false;
return true;
}

Finding # occurrences of a character in a string in Ruby

I was able to solve this by passing a string through scan as shown in another answer.

For example:

string = 'This is an example'
puts string.count('e')

Outputs:

2

I was also able to pull the occurrences by using scan and passing a sting through instead of regex which varies slightly from another answer but was helpful in order to avoid regex.

string = 'This is an example'
puts string.scan('e')

Outputs:

['e','e']

I explored these methods further in a guide I created after I figured it out.

How to (fastly) get a substring according to a specific pattern in python?

In addition to taking advantage of using types as suggested by @DavidW, using the appropriate data structures is important. Python's container objects like list are slow since they are not contiguous in memory, and they should be avoided when writing performance-minded cython code.

In this case, we can also save some of those for loops as well. Instead of iterating over theta twice for both the s1 and the s2 sublists, we can handle both strings at the same time. We can also compare the characters one by one, and then break out of the comparison early as soon as we hit the first character/nucleotide that is not matching.

Below is my cython version of your code, which should runs well over an order of magnitude faster than the question's code (about 1 second for 1 million iterations versus 40 seconds or so). I have added some comments that hopefully will be helpful. As for your concern about C string management, at least for simple one-off functions like this, as long as you call an appropriate free for each call to malloc, you should be fine. Since no char* had to be dynamically allocated directly at the C level via such means, there is no need for worrying about memory management here. Also, indexing into char* rather than str, especially in a tight loop like this, can avoid some slight python overhead.

from libc.stdint cimport int8_t
cimport cython

"""
I have included three @cython decorators here to save some checks.
The first one, boundscheck is the only useful one in this scenario, actually.
You can see their effect if you generate cython annotations!
Read more about it here:
http://cython.readthedocs.io/en/latest/src/reference/compilation.html
"""
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def fast_count_substr_matches(str s1, str s2, int k, int8_t[:] theta):
cdef int i, j, m#you used k, unfortunately, so stuck with m...
cdef bytes b1 = s1.encode("utf-8")
#alternatively, could just pass in bytes strings instead
#by prefixing your string literals like b'AAATCGGGT'
cdef bytes b2 = s2.encode("utf-8")
cdef char* c1 = b1
cdef char* c2 = b2
#python str objects have minor overhead when accessing them with str[index]
#this is why I bother converting them to char* at the start
cdef int count = 0
cdef bint comp#A C-type int that can be treated as python bool nicely

for i in range(len(s1) - k + 1):
for j in range(len(s2) - k + 1):
comp = True
for m in range(k):
if theta[m] == True and c1[i + m] != c2[j + m]:
comp = False
break
if comp:
count += 1
return count

Please let me know if there is anything in this answer that could be cleared up. Hope this helps!



Related Topics



Leave a reply



Submit