How to Get Word Frequencies Efficiently with Ruby

How do I get word frequencies efficiently in Ruby?

This works, but I am kinda new to Ruby too. There might be a better solution.

def count_words(string)
  words = string.split(' ')
  frequency = Hash.new(0)
  words.each { |word| frequency[word.downcase] += 1 }
  frequency
end

Instead of .split(' ') you could also use .scan(/\w+/); note, however, that .scan(/\w+/) splits "aren't" into aren and t, while .split(' ') keeps it intact.

Output of your example code:

print count_words('I was 09809 home -- Yes! yes!  You was')

#=> {"i"=>1, "was"=>2, "09809"=>1, "home"=>1, "--"=>1, "yes!"=>2, "you"=>1}

(Because split(' ') keeps punctuation attached, "--" and "yes!" survive as keys.)
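If you're on Ruby 2.7 or newer, Enumerable#tally builds the same frequency hash in a single call. The method name below is mine, chosen to avoid clashing with count_words above:

```ruby
# Ruby 2.7+: Enumerable#tally counts occurrences into a Hash.
def count_words_tally(string)
  string.downcase.split.tally
end

p count_words_tally('I was 09809 home -- Yes! yes!  You was')
#=> {"i"=>1, "was"=>2, "09809"=>1, "home"=>1, "--"=>1, "yes!"=>2, "you"=>1}
```

As with split(' '), punctuation stays attached to the words.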

How can I refactor my word frequency method?

This is a bit cleaner, using multiline method chaining:

def frequencies(text)
  words = text.split
  the_frequencies = Hash.new(0)
  words.each do |word|
    the_frequencies[word] += 1
  end
  the_frequencies
end

def pre_process_file(file_name)
  File.read(file_name.to_s)          # File.read closes the file for you; File.open without a block does not
      .downcase.strip.split.join(" ")
      .gsub(/[^a-zA-Z \'$]/, "")
      .gsub(/'s/, "")
      .split
end

def most_common_words(file_name, stop_words_file_name, number_of_word)
  # TODO: return hash of occurrences of the number_of_word most frequent words
  opened_file_string = pre_process_file(file_name)
  opened_stop_file_string = pre_process_file(stop_words_file_name)
  # declare variables for the file and the stop words
  filtered_array = opened_file_string
                   .reject { |n| opened_stop_file_string.include? n }

  the_frequencies = Hash.new(0)
  filtered_array.each { |word| the_frequencies[word] += 1 }
  the_frequencies
    .sort_by { |_k, value| value }
    .reverse[0..number_of_word - 1]
    .to_h
end
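The sort_by/reverse/slice tail can also be collapsed with Enumerable#max_by, which accepts a count and returns entries most frequent first. A file-free sketch of the same counting-and-ranking step (the helper name and sample words are made up):

```ruby
def most_common(words, stop_words, number_of_words)
  frequencies = Hash.new(0)
  # Array#- removes every occurrence of each stop word.
  (words - stop_words).each { |word| frequencies[word] += 1 }
  # max_by(n) picks the n highest counts, most frequent first.
  frequencies.max_by(number_of_words) { |_word, count| count }.to_h
end

p most_common(%w[the cat sat on the mat the cat], %w[the on], 2)
```

Ties between words with equal counts come back in unspecified order, so don't rely on which of them appears when the cutoff falls inside a tie.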

Finding words frequency of huge data in a database

Finding information in huge data sets is done by parallelizing the work and using a cluster rather than a single machine.

What you are describing is a classic map-reduce problem, which can be handled with the following functions (in pseudocode):

map(doc):
  for each word in doc:
    emitIntermediate(word, "1")

reduce(word, list<counts>):
  emit(word, size(list))

A map-reduce framework, of which there are implementations in many languages, lets you scale the problem to a huge cluster without much effort, taking care of failures and worker management for you.

Here, doc is a single document; the algorithm usually assumes a collection of documents. If you have only one huge document, you can of course split it into smaller documents and invoke the same algorithm.
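The pseudocode above maps directly onto Ruby's Enumerable primitives. A single-machine sketch of the map, shuffle, and reduce phases (all names here are mine, not from any particular framework):

```ruby
def map_phase(doc)
  # Emit an intermediate (word, 1) pair for every word in the document.
  doc.split.map { |word| [word.downcase, 1] }
end

def reduce_phase(pairs)
  # Shuffle: group intermediate pairs by word; reduce: emit (word, count).
  pairs.group_by { |word, _| word }
       .map { |word, group| [word, group.size] }
       .to_h
end

docs = ["to be or not", "not to be"]
pairs = docs.flat_map { |doc| map_phase(doc) }
p reduce_phase(pairs)
#=> {"to"=>2, "be"=>2, "or"=>1, "not"=>2}
```

In a real framework the flat_map over documents is what gets distributed across the cluster; each worker runs map_phase on its own documents, and the grouped pairs are shipped to reducers by key.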

Printing the n most frequent words in a file (string)

Two ways:

def word_counter(string, max)
  string.split(/\s+/)
        .group_by { |x| x }
        .map { |x, y| [x, y.size] }
        .sort_by { |_, size| size } # Have to sort =/
        .last(max)
end

def word_counter(string, max)
  # Create a Hash and a List to store values in.
  word_counter, max_storage = Hash.new(0), []

  # Split the string and add each word to the hash:
  string.split(/\s+/).each { |word| word_counter[word] += 1 }

  # Take each word and add it to the list (so that list_index = word_count).
  # I also add the count, but that is not really needed.
  word_counter.each { |key, val| max_storage[val] = [*max_storage[val]] << [key, val] }

  # Higher counts will always be at the end; remove nils and take the last "max" elements.
  max_storage.compact.flatten(1).last(max)
end
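Both versions return (word, count) pairs with the most frequent last. For a quick check, here is the first version again with a sample call (the input string is made up):

```ruby
def word_counter(string, max)
  string.split(/\s+/)
        .group_by { |x| x }              # {"a"=>["a"], "b"=>["b","b"], ...}
        .map { |x, y| [x, y.size] }      # [["a", 1], ["b", 2], ...]
        .sort_by { |_, size| size }      # ascending by count
        .last(max)                       # keep the max highest counts
end

p word_counter("a b b c c c", 2)
#=> [["b", 2], ["c", 3]]
```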

Ruby: Frequency and Alphabetizing

The data

If you are reading the text from a file named "my_new_book", you can "gulp" the whole file as a string, referenced by the variable text, like this:

text = File.read("my_new_book")

If you are not reading from a file, another way is to use a "here document", like this:

text = <<THE_END
It was the best
of times, it was
the worst of times
THE_END
#=> "It was the best\nof times, it was\nthe worst of times\n"

(with THE_END starting at the beginning of the line).

Walking through your code

Let's start by making

STOP_WORDS = %w{a and any be by for in it of that the their they then } 

a constant. (I dropped off a few to make it fit on one line.)

I was pleased to see that you created the array of stop words with %w. That saves time, reduces errors, and is more readable than having quotes around every word.

Next you have

word_arr = text.split

For the text in the here doc above,

text.split
#=> ["It", "was", "the", "best", "of", "times",
# "it", "was", "the", "worst", "of", "times"]

Notice that split (same as text.split(/\s+/)) splits the string on whitespace, not just spaces:

"lots    of whitespace\n\n\n\nhere".split
#=> ["lots", "of", "whitespace", "here"]

Before we split, we should first convert all the characters in text to lower-case:

text.downcase

There are two reasons to do this. One, as @Steve mentioned in a comment, is that we want words like "we" and "We" to be treated as identical for the purposes of determining frequency. Secondly, we want to remove stop words that are capitalized.

Now we can split the string and put the individual words in an array:

word_arr = text.downcase.split

Your line

words = ""

does nothing, because it is followed by

words = word_arr

which overwrites "".

But why create words when word_arr is perfectly fine? So forget words.

Your way of getting rid of the stop words is also very nice:

unique = word_arr - STOP_WORDS

But you completely undo that with

unique = word_arr

So get rid of that last statement. Also, unique is not a very good name here because many of the words that are left are probably not unique. Maybe something like nonstop_words. Hmmm. Maybe not. I'll leave that to you.

This is also very nice:

frequency = Hash.new(0) 
unique.each { |word| frequency[word] +=1 }

But not this:

new_frequency = frequency.sort_by {|k,v| k }

(but you have the right idea with sort_by) because that sorts on the keys, which are words. If you just wanted to sort on frequency, that would be:

new_frequency = frequency.sort_by {|k,v| v }

That gives you the least frequently-occurring words first. If you want the words that appear most frequently first (as I expect you do), you could write

new_frequency = frequency.sort_by {|k,v| v }.reverse

or

new_frequency = frequency.sort_by {|k,v| -v }

(Notice I'm saving to a new object--new_frequency--that makes debugging a lot easier.)

We still haven't dealt with the problem of words that have the same frequency. You want those sorted alphabetically. That's not a problem because Ruby sorts arrays "lexicographically". When sorting an array, Ruby compares each pair of elements with the method Array#<=>. Please read that doc for an explanation.

The upshot is that we can sort the way you want like this:

new_frequency = frequency.sort_by {|k,v| [-v, k] }

(This assumes you want words appearing most frequently first.) When ordering two words, Ruby first gives preference to the smaller value of -v (which is the bigger value of v); if that's the same for both words, it goes to k to break the tie.
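A tiny demonstration of that tie-breaking (the counts here are made up):

```ruby
frequency = { "worst" => 1, "times" => 2, "best" => 1, "was" => 2 }
p frequency.sort_by { |k, v| [-v, k] }
#=> [["times", 2], ["was", 2], ["best", 1], ["worst", 1]]
```

The two count-2 words come first, alphabetically ordered between themselves, and likewise the count-1 words.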

Improving your code

There's one more thing that should be done, and that is to write this in a more Ruby-like way, by "chaining" the various methods we've used above. This is what we have (I've gone back to using words rather than word_arr):

words = text.downcase.split
unique = words-STOP_WORDS
frequency = Hash.new(0)
unique.each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }

Now watch carefully as I pull the rabbit out of the hat. The above is the same as:

frequency = Hash.new(0) 
unique = text.downcase.split-STOP_WORDS
unique.each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }

which is the same as:

frequency = Hash.new(0) 
(text.downcase.split-STOP_WORDS).each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }

which is the same as:

frequency =
  (text.downcase.split - STOP_WORDS).each_with_object(Hash.new(0)) { |word, h|
    h[word] += 1 }
new_frequency = frequency.sort_by { |k, v| [-v, k] }

which is the same as:

new_frequency =
  (text.downcase.split - STOP_WORDS).each_with_object(Hash.new(0)) { |word, h|
    h[word] += 1 }.sort_by { |k, v| [-v, k] }

which we might wrap in a method:

def word_frequency(text)
  (text.downcase.split - STOP_WORDS).each_with_object(Hash.new(0)) { |word, h|
    h[word] += 1 }.sort_by { |k, v| [-v, k] }
end

On the other hand, you might not want to chain everything and may prefer to write some or all blocks with do-end:

def word_frequency(text)
  words = text.downcase.split - STOP_WORDS
  words.each_with_object(Hash.new(0)) do |word, h|
    h[word] += 1
  end.sort_by { |k, v| [-v, k] }
end

That's entirely up to you.
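Putting it all together with the here-doc text and the STOP_WORDS constant from earlier (restated here so the snippet runs on its own):

```ruby
STOP_WORDS = %w{a and any be by for in it of that the their they then}

def word_frequency(text)
  (text.downcase.split - STOP_WORDS)
    .each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
    .sort_by { |k, v| [-v, k] }
end

text = "It was the best\nof times, it was\nthe worst of times\n"
p word_frequency(text)
#=> [["was", 2], ["best", 1], ["times", 1], ["times,", 1], ["worst", 1]]
```

Notice that "times," keeps its trailing comma, because split does not strip punctuation; that cleanup would be a separate step.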

If you have any problem following any of the last bits, not to worry. I just wanted to give you a flavor for the power of the language, to show you what you can look forward to as you gain experience.

Array to Hash : words count

The imperative approach you used is probably the fastest implementation in Ruby. With a bit of refactoring, you can write a one-liner:

wf = Hash.new(0).tap { |h| words.each { |word| h[word] += 1 } }

Another imperative approach using Enumerable#each_with_object:

wf = words.each_with_object(Hash.new(0)) { |word, acc| acc[word] += 1 }

A functional/immutable approach using existing abstractions:

wf = words.group_by(&:itself).map { |w, ws| [w, ws.length] }.to_h

Note that this is still O(n) in time, but it traverses the collection three times and creates two intermediate objects along the way.
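Since Ruby 2.7 this abstraction is built into the standard library as Enumerable#tally, so on newer Rubies no gem is needed:

```ruby
words = %w[a b a c a b]
wf = words.tally   # counts occurrences, keys in first-appearance order
p wf
#=> {"a"=>3, "b"=>2, "c"=>1}
```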

Finally: a frequency counter/histogram is a common abstraction that you'll find in some libraries like Facets: Enumerable#frequency.

require 'facets'
wf = words.frequency

Ruby Text Analysis

The generalization of word frequencies is the language model, e.g. uni-grams (single-word frequencies), bi-grams (frequencies of word pairs), tri-grams (frequencies of word triples), ..., in general: n-grams.

You should look for an existing language-model toolkit; it is not a good idea to reinvent the wheel here.

There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.

These toolkits are typically written in C (for speed, since you have to process huge corpora) and produce output in the standard ARPA n-gram file format (typically plain text).

Check the following thread, which contains more details and links:

Building openears compatible language model

Once you have generated your language model with one of these toolkits, you will need either a Ruby gem that makes the language model accessible in Ruby, or you will need to convert the ARPA format into your own format.

adi92's post lists some more Ruby NLP resources.

You can also Google "ARPA language model" for more info.

Last but not least, check Google's online n-gram tool. They built n-grams from the books they digitized; it is also available for French and other languages!

Counting frequency of symbols

You don't need a range when you're counting every possible character, because every possible character is already in the domain. Create a range only when you specifically need a subset of that domain.

This is probably a faster implementation that counts all characters in the file:

def char_frequency(file_name)
  ret_val = Hash.new(0)
  File.open(file_name) { |file| file.each_char { |char| ret_val[char] += 1 } }
  ret_val
end

p char_frequency("1003v-mm") #=> {"\r"=>56, "\n"=>56, " "=>2516, "\xC9"=>2, ...

For reference I used this test file.
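To try the method without downloading the linked test file, you can point it at a throwaway Tempfile (the sample content is made up):

```ruby
require 'tempfile'

def char_frequency(file_name)
  ret_val = Hash.new(0)
  File.open(file_name) { |file| file.each_char { |char| ret_val[char] += 1 } }
  ret_val
end

result = nil
Tempfile.create('chars') do |f|
  f.write("aab\n")
  f.flush                     # make sure the bytes hit disk before re-reading
  result = char_frequency(f.path)
end
p result
#=> {"a"=>2, "b"=>1, "\n"=>1}
```

Tempfile.create with a block deletes the file when the block exits, so nothing is left behind.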


