How to get word frequencies efficiently in Ruby?
This works but I am kinda new to Ruby too. There might be a better solution.
def count_words(string)
  words = string.split(' ')
  frequency = Hash.new(0)
  words.each { |word| frequency[word.downcase] += 1 }
  frequency
end
Instead of .split(' '), you could also do .scan(/\w+/); however, .scan(/\w+/) would separate aren and t in "aren't", while .split(' ') won't.
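To see the difference concretely (a small sketch; the sample string is my own):

```ruby
"aren't ok".split(' ')   #=> ["aren't", "ok"]
"aren't ok".scan(/\w+/)  #=> ["aren", "t", "ok"]
```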
Output of your example code:
print count_words('I was 09809 home -- Yes! yes! You was');
#{"i"=>1, "was"=>2, "09809"=>1, "home"=>1, "yes"=>2, "you"=>1}
How can I refactor my word frequency method?
This is a bit cleaner; it uses multiline method chaining, among other things.
def frequencies(text)
  words = text.split
  the_frequencies = Hash.new(0)
  words.each do |word|
    the_frequencies[word] += 1
  end
  the_frequencies
end
def pre_process_file(file_name)
  File.open(file_name.to_s)
      .read.downcase.strip.split.join(" ")
      .gsub(/[^a-zA-Z \'$]/, "")
      .gsub(/'s/, "")
      .split
end
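Applied to an in-memory string instead of a file (the sample text is my own), the same cleanup pipeline behaves like this:

```ruby
text = "The cat's hat -- 42 times!"
cleaned = text.downcase.strip.split.join(" ")
              .gsub(/[^a-zA-Z \'$]/, "")
              .gsub(/'s/, "")
              .split
#=> ["the", "cat", "hat", "times"]
```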
def most_common_words(file_name, stop_words_file_name, number_of_word)
  # TODO: return hash of occurrences of the number_of_word most frequent words
  opened_file_string = pre_process_file(file_name)
  opened_stop_file_string = pre_process_file(stop_words_file_name)
  # Declare variables for the file_name stop words.
  filtered_array = opened_file_string
                   .reject { |n| opened_stop_file_string.include? n }
  the_frequencies = Hash.new(0)
  filtered_array.each { |word| the_frequencies[word] += 1 }
  the_frequencies
    .sort_by { |_k, value| value }
    .reverse[0..number_of_word - 1]
    .to_h
end
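If you are on Ruby 2.2 or newer, Enumerable#max_by accepts a count argument, which can replace the sort_by/reverse/slice sequence at the end (a sketch with made-up data):

```ruby
the_frequencies = { "the" => 5, "cat" => 2, "sat" => 1 }
number_of_word = 2

top = the_frequencies.max_by(number_of_word) { |_word, count| count }.to_h
#=> {"the"=>5, "cat"=>2}
```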
Finding words frequency of huge data in a database
Finding information in huge data sets is done by parallelizing the work and using a cluster rather than a single machine.
What you are describing is a classic map-reduce problem, that can be handled using the following functions (in pseudo code):
map(doc):
  for each word in doc:
    emitIntermediate(word, "1")

reduce(word, list<counts>):
  emit(word, size(list))
The map-reduce framework, which is implemented in many languages, allows you to easily scale the problem and use a huge cluster without much effort, taking care of failures and worker management for you.
Here, doc is a single document; the algorithm usually assumes a collection of documents. If you have only one huge document, you can of course split it into smaller documents and invoke the same algorithm.
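The two phases can be simulated locally in plain Ruby (a sketch, not a real distributed setup; the sample documents are my own):

```ruby
docs = ["the cat sat", "the cat"]

# Map: emit an intermediate (word, 1) pair for every word in every document.
pairs = docs.flat_map { |doc| doc.split.map { |word| [word, 1] } }

# Shuffle: group the intermediate pairs by word.
grouped = pairs.group_by { |word, _count| word }

# Reduce: for each word, emit the size of its list of counts.
counts = grouped.transform_values { |list| list.sum { |_word, n| n } }
#=> {"the"=>2, "cat"=>2, "sat"=>1}
```

A real framework distributes the map and reduce calls across workers; the phase boundaries are the same.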
Printing the n most frequent words in a file (string)
Two ways:
def word_counter(string, max)
  string.split(/\s+/)
        .group_by { |x| x }
        .map { |x, y| [x, y.size] }
        .sort_by { |_, size| size } # Have to sort =/
        .last(max)
end
def word_counter(string, max)
  # Create a Hash and a List to store values in.
  word_counter, max_storage = Hash.new(0), []
  # Split the string and add each word to the hash:
  string.split(/\s+/).each { |word| word_counter[word] += 1 }
  # Take each word and add it to the list (so that list_index = word_count).
  # I also add the count, but that is not really needed.
  word_counter.each { |key, val| max_storage[val] = [*max_storage[val]] << [key, val] }
  # Higher counts will always be at the end; remove nils and get the last "max" elements.
  max_storage.compact.flatten(1).last(max)
end
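Both versions return [word, count] pairs with the most frequent last; a quick check of the first one (the sample string is my own):

```ruby
def word_counter(string, max)
  string.split(/\s+/)
        .group_by { |x| x }
        .map { |x, y| [x, y.size] }
        .sort_by { |_, size| size }
        .last(max)
end

word_counter("the cat sat on the mat", 1)
#=> [["the", 2]]
```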
Ruby: Frequency and Alphabetizing
The data
If you are reading the text from a file named "my_new_text", you can "gulp" the whole file into a string, referenced by the variable text, like this:
text = File.read("my_new_text")
If you are not reading from a file, another way is to use a "here document", like this:
text = <<THE_END
It was the best
of times, it was
the worst of times
THE_END
#=> "It was the best\nof times, it was\nthe worst of times\n"
(with THE_END
starting at the beginning of the line).
Walking through your code
Let's start by making
STOP_WORDS = %w{a and any be by for in it of that the their they then }
a constant. (I dropped off a few to make it fit on one line.)
I was pleased to see that you created the array of stop words with %w
. That saves time, reduces errors, and is more readable than having quotes around every word.
Next you have
word_arr = text.split
For the text in the here doc above,
text.split
#=> ["It", "was", "the", "best", "of", "times",
# "it", "was", "the", "worst", "of", "times"]
Notice that split
(same as text.split(/\s+/)
) splits the string on whitespace, not just spaces:
"lots of whitespace\n\n\n\nhere".split
#=> ["lots", "of", "whitespace", "here"]
Before we split
, we should first convert all the characters in text
to lower-case:
text.downcase
There are two reasons to do this. One, as @Steve mentioned in a comment, is that we want words like "we" and "We" to be treated as identical for the purposes of determining frequency. Secondly, we want to remove stop words that are capitalized.
Now we can split the string and put the individual words in an array:
word_arr = text.downcase.split
Your line
words = ""
does nothing, because it is followed by
words = word_arr
which overwrites ""
.
But why create words
when word_arr
is perfectly fine? So forget words
.
Your way of getting rid of the stop words is also very nice:
unique = word_arr - STOP_WORDS
But you completely undo that with
unique = word_arr
So get rid of that last statement. Also, unique
is not a very good name here because many of the words that are left are probably not unique. Maybe something like nonstop_words
. Hmmm. Maybe not. I'll leave that to you.
This is also very nice:
frequency = Hash.new(0)
unique.each { |word| frequency[word] +=1 }
But not this:
new_frequency = frequency.sort_by {|k,v| k }
(but you have the right idea with sort_by
) because that sorts on the keys, which are words. If you just wanted to sort on frequency, that would be:
new_frequency = frequency.sort_by {|k,v| v }
That gives you the least frequently-occurring words first. If you want the words that appear most frequently first (as I expect you do), you could write
new_frequency = frequency.sort_by {|k,v| v }.reverse
or
new_frequency = frequency.sort_by {|k,v| -v }
(Notice I'm saving to a new object--new_frequency
--that makes debugging a lot easier.)
We still haven't dealt with the problem of words that have the same frequency. You want those sorted alphabetically. That's not a problem because Ruby sorts arrays "lexicographically". When sorting an array, Ruby compares each pair of elements with the method Array#<=>. Please read that doc for an explanation.
The upshot is that we can sort the way you want like this:
new_frequency = frequency.sort_by {|k,v| [-v, k] }
(This assumes you want words appearing most frequently first.) When ordering two words, Ruby first gives preference to the smaller value of -v
(which is the bigger value of v
); if that's the same for both words, it goes to k
to break the tie.
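A small demonstration of the tie-breaking (the sample data is my own):

```ruby
frequency = { "times" => 2, "worst" => 1, "it" => 2, "best" => 1 }

sorted = frequency.sort_by { |k, v| [-v, k] }
#=> [["it", 2], ["times", 2], ["best", 1], ["worst", 1]]
```

Note that "it" and "times" (both with count 2) and "best" and "worst" (both with count 1) come out alphabetically within their groups.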
Improving your code
There's one more thing that should be done, and that is to write this in a more Ruby-like way, by "chaining" the various methods we've used above. This is what we have (I've gone back to using words
rather than word_arr
):
words = text.downcase.split
unique = words-STOP_WORDS
frequency = Hash.new(0)
unique.each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }
Now watch carefully as I pull the rabbit out of the hat. The above is the same as:
frequency = Hash.new(0)
unique = text.downcase.split-STOP_WORDS
unique.each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }
which is the same as:
frequency = Hash.new(0)
(text.downcase.split-STOP_WORDS).each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }
which is the same as:
frequency =
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h|
    h[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }
which is the same as:
new_frequency =
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h|
    h[word] +=1 }.sort_by {|k,v| [-v, k] }
which we might wrap in a method:
def word_frequency(text)
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h|
    h[word] +=1 }.sort_by {|k,v| [-v, k] }
end
On the other hand, you might not want to chain everything and may prefer to write some or all blocks with do-end:
def word_frequency(text)
  words = text.downcase.split-STOP_WORDS
  words.each_with_object(Hash.new(0)) do |word,h|
    h[word] +=1
  end.sort_by { |k,v| [-v, k] }
end
That's entirely up to you.
If you have any problem following any of the last bits, not to worry. I just wanted to give you a flavor for the power of the language, to show you what you can look forward to as you gain experience.
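Here is the finished method run end to end (a sketch: I have shortened the stop list and added "was" to it so the sample sentence comes out clean):

```ruby
STOP_WORDS = %w{a and any be by for in it of that the their they then was}

def word_frequency(text)
  (text.downcase.split - STOP_WORDS).each_with_object(Hash.new(0)) { |word, h|
    h[word] += 1 }.sort_by { |k, v| [-v, k] }
end

word_frequency("It was the best of times it was the worst of times")
#=> [["times", 2], ["best", 1], ["worst", 1]]
```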
Array to Hash : words count
The imperative approach you used is probably the fastest implementation in Ruby. With a bit of refactoring, you can write a one-liner:
wf = Hash.new(0).tap { |h| words.each { |word| h[word] += 1 } }
Another imperative approach using Enumerable#each_with_object
:
wf = words.each_with_object(Hash.new(0)) { |word, acc| acc[word] += 1 }
A functional/immutable approach using existing abstractions:
wf = words.group_by(&:itself).map { |w, ws| [w, ws.length] }.to_h
Note that this is still O(n) in time, but it traverses the collection three times and creates two intermediate objects along the way.
Finally: a frequency counter/histogram is a common abstraction that you'll find in some libraries like Facets: Enumerable#frequency.
require 'facets'
wf = words.frequency
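On Ruby 2.7 and later, the same abstraction ships with the language as Enumerable#tally, so no gem is needed:

```ruby
words = %w{the cat sat on the mat the}

wf = words.tally
#=> {"the"=>3, "cat"=>1, "sat"=>1, "on"=>1, "mat"=>1}
```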
Ruby Text Analysis
The generalization of word frequencies is Language Models, e.g. uni-grams (= single word frequency), bi-grams (= frequency of word pairs), tri-grams (= frequency of word triples), ..., in general: n-grams.
You should look for an existing toolkit for Language Models; it is not a good idea to re-invent the wheel here.
There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.
These toolkits are typically written in C (for speed, because you have to process huge corpora) and generate output in the standard ARPA n-gram file format (typically a text format).
Check the following thread, which contains more details and links:
Building openears compatible language model
Once you generated your Language Model with one of these toolkits, you will need either a Ruby Gem which makes the language model accessible in Ruby, or you need to convert the ARPA format into your own format.
adi92's post lists some more Ruby NLP resources.
You can also Google for "ARPA Language Model" for more info.
Last but not least, check Google's online N-gram tool. They built n-grams based on the books they digitized, which are also available in French and other languages!
Counting frequency of symbols
You won't need a range when you're trying to count every possible character, because the set of all characters is already the entire domain. You should only create a range when you specifically need a subset of that domain.
This is probably a faster implementation that counts all characters in the file:
def char_frequency(file_name)
  ret_val = Hash.new(0)
  File.open(file_name) { |file| file.each_char { |char| ret_val[char] += 1 } }
  ret_val
end
p char_frequency("1003v-mm") #=> {"\r"=>56, "\n"=>56, " "=>2516, "\xC9"=>2, ...
For reference I used this test file.
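If you do want only a subset of characters (say, ASCII letters), you can filter inside the same loop. A sketch, where letter_frequency and the throwaway file are my own:

```ruby
require "tempfile"

def letter_frequency(file_name)
  freq = Hash.new(0)
  File.open(file_name) do |file|
    # Count only ASCII letters, skipping digits, whitespace, and punctuation.
    file.each_char { |char| freq[char] += 1 if char.match?(/[a-zA-Z]/) }
  end
  freq
end

# Demonstration against a temporary file.
Tempfile.create("sample") do |f|
  f.write("ab1 c\nab")
  f.flush
  p letter_frequency(f.path)  #=> {"a"=>2, "b"=>2, "c"=>1}
end
```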