Array to Hash: Words Count

The imperative approach you used is probably the fastest implementation in Ruby. With a bit of refactoring, you can write a one-liner:

wf = Hash.new(0).tap { |h| words.each { |word| h[word] += 1 } }

Another imperative approach using Enumerable#each_with_object:

wf = words.each_with_object(Hash.new(0)) { |word, acc| acc[word] += 1 }

A functional/immutable approach using existing abstractions:

wf = words.group_by(&:itself).map { |w, ws| [w, ws.length] }.to_h

Note that this is still O(n) in time, but it traverses the collection three times and creates two intermediate objects along the way.

Finally: a frequency counter/histogram is a common abstraction that you'll find in some libraries like Facets: Enumerable#frequency.

require 'facets'
wf = words.frequency
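
If adding Facets is not an option, Ruby 2.7 and later ship the same abstraction as Enumerable#tally. A minimal sketch, assuming a small sample words array that is not taken from the original question:

```ruby
# Sample input (an assumption for illustration only).
words = %w[a hello hello b went a]

# Enumerable#tally (Ruby 2.7+) returns a hash of element => occurrence count.
wf = words.tally

wf["hello"] # => 2
wf["b"]     # => 1
```

On older Rubies the each_with_object version above is probably the closest equivalent.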

Count instances of string in a very large array and add the value to a hash value

First, build an object that maps each array value to its number of occurrences. Then loop through that object to create an array of objects:
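
var words = ["a", "hello", "hello", "b", "went", "a"];
var rObj = {};       // word -> number of occurrences
var finalArray = []; // one { word: count } object per distinct word

// First pass: count the occurrences of each word.
// forEach is used instead of map because the return value is not needed.
words.forEach(function(currentValue) {
  if (rObj.hasOwnProperty(currentValue)) {
    rObj[currentValue] = rObj[currentValue] + 1;
  } else {
    rObj[currentValue] = 1;
  }
});

// Second pass: turn the counts into an array of single-key objects.
for (var key in rObj) {
  var obj = {};
  obj[key] = rObj[key];
  finalArray.push(obj);
}

console.log(finalArray);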

Multiple Hash Tables for the Word Count Project

You are using a very bad hash function (adding all characters). That is why you get so many collisions, and why your Insert method ends up calling itself so many times.

For a detailed overview of different hash functions see the answer to this question. I suggest you try DJB2 or FNV-1a (which is used in some implementations of std::unordered_map).

You should also use more localized "probes" for the empty place to improve cache-locality and use a loop instead of recursion in your Insert method.

But first I suggest you tweak your HashEntry a little:

class HashEntry
{
public:
    string key;   // the word is actually a key, no need to store hash value
    size_t value; // the word count is the value.
    HashEntry(string key)
        : key(std::move(key)), value(1) // move the string to avoid unnecessary copying
    { }
};

Then let's try to use a better hash function:

// DJB2 hash-function
size_t Hash(const string &key)
{
    size_t hash = 5381;
    for (auto &&c : key)
        hash = ((hash << 5) + hash) + c;
    return hash;
}
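
To see why summing characters collides so often, the two hash functions can be compared on a pair of anagrams. The sketch below is in Ruby rather than C++ purely for brevity; additive_hash and djb2 are hypothetical helper names, and the 64-bit mask only mimics the wrap-around of a 64-bit size_t:

```ruby
MASK = 0xFFFF_FFFF_FFFF_FFFF # emulate 64-bit unsigned overflow

# The "bad" hash: the sum of all byte values, so character order is ignored.
def additive_hash(key)
  key.each_byte.sum
end

# DJB2: hash = hash * 33 + c, starting from 5381.
def djb2(key)
  key.each_byte.reduce(5381) { |hash, c| ((hash << 5) + hash + c) & MASK }
end

additive_hash("listen") == additive_hash("silent") # => true  (anagrams always collide)
djb2("listen") == djb2("silent")                   # => false
```

Because the additive hash ignores character order, every anagram lands in the same bucket, whereas DJB2 mixes in each character's position.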

Then rewrite the Insert function:

void Insert(string key)
{
    size_t index = Hash(key) % TABLE_SIZE;

    while (table[index] != nullptr) {
        if (table[index]->key == key) {
            ++table[index]->value;
            return;
        }
        ++index;
        if (index == TABLE_SIZE) // "wrap around" if we've reached the end of the hash table
            index = 0;
    }

    table[index] = new HashEntry(std::move(key));
}

To find the hash table entry by key you can use a similar approach:

HashEntry *Find(const string &key)
{
    size_t index = Hash(key) % TABLE_SIZE;

    while (table[index] != nullptr) {
        if (table[index]->key == key) {
            return table[index];
        }
        ++index;
        if (index == TABLE_SIZE)
            index = 0;
    }

    return nullptr;
}

Optimizing word count

I think a trie with the count as the leaves could be faster.

Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.

A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.

For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.

If you can perform this inline search on a word faster than the hashing algorithm can hash it, the trie should come out ahead.

However, this is total overkill. I rambled on because you said the question was purely hypothetical, so I figured you'd like a hypothetical type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
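
For illustration only, a counting trie might look like the sketch below. TrieNode and WordTrie are hypothetical names, the count is stored on the node where a word ends, and the prefix-skipping optimization described above is not implemented. In Ruby each node is itself backed by a Hash, so this shows the structure rather than any speed gain:

```ruby
# One node per character; count is how many times a word ended here.
class TrieNode
  attr_reader :children
  attr_accessor :count

  def initialize
    @children = {} # character => TrieNode
    @count = 0
  end
end

class WordTrie
  def initialize
    @root = TrieNode.new
  end

  # Walk (and extend) the trie one character at a time, then bump the count.
  def add(word)
    node = @root
    word.each_char { |ch| node = (node.children[ch] ||= TrieNode.new) }
    node.count += 1
  end

  # Returns how many times word was added (0 if it was never seen).
  def count(word)
    node = @root
    word.each_char do |ch|
      node = node.children[ch]
      return 0 if node.nil?
    end
    node.count
  end
end

trie = WordTrie.new
%w[tortoise hare tortoise].each { |w| trie.add(w) }
trie.count("tortoise") # => 2
trie.count("torto")    # => 0 (a prefix, not a stored word)
```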

Create hash from array and frequency

Do it as below:

def score( array )
  hash = Hash.new(0)  # missing keys default to 0
  array.each{|key| hash[key] += 1}
  hash
end
score([1,2,4,5,4,7]) # => {1=>1, 2=>1, 4=>2, 5=>1, 7=>1}

Or more Rubyish using Enumerable#each_with_object:

def score( array )
  array.each_with_object(Hash.new(0)){|key,hash| hash[key] += 1}
end
score([1,2,4,5,4,7]) # => {1=>1, 2=>1, 4=>2, 5=>1, 7=>1}

Why do you get NoMethodError: undefined method '+' for nil:NilClass?

hash = {} is an empty hash with a default value of nil. nil is an instance of NilClass, and NilClass doesn't have an instance method called #+, so hash[key] += 1 raises NoMethodError.

Look at the Hash::new documentation:

new → new_hash
new(obj) → new_hash
new {|hash, key| block } → new_hash

Returns a new, empty hash. If this hash is subsequently accessed by a key that doesn’t correspond to a hash entry, the value returned depends on the style of new used to create the hash. In the first form, the access returns nil. If obj is specified, this single object will be used for all default values. If a block is specified, it will be called with the hash object and the key, and should return the default value. It is the block’s responsibility to store the value in the hash if required.
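
A short sketch of the three forms side by side may help; the variable names plain, counts and lists are made up for this example:

```ruby
# No default: a missing key returns nil.
plain = {}
plain["x"]        # => nil; plain["x"] += 1 would raise NoMethodError

# Default object: a missing key returns 0, so += works on the first access.
counts = Hash.new(0)
counts["x"] += 1  # reads the default 0, then stores 1
counts["x"]       # => 1
counts["y"]       # => 0 (default returned, nothing stored)
counts.key?("y")  # => false

# Block form: the block runs for each missing key and stores a fresh array.
lists = Hash.new { |hash, key| hash[key] = [] }
lists["a"] << 1
lists["b"] << 2
lists["a"]        # => [1]
```

The block form matters for mutable defaults such as arrays: with Hash.new([]) every missing key would share the same array object.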

how to convert Array to Hash: integer count

This logic is incorrect:

time.each do |time|
  @array = Array(DateTime.strptime(time,'%m/%d/%Y %H:%M').hour)
end

The body of the loop is replacing the contents of @array every time, so you only get the last element when you're done. Also, using time as the inner variable name is confusing (and destructive in older versions of Ruby).

What you want is to append to the array with << inside the loop, instead of assigning to it with =:

@array = []
time.each do |t|
  @array << DateTime.strptime(t, '%m/%d/%Y %H:%M').hour
end

But that's a very procedural way of building an array, and not very Rubyish. The idiomatic way would be to use map to construct the new array all at once:

@array = time.map { |t| DateTime.strptime(t, '%m/%d/%Y %H:%M').hour }

As a side note, I'm not sure why you decided to make @array an instance variable. You could just call it array; using @ for arrays is a Perl thing, not Ruby.

Anyway, once you fix the creation of @array, your logic for building the count Hash should work as-is. You could, however, build it in a similar way by using reduce; this is just one possibility:

result = @array.reduce({}) { |h, t| h.merge(t => h.fetch(t, 0) + 1) }  # fetch supplies 0 for hours not seen yet

You can further simplify the logic by using a built-in method of Ruby arrays called group_by. This call:

time.group_by { |t| DateTime.strptime(t, '%m/%d/%Y %H:%M').hour }

returns this Hash:

{10=>["11/12/08 10:47"], 13=>["11/12/08 13:23", "11/12/08 13:30"], 14=>["11/12/08 14:04", "11/12/08 14:46"]}

That's close to what you want in result; all you have to do is replace those array values with their lengths. Fortunately, map works on Hashes, too, but what it returns is an array of arrays instead of another Hash, so you have to convert it back when you're done. This will do the trick:

result = Hash[time.group_by { |t| DateTime.strptime(t, '%m/%d/%Y %H:%M').hour }.map{|k,v| [k, v.length]}]
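
If the intermediate arrays are not needed, the counts can probably also be built in a single pass with each_with_object. A minimal sketch, assuming sample timestamps with four-digit years (the times array here is made up, not the asker's data):

```ruby
require 'date'

# Hypothetical sample input in the same month/day/year hour:minute layout.
times = ["11/12/2008 10:47", "11/12/2008 13:23", "11/12/2008 13:30"]

# Count how many timestamps fall in each hour; Hash.new(0) supplies the 0 default.
result = times.each_with_object(Hash.new(0)) do |t, counts|
  counts[DateTime.strptime(t, '%m/%d/%Y %H:%M').hour] += 1
end

result[10] # => 1
result[13] # => 2
```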

Create a hash of matching words, occurrence

Meditate on this:

'b c c d'.split # => ["b", "c", "c", "d"]
'b c c d'.split.group_by{ |w| w } # => {"b"=>["b"], "c"=>["c", "c"], "d"=>["d"]}
'b c c d'.split.group_by{ |w| w }.map{ |k, v| [k, v.count] } # => [["b", 1], ["c", 2], ["d", 1]]
'b c c d'.split.group_by{ |w| w }.map{ |k, v| [k, v.count] }.to_h # => {"b"=>1, "c"=>2, "d"=>1}

From that we can build:

dictionary = ['b', 'c']
word_count = 'b c c d'.split.group_by{ |w| w }.map{ |k, v| [k, v.count] }.to_h
word_count.values_at(*dictionary) # => [1, 2]

If you only want key/value pairs that are in the dictionary, you can do that easily (Hash#slice is built into Ruby 2.5 and later; the require is only needed on older versions):

require 'active_support/core_ext/hash/slice'
word_count.slice(*dictionary) # => {"b"=>1, "c"=>2}

group_by is a very useful method that groups by whatever criteria you pass to it. values_at takes a list of "keys" and returns their corresponding values.

There are potential problems when counting "words", because not all text results in what we'd consider a word after splitting it into its component sub-strings. For instance:

'how now brown cow.'.split # => ["how", "now", "brown", "cow."]

Notice that the final word has the punctuation included in the string. Similarly, compound words and other punctuation can cause problems:

'how-now brown, cow.'.split # => ["how-now", "brown,", "cow."]

The task then becomes how to keep those characters from being treated as parts of the words. The simplest thing is to strip them out:

'how-now brown, cow.'.gsub(/[^a-z]+/, ' ').split # => ["how", "now", "brown", "cow"]

In today's crazy age though, we see words that contain digits too, especially things like company and program names. You can modify the pattern in gsub above to handle that, but how is left for you to figure out.

We also see mixed case, so your dictionary needs to be folded to upper-case or lower-case, and the string being considered needs to also be folded the same way, unless you want to know the different counts when honoring character case:

word_count = 'b C c d'.downcase.split.group_by{ |w| w }.map{ |k, v| [k, v.count] }.to_h # => {"b"=>1, "c"=>2, "d"=>1}
word_count = 'b C c d'.split.group_by{ |w| w }.map{ |k, v| [k, v.count] }.to_h # => {"b"=>1, "C"=>1, "c"=>1, "d"=>1}
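
Putting the case folding and punctuation stripping together, one possible sketch is shown below. word_counts is a hypothetical helper, and the pattern /[a-z0-9]+/ is just one assumption about what counts as a word (lower-case ASCII letters and digits):

```ruby
# Fold case, pull out runs of letters/digits, and count each token.
def word_counts(text)
  text.downcase.scan(/[a-z0-9]+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

counts = word_counts('How-now brown, cow. 2nd COW!')
counts["cow"] # => 2
counts["2nd"] # => 1
counts["how"] # => 1
```

Using scan to pick out what a word is, rather than gsub to remove what it is not, keeps the rule in a single pattern that can be adjusted per source.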

Analyzing the content of pages often starts with this sort of code, but many rules have to be written to specify which words are useful and which are garbage. The rules also often change from one source to another, because differences in how each source uses words and numbers can quickly break your code:

second
2nd

for instance. It gets "interesting".

How to declare and assign a hash with count of occurrences in one statement

You can roll it into a subroutine ... or use one of those provided in libraries.

For a simple and fast element-frequency counter there is List::MoreUtils::frequency:

use List::MoreUtils qw(frequency);

my %freq_count = frequency LIST;

The LIST is any dynamically generated list (here, the one returned by lemmatizer(...)) or an array variable.

If you'd like to be able to fine-tune what is counted, there is count_by in List::UtilsBy:

use List::UtilsBy qw(count_by);

my %freq_count = count_by { $_ } LIST;

Again, LIST is any dynamically produced list or an array variable, and in your case that would be lemmatizer(...) (which returns a list).

The frequency count is returned for the values that the code inside the block evaluates to; each element in turn is provided in $_. So with a lone $_ the count is for the elements themselves.

Both List::MoreUtils and List::UtilsBy probably need to be installed from CPAN.


