Ruby Text Analysis

The generalization of word frequencies is the language model, e.g. uni-grams (= single-word frequencies), bi-grams (= frequencies of word pairs), tri-grams (= frequencies of word triples), ..., in general: n-grams.
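As a quick illustration of the idea, n-gram counts can be computed in a few lines of plain Ruby (a sketch for small texts, not a replacement for the toolkits discussed below):

```ruby
# Count n-grams (word sequences of length n) in a text using plain Ruby.
def ngram_counts(text, n)
  tokens = text.downcase.scan(/[a-z']+/) # crude tokenizer: lowercase words
  counts = Hash.new(0)
  tokens.each_cons(n) { |gram| counts[gram.join(" ")] += 1 }
  counts
end

text = "the cat sat on the mat and the cat slept"
puts ngram_counts(text, 1)["the"]     # 3  (unigram)
puts ngram_counts(text, 2)["the cat"] # 2  (bigram)
```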

You should look for an existing toolkit for language models; it is not a good idea to reinvent the wheel here.

There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.

These toolkits are typically written in C (for speed, since you have to process huge corpora) and produce output in the standard ARPA n-gram file format (typically a text format).

Check the following thread, which contains more details and links:

Building openears compatible language model

Once you have generated your language model with one of these toolkits, you will need either a Ruby gem that makes the language model accessible in Ruby, or to convert the ARPA format into your own format.
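To give a rough idea of what such a conversion involves, here is a minimal sketch of reading an ARPA file into a Ruby hash. It handles only the basic sections (the \data\ header, \N-grams: sections, and \end\); a real converter would need to cope with the full format:

```ruby
require "stringio"

# Minimal sketch: load an ARPA-format n-gram file into a nested Ruby hash,
# keyed first by n-gram order, then by the n-gram string itself.
def parse_arpa(io)
  model = Hash.new { |h, k| h[k] = {} }
  order = nil
  io.each_line do |line|
    line = line.strip
    case line
    when /\A\\(\d+)-grams:\z/ then order = $1.to_i # start of an n-gram section
    when /\A\\end\\\z/        then break
    when /\A\\/, /\Angram /, "" then next # \data\ header, count lines, blanks
    else
      next unless order
      fields = line.split(/\s+/)
      words  = fields[1, order].join(" ")
      model[order][words] = {
        logprob: fields[0].to_f,
        backoff: fields[order + 1]&.to_f # nil for highest-order n-grams
      }
    end
  end
  model
end

arpa = <<~ARPA
  \\data\\
  ngram 1=2
  ngram 2=1

  \\1-grams:
  -0.7 hello -0.3
  -0.7 world

  \\2-grams:
  -0.4 hello world

  \\end\\
ARPA

model = parse_arpa(StringIO.new(arpa))
puts model[2]["hello world"][:logprob] # -0.4
```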

adi92's post lists some more Ruby NLP resources.

You can also Google "ARPA Language Model" for more info.

Last but not least, check Google's online N-gram tool. They built n-grams based on the books they digitized; it is also available in French and other languages!

How to analyze text in Ruby?

1.) For abbreviations you could steal from here: https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/abbreviation.rb. As for acronyms, the list could be endless, so it really depends on what you are trying to do. You could potentially try a regular expression to extract acronyms.
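As a starting point, a heuristic regex for acronyms might look like the sketch below (runs of two or more capital letters, optionally period-separated); this is an illustration, not a robust extractor:

```ruby
# Heuristic acronym extraction: matches either period-separated capitals
# ("U.S.") or runs of two or more capital letters ("NASA", "HTK").
def extract_acronyms(text)
  text.scan(/\b(?:[A-Z]\.){2,}|\b[A-Z]{2,}\b/).uniq
end

extract_acronyms("NASA and the U.S. government fund the HTK toolkit")
# => ["NASA", "U.S.", "HTK"]
```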

2.) Not sure, you'll have to be more specific about what you are trying to accomplish.

3.) Use the lingua gem and check out this tutorial.

4.) Check out engtagger, a Ruby Part-Of-Speech Tagger Library.

5.) I am not aware of any library that can automatically detect correct grammar / punctuation errors (as there would be many cases where there is no clear cut correct answer). I did however make a gem where a human can correct a sentence and the gem will automatically show the diff between the incorrect sentence and correct sentence including the number of errors, type of errors, etc. It is called Chat Correct.
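To illustrate the idea only (this is not Chat Correct's actual API), a crude word-level difference count between an incorrect sentence and its human correction can be done with the standard library alone:

```ruby
# Naive positional word diff: counts positions where the two sentences
# disagree, plus any length mismatch. Real diffing (as in Chat Correct)
# would use an alignment algorithm rather than position-by-position compare.
def word_diff_count(incorrect, correct)
  a = incorrect.split
  b = correct.split
  same = a.zip(b).count { |x, y| x == y }
  [a.length, b.length].max - same
end

word_diff_count("He go to school", "He goes to school") # => 1
```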

6.) Check out the gem called verbs.

Sentiment Analysis with Ruby

I have used LIBLINEAR a lot for other classification tasks, but not for sentiment analysis. Are you interested in using LIBLINEAR specifically, or in doing sentiment analysis in general?
For simple sentiment analysis, have a look at
https://chrismaclellan.com/blog/sentiment-analysis-of-tweets-using-ruby
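In the spirit of that tutorial, a minimal lexicon-based scorer looks like this; the word lists are tiny illustrative stand-ins, not a real sentiment lexicon:

```ruby
# Toy lexicon-based sentiment: score = positive word count - negative word count.
POSITIVE = %w[good great love like happy].freeze
NEGATIVE = %w[bad hate awful terrible sad].freeze

def sentiment_score(text)
  words = text.downcase.scan(/[a-z']+/)
  words.count { |w| POSITIVE.include?(w) } -
    words.count { |w| NEGATIVE.include?(w) }
end

sentiment_score("I love this great gem") # => 2
sentiment_score("I hate bad weather")    # => -2
```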

Natural Language Processing in Ruby

There are some things at Ruby Linguistics, and some links from there, though it doesn't seem anywhere close to what NLTK is for Python yet.

NLP / Rails sentiment search

You seem to have at least two tasks: 1. sequence classification by topics; 2. sentiment analysis. [Edit: I only noticed now that you are using Ruby/Rails, and the code below is in Python. Maybe this answer is still useful, though, as the steps can be applied in any language.]

1. For sequence classification by topics, you can define categories simply with a list of words, as you said. Depending on the use case, this might be the easiest option. If such a list of words would be too time-intensive to create, you can use a pre-trained zero-shot classifier instead. I would recommend the zero-shot classifier from HuggingFace; see details with code here.

Applied to your use-case, this would look like this:

# pip install transformers   (run in a terminal first)
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

sequence = ["Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"]
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']

classifier(sequence, candidate_labels, multi_class=True)  # recent transformers versions renamed this parameter to multi_label

# output:
{'labels': ['photography', 'spain', 'chocolate', 'travel', 'father', 'cooking'],
'scores': [0.9802802205085754, 0.7929317951202393, 0.7469273805618286, 0.6030028462409973, 0.08006269484758377, 0.005216470453888178]}

The classifier returns scores depending on how certain it is that each candidate_label is represented in your sequence. It doesn't catch everything, but it works quite well and is fast to put into practice.

2. For sentiment analysis you can use HuggingFace's sentiment classification pipeline. In your use-case, this would look like this:

classifier = pipeline("sentiment-analysis")
sequence = ["I hate cooking"]
classifier(sequence)

# Output
[{'label': 'NEGATIVE', 'score': 0.9984041452407837}]

Putting 1. and 2. together:
I would probably (a) first split your entire text into sentences (see here how to do that); then (b) run the sentiment classifier on each sentence and discard those with a high negative sentiment score (see step 2 above); and then (c) run your labeling/topic classification on the remaining sentences (see step 1 above).
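For step (a), a naive sentence splitter in Ruby could look like the sketch below. Splitting on sentence-final punctuation breaks on abbreviations like "Dr. Smith", so use pragmatic_segmenter or a similar gem for real text; this only illustrates the shape of the step:

```ruby
# Naive sentence splitting: split after ., ! or ? followed by whitespace.
# The lookbehind keeps the punctuation attached to each sentence.
def split_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

split_sentences("I like cooking. I hate cleaning! Do you?")
# => ["I like cooking.", "I hate cleaning!", "Do you?"]
```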

How to use Stanford CoreNLP java library with Ruby for sentiment analysis?

As suggested in the comments by @Qualtagh, I decided to use JRuby.

I first attempted to do this in Java, using MongoDB as the interface (reading directly from MongoDB, analyzing with Java / CoreNLP, and writing back to MongoDB), but the MongoDB Java Driver was more complex to use than the Mongoid ORM I use with Ruby, which is why I felt JRuby was more appropriate.

Building a REST service in Java would have required me to first learn how to do that, which might have been easy, or not. I didn't want to spend time figuring it out.

So the code I needed to run my analysis was:

def analyze_tweet_with_corenlp_jruby
  require 'java'
  require 'vendor/CoreNLPTest2.jar' # a JAR I made with IntelliJ IDEA that bundles CoreNLP and my initialization class

  # com.me.Analyzer is the Java class I made for running the CoreNLP analysis;
  # it initializes CoreNLP with the correct annotators etc.
  analyzer = com.me.Analyzer.new
  result = analyzer.analyzeTweet(self.text) # self.text holds the text to be analyzed

  self.corenlp_sentiment = result # store the result in this field of the MongoDB model
  self.save!
  "#{result}: #{self.text}" # returned for debugging purposes
end

