N-Grams in Python, Four, Five, Six Grams

n-grams in python, four, five, six grams?

Great native Python-based answers have been given by other users. But here's the NLTK approach (just in case the OP gets penalized for reinventing what already exists in the NLTK library).

There is an ngrams module in NLTK that people seldom use. It's not because it's hard to read n-grams, but because training a model based on n-grams where n > 3 results in a lot of data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print(grams)

Computing N Grams using Python

Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:

grams = {}
for g in (' '.join(x) for x in ngrams(input, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1

Putting that all together into one final function gives:

def ngrams(input, n):
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}

Do we include all the combinations of n-grams in the actual analysis?

We may consider each bigram as a feature of different importance. Then the question can be reformulated as "How to choose the most important features?". As you have already mentioned, one way is to consider the top max features ordered by term frequency across the corpus. Other possible ways to choose the most important features are:

  • Apply the TF-IDF weighting scheme. You will also be able to control two additional hyperparameters: max document frequency and min document frequency;
  • Use Principal Component Analysis to select the most informative features from a big feature set;
  • Train any estimator in scikit-learn and then select the features from the trained model.

These are the most widespread methods of feature selection in the NLP field. It is still possible to use other methods like recursive feature elimination or sequential feature selection, but these methods become infeasible when the number of informative features is low (like 1000) and the total number of features is high (like 10000).
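
As a rough sketch of the first bullet above, scikit-learn's TfidfVectorizer exposes ngram_range together with max_df and min_df (the corpus and cut-off values below are made up purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]
vectorizer = TfidfVectorizer(ngram_range=(2, 2),  # bigram features
                             max_df=0.9,          # drop bigrams appearing in >90% of documents
                             min_df=1)            # keep bigrams seen in at least 1 document
X = vectorizer.fit_transform(corpus)              # TF-IDF weighted document-bigram matrix
print(vectorizer.get_feature_names_out())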

How to split a text into N-grams and get their offset

I did not find any native way to do this, so I implemented my own to fit my use case, using the align_tokens() function in NLTK.

It resembles something like this:

from nltk import word_tokenize, ngrams
from nltk.tokenize.util import align_tokens

# text, stopwords and n are assumed to be defined elsewhere
tokenized_text = [word for word in word_tokenize(text) if word.lower() not in stopwords]
alignment = align_tokens(tokenized_text, text)
tokenized_with_offset = [(tokenized_text[i], alignment[i]) for i in range(len(alignment))]
ngrams(tokenized_with_offset, n)

Analysis of most common n-grams

  • bag_of_words is the usual 2-dimensional document-ngram frequency matrix, i.e. it contains the frequency of each ngram in each document (the corpus might contain any number of documents).
  • sum_words holds the sum of the frequencies across documents for every ngram. It's a one-dimensional array whose length is the vocabulary size, with indexes in the same order as in bag_of_words. It doesn't contain the ngrams themselves, of course.

Since the goal is to obtain the top frequent ngrams, we need to match every ngram to its frequency in sum_words. This is why the vocabulary (which contains the ngrams) is iterated over together with both the ngram and its index: if only the index idx were obtained, there would be no way to know which actual ngram it represents. The index is then used to look up the total frequency in sum_words. Thus words_freq is an array containing an (ngram, frequency) pair for every ngram.

The last 2 lines sort this array by decreasing frequency and extract the top n elements.
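
For reference, here is a rough sketch of the kind of code being described (variable names follow the explanation above; the two-document corpus is invented):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vec = CountVectorizer(ngram_range=(2, 2))
bag_of_words = vec.fit_transform(corpus)          # document-ngram frequency matrix
sum_words = bag_of_words.sum(axis=0)              # total frequency of every ngram across documents
words_freq = [(ngram, sum_words[0, idx]) for ngram, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print(words_freq[:5])                             # top 5 most frequent ngrams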

Quick implementation of character n-grams for word

To generate bigrams:

In [8]: b='student'

In [9]: [b[i:i+2] for i in range(len(b)-1)]
Out[9]: ['st', 'tu', 'ud', 'de', 'en', 'nt']

To generalize to a different n:

In [10]: n=4

In [11]: [b[i:i+n] for i in range(len(b)-n+1)]
Out[11]: ['stud', 'tude', 'uden', 'dent']

What's the correct implementation of bag of n-grams?

I want to slightly expand on the other answer given, specifically regarding the "clearly wrong" part. While I agree that it is not the standard approach (to my knowledge!), there is an important definition in the mentioned book, just before the quoted excerpt, which states:

Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words.

(bold highlight by me). It seems that Chollet defines n-grams slightly differently from the common interpretation (namely, that an n-gram has to consist of exactly n words/chars etc.). With that, the subsequent example is entirely within the defined circumstances, although you will likely find varying implementations of this in the real world.
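
For what it's worth, NLTK itself ships an everygrams() helper that matches this "N or fewer" reading; a minimal sketch (min_len/max_len are NLTK's parameter names):

from nltk.util import everygrams

# all unigrams and bigrams of a short sentence
print(list(everygrams('the cat sat'.split(), max_len=2)))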

One example aside from the mentioned Tensorflow/NLTK implementations would be scikit-learn's TfidfVectorizer, which has the parameter ngram_range. This is basically something in between Chollet's definition and a strict interpretation: you can select an arbitrary minimum/maximum number of "grams" for a single unit, which are then built similarly to the above example, so that a single bag can contain both unigrams and bigrams, for example.
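
A minimal sketch of that behaviour, using CountVectorizer for brevity (the sentence is invented; TfidfVectorizer accepts the same ngram_range parameter):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))   # one bag holding both unigrams and bigrams
vec.fit(["the cat sat on the mat"])
print(vec.get_feature_names_out())
# e.g. ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']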

Finding Four Grams in an NLTK Corpus

To get n-grams of length n, you can use nltk.ngrams() with the desired n as the second argument.

In your example, to get four-grams, you can use nltk.ngrams(nltk.corpus.brown.words(), 4)
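
For example, to look at the most frequent four-grams in that corpus (a small sketch; it assumes the Brown corpus has been downloaded via nltk.download('brown')):

import nltk
from collections import Counter

fourgrams = nltk.ngrams(nltk.corpus.brown.words(), 4)
print(Counter(fourgrams).most_common(3))   # three most frequent four-grams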

Retrieve n-grams with word2vec

Generally, the very best way to determine "what kinds of results" you will get if you were to try certain things is to try those things, and observe the results you actually get.

In preparing text for word2vec training, it is not typical to convert an input text to the form you've shown, with a bunch of space-delimited word n-grams added. Rather, the string 'I am studying word2vec' would typically just be preprocessed/tokenized to a list of (unigram) tokens like ['I', 'am', 'studying', 'word2vec'].

The model will then learn one vector per single word – with no vectors for multigrams. And since it only knows such 1-word vectors, all the results it reports from .most_similar() will also be single words.
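
A toy sketch of that behaviour with gensim 4's Word2Vec (the two-sentence corpus is invented and far too small for meaningful vectors):

from gensim.models import Word2Vec

sentences = [['i', 'am', 'studying', 'word2vec'],
             ['i', 'am', 'learning', 'word2vec']]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1)
print(model.wv.most_similar('word2vec', topn=2))   # the reported neighbours are single tokens only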

You can preprocess your text to combine some words into multiword entities, based on some sort of statistical or semantic understanding of the text. Very often, this process converts the runs-of-related-words to underscore-connected single tokens. For example, 'I visited New York City' might become ['I', 'visited', 'New_York_City'].
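
One common way to do that kind of preprocessing (not part of the original answer) is gensim's Phrases model, sketched here on an invented corpus with deliberately low thresholds:

from gensim.models.phrases import Phrases, Phraser

sentences = [['i', 'visited', 'new', 'york', 'city'],
             ['new', 'york', 'city', 'is', 'large']]
phrases = Phrases(sentences, min_count=1, threshold=1)   # low thresholds so the toy corpus merges pairs at all
bigram = Phraser(phrases)
print(bigram[sentences[0]])   # e.g. ['i', 'visited', 'new_york', 'city']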

But any such preprocessing decisions are separate from the word2vec algorithm itself, which just considers whatever 'words' you feed it as 1:1 keys for looking-up vectors-in-training. It only knows tokens, not n-grams.

Mixing non-overlapping everygrams in order

Maybe it would be helpful to space your example out a bit:

(a , b , c , d)
(a , b , c d)
(a , b c , d)
(a b , c , d)
(a b , c d)
(a , b c d)
(a b c , d)
(a b c d) # added for completeness

Looking at that, it's evident that what differentiates the rows is the presence or absence of commas, a typical binary choice. There are three places a comma could go, so there are eight possibilities, corresponding to the eight binary numbers of three digits.

The easiest way to list these possibilities is to count from 0 0 0 to 1 1 1.
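
A small sketch of that counting idea (not part of the original answer): treat each of the three gaps between tokens as a bit that either splits (a comma) or joins.

from itertools import product

tokens = ['a', 'b', 'c', 'd']
for bits in product([0, 1], repeat=len(tokens) - 1):   # count from 0 0 0 to 1 1 1
    parts, current = [], [tokens[0]]
    for tok, split in zip(tokens[1:], bits):
        if split:                    # "comma": close the current part
            parts.append(current)
            current = [tok]
        else:                        # no comma: extend the current part
            current.append(tok)
    parts.append(current)
    print(parts)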


For your modified question, in which there is a maximum length of a part, one simple recursive solution in Python is:

def kgram(k, v):
    'Generate all partitions of v with parts no larger than k'
    def helper(sfx, m):
        if m == 0:
            yield sfx
        else:
            for i in range(1, min(k, m)+1):
                yield from helper([v[m-i:m]]+sfx, m-i)
    yield from helper([], len(v))

Here's a quick test:

>>> for p in kgram(3, 'one two three four five'.split()): print(p)
...
[['one'], ['two'], ['three'], ['four'], ['five']]
[['one', 'two'], ['three'], ['four'], ['five']]
[['one'], ['two', 'three'], ['four'], ['five']]
[['one', 'two', 'three'], ['four'], ['five']]
[['one'], ['two'], ['three', 'four'], ['five']]
[['one', 'two'], ['three', 'four'], ['five']]
[['one'], ['two', 'three', 'four'], ['five']]
[['one'], ['two'], ['three'], ['four', 'five']]
[['one', 'two'], ['three'], ['four', 'five']]
[['one'], ['two', 'three'], ['four', 'five']]
[['one', 'two', 'three'], ['four', 'five']]
[['one'], ['two'], ['three', 'four', 'five']]
[['one', 'two'], ['three', 'four', 'five']]

