N-Gram Generation from a Sentence

N-gram generation from a sentence

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. This class may be in a different package in newer versions of Lucene.

Word n-gram list of sentences in python

(Assuming you meant word n-grams rather than character n-grams.) I'm not sure whether your input is likely to contain duplicate sentences, but you can try a set of input sentences, and maybe a list comprehension:

%%timeit
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
n_grams = []
for i in range(len(sentence)):
    for n in range(2, 4):
        for item in ngrams(sentence[i].split(), n):
            n_grams.append(item)

Result:

1000 loops, best of 3: 228 µs per loop

Using just a list comprehension gives some improvement:

%%timeit
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
n_grams = [item for sent in sentence for n in range(2, 4) for item in ngrams(sent.split(), n)]

Result:

1000 loops, best of 3: 214 µs per loop

Another way is to use a set together with the list comprehension:

%%timeit
from nltk import ngrams
sentences = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
# use of set
sentence = set(sentences)
n_grams = [item for sent in sentence for n in range(2, 4) for item in ngrams(sent.split(), n)]

Result:

10000 loops, best of 3: 23.5 µs per loop

So, if there are a lot of duplicate sentences, it might help.
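
For reference, this is what the extracted n-grams themselves look like on a single short sentence (a minimal illustration of nltk.ngrams, separate from the benchmark above):

from nltk import ngrams

sent = 'i have an apple'
print([g for n in range(2, 4) for g in ngrams(sent.split(), n)])
# [('i', 'have'), ('have', 'an'), ('an', 'apple'),
#  ('i', 'have', 'an'), ('have', 'an', 'apple')]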

n-grams in python, four, five, six grams?

Other users have given great answers based on native Python. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module in nltk that people seldom use. It's not because n-grams are hard to read, but because training a model on n-grams where n > 3 results in a lot of data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print(grams)
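
This prints each 6-gram as a tuple; for illustration, the first couple of lines of output look like this:

('this', 'is', 'a', 'foo', 'bar', 'sentences')
('is', 'a', 'foo', 'bar', 'sentences', 'and')
...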

quicker way to detect n-grams in a string?

You could try something like this:

public class NGram {

    private final int n;
    private final String text;

    // indexes[] holds the start offsets of the last n words seen
    private final int[] indexes;
    private int index = -1;
    private int found = 0;

    public NGram(String text, int n) {
        this.text = text;
        this.n = n;
        indexes = new int[n];
    }

    // Advances to the end of the next complete n-gram; returns false
    // once the end of the text has been passed.
    private boolean seek() {
        if (index >= text.length()) {
            return false;
        }
        push();
        while (++index < text.length()) {
            if (text.charAt(index) == ' ') {
                found++;
                if (found < n) {
                    push();
                } else {
                    return true;
                }
            }
        }
        return true;
    }

    // Shifts the stored word-start offsets left and records the start of the next word.
    private void push() {
        for (int i = 0; i < n - 1; i++) {
            indexes[i] = indexes[i + 1];
        }
        indexes[n - 1] = index + 1;
    }

    private List<String> list() {
        List<String> ngrams = new ArrayList<String>();
        while (seek()) {
            ngrams.add(get());
        }
        return ngrams;
    }

    // The current n-gram runs from the oldest stored word start to the current position.
    private String get() {
        return text.substring(indexes[0], index);
    }
}

Testing on about 5 MB of text, it seems to perform about 10 times faster than the original code. The main difference here is that no regex is used to split the text, and the n-gram strings are not created by concatenation.

Update:
This is the output I get when running on the text mentioned above, for n-grams of size 1-4. I ran with 2 GB of memory to assess the impact of GC during the runs, and I ran multiple times to see the impact of the HotSpot compiler.

Loop 01 Code mine ngram 1 time 071ms ngrams 294121
Loop 01 Code orig ngram 1 time 534ms ngrams 294121
Loop 01 Code mine ngram 2 time 016ms ngrams 294120
Loop 01 Code orig ngram 2 time 360ms ngrams 294120
Loop 01 Code mine ngram 3 time 082ms ngrams 294119
Loop 01 Code orig ngram 3 time 319ms ngrams 294119
Loop 01 Code mine ngram 4 time 014ms ngrams 294118
Loop 01 Code orig ngram 4 time 439ms ngrams 294118

Loop 10 Code mine ngram 1 time 013ms ngrams 294121
Loop 10 Code orig ngram 1 time 268ms ngrams 294121
Loop 10 Code mine ngram 2 time 014ms ngrams 294120
Loop 10 Code orig ngram 2 time 323ms ngrams 294120
Loop 10 Code mine ngram 3 time 013ms ngrams 294119
Loop 10 Code orig ngram 3 time 412ms ngrams 294119
Loop 10 Code mine ngram 4 time 014ms ngrams 294118
Loop 10 Code orig ngram 4 time 423ms ngrams 294118

N-grams: Explanation + 2 applications

Word n-grams will generally be more useful for most of the text analysis applications you mention, with the possible exception of language detection, where something like character trigrams might give better results. Effectively, you would create an n-gram vector for a corpus of text in each language you are interested in detecting and then compare the frequencies of trigrams in each corpus to the trigrams in the document you are classifying. For example, the trigram 'the' probably appears much more frequently in English than in German and would provide some level of statistical correlation. Once you have your documents in n-gram format, you have a choice of many algorithms for further analysis: Bayesian filters, nearest-neighbor classifiers, support vector machines, etc.
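
As a rough sketch of the language-detection idea (my own illustration; the tiny corpora, the scoring rule, and the helper names char_trigram_profile and similarity are made up for the example), you could compare character-trigram frequencies like this:

from collections import Counter

def char_trigram_profile(text):
    # Frequency of every character trigram in the text.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Tiny stand-ins for per-language training corpora.
english_profile = char_trigram_profile('the quick brown fox jumps over the lazy dog ' * 50)
german_profile = char_trigram_profile('der schnelle braune fuchs springt ueber den faulen hund ' * 50)

def similarity(document, profile):
    # Sum of corpus frequencies of the trigrams that occur in the document.
    doc_profile = char_trigram_profile(document)
    return sum(profile[t] * c for t, c in doc_profile.items())

doc = 'the cat sat on the mat'
scores = {'English': similarity(doc, english_profile), 'German': similarity(doc, german_profile)}
print(max(scores, key=scores.get))  # 'English' for this toy example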

Of the applications you mention, machine translation is probably the most far-fetched, as n-grams alone will not take you very far down the path. Converting an input file to an n-gram representation is just a way to put the data into a format for further feature analysis, but as you lose a lot of contextual information, it may not be useful for translation.

One thing to watch out for is that it isn't enough to create a vector [1,1,1,2,1] for one document and a vector [2,1,2,4] for another document if the dimensions don't match. That is, the first entry in the vector cannot correspond to 'the' in one document and to 'is' in another, or the algorithms won't work. You will wind up with vectors like [0,0,0,0,1,1,0,0,2,0,0,1], as most documents will not contain most of the n-grams you are interested in. This 'lining up' of features is essential, and it requires you to decide 'in advance' which n-grams you will be including in your analysis. Often this is implemented as a two-pass algorithm: first decide the statistical significance of the various n-grams to determine which ones to keep. Google 'feature selection' for more information.
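
A minimal sketch of this feature alignment, assuming the vocabulary has already been chosen in advance (the vocabulary and the documents below are made up for illustration):

from collections import Counter
from nltk import ngrams

# Decided 'in advance': every document is mapped onto these same bigrams, in this order.
vocabulary = [('i', 'like'), ('like', 'apples'), ('so', 'much'), ('an', 'apple')]

def to_vector(text):
    counts = Counter(ngrams(text.split(), 2))
    # Same dimension, and the same meaning per position, for every document.
    return [counts[gram] for gram in vocabulary]

print(to_vector('i like apples so much'))  # [1, 1, 1, 0]
print(to_vector('i have an apple'))        # [0, 0, 0, 1]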

Word-based n-grams plus support vector machines are an excellent way to perform topic spotting, but you need a large corpus of text pre-classified into 'on topic' and 'off topic' to train the classifier. You will find a large number of research papers explaining various approaches to this problem on a site like citeseerx. I would not recommend the Euclidean distance approach to this problem, as it does not weight individual n-grams based on statistical significance, so two documents that both include 'the', 'a', 'is', and 'of' would be considered a better match than two documents that both included 'Bayesian'. Removing stop words from your n-grams of interest would improve this somewhat.
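
If you want a concrete starting point, scikit-learn (not mentioned in the answer above; just one common choice) bundles the n-gram extraction, stop-word removal, and SVM training into a few lines. The documents and labels below are placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ['budget vote in parliament', 'striker scores twice in final',
        'senate passes new tax bill', 'coach praises goalkeeper']
labels = ['politics', 'sport', 'politics', 'sport']

# Word unigrams and bigrams, with English stop words removed.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X = vectorizer.fit_transform(docs)

clf = LinearSVC()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(['new bill on the budget'])))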

Fast n-gram calculation

Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.

I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.

def ngrams(tokens, MIN_N, MAX_N):
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]
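
For example (my own sample input), the generator yields each n-gram as a slice of the token list:

tokens = 'the quick brown fox'.split()
print(list(ngrams(tokens, 2, 3)))
# [['the', 'quick'], ['the', 'quick', 'brown'],
#  ['quick', 'brown'], ['quick', 'brown', 'fox'],
#  ['brown', 'fox']]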

Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.

Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)

    join_spaces = " ".join

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1

    return count
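
As a quick sanity check (my own example input, and assuming the function above has been compiled as a Cython module with the standard Cython tooling), the counting variant returns a dict of space-joined n-grams:

tokens = 'a b a b'.split()
print(dict(ngrams(tokens, 2, 2)))  # {'a b': 2, 'b a': 1}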

