Creating a New Corpus with NLTK

Creating a new corpus with NLTK

I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.

PlaintextCorpusReader's constructor:

def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):

You can pass the reader your own word and sentence tokenizers, but for the latter the default is already nltk.data.LazyLoader('tokenizers/punkt/english.pickle').

For a single string, a tokenizer would be used as follows (see the NLTK documentation on the punkt tokenizer):

>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
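
Putting the two together, here is a minimal sketch of passing the tokenizers to the reader explicitly; the folder name my_texts is an assumption, and the two tokenizer arguments merely restate the constructor defaults shown above.

import nltk
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import WordPunctTokenizer

# 'my_texts' is a placeholder folder of .txt files; the tokenizer
# arguments restate the defaults from the constructor above.
reader = PlaintextCorpusReader(
    'my_texts', r'.*\.txt',
    word_tokenizer=WordPunctTokenizer(),
    sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'))

print(reader.sents()[:2])  # sentences segmented by punkt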

How to create a corpus for sentiment analysis in NLTK?

The answer you refer to contains some very poor (or rather, inapplicable) advice. There is no reason to place your own corpus in nltk_data, or to hack nltk.corpus.__init__.py to load it like a native corpus. In fact, do not do these things.

You should use PlaintextCorpusReader. I don't understand your reluctance to do so, but if your files are plain text, it's the right tool to use. Supposing you have a folder NLP/bettertrainingdata, you can build a reader that will load all .txt files in this folder like this:

myreader = nltk.corpus.reader.PlaintextCorpusReader(r"NLP/bettertrainingdata", r".*\.txt")

If you add new files to the folder, the reader will find and use them. If what you want is to be able to use your script with other folders, you don't need a different reader; you need to learn about sys.argv. If you are after a categorized corpus with pos.txt and neg.txt, then you need a CategorizedPlaintextCorpusReader (see the sketch below). If it's something else you want, then please edit your question to explain what you are trying to do.
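
For the pos.txt / neg.txt case, a minimal sketch could look like this; the folder name and the idea of deriving each file's category from its name via cat_pattern are assumptions, not part of the original question.

import nltk

# Hypothetical layout: NLP/bettertrainingdata/pos.txt and neg.txt,
# with each file's category taken from its file name.
myreader = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"NLP/bettertrainingdata",
    r"(pos|neg)\.txt",
    cat_pattern=r"(pos|neg)\.txt")

print(myreader.categories())                 # ['neg', 'pos']
print(myreader.sents(categories="pos")[:2])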

Create Corpus using PlaintextCorpusReader and Analyze It

It looks like what you want to do is tokenize the plain text documents in the folder. If so, you do this by asking the PlaintextCorpusReader for the tokens, rather than passing the reader to a sentence tokenizer. So instead of

DNCtokens = sent_tokenize(DNClist)

please consider

DNCtokens = DNClist.sents()

to get the sentences, or

DNCtokens = DNClist.paras()

to get the paragraphs.

The source code for the reader shows that it holds a word tokenizer and a sentence tokenizer, and calls them to perform exactly the tokenization you appear to want.
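
As a sketch, assuming DNClist is a PlaintextCorpusReader over a folder of plain text files (the folder name is a placeholder):

from nltk.corpus.reader import PlaintextCorpusReader

# Assumed setup: DNClist reads all .txt files in a placeholder folder.
DNClist = PlaintextCorpusReader('DNC_documents', r'.*\.txt')

words = DNClist.words()  # word tokenizer applied to each file
sents = DNClist.sents()  # sentence tokenizer (punkt) applied to each file
paras = DNClist.paras()  # paragraphs, each a list of sentences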

Build custom corpus with labels from text documents using NLTK

I figured out the answer. Basically, I created a list of tuples and then categorized each line.

import numpy as np
import pandas as pd

transactions = []
with open('block_1.txt', 'r') as block1, \
     open('block_2.txt', 'r') as block2, \
     open('block_3.txt', 'r') as block3, \
     open('t_block_4.txt', 'r') as block4:
    transactions = ([(transaction, 'block_1') for transaction in block1.readlines()]
                    + [(transaction, 'block_2') for transaction in block2.readlines()]
                    + [(transaction, 'block_3') for transaction in block3.readlines()]
                    + [(transaction, 'block_4') for transaction in block4.readlines()])

corpus = np.array(transactions)
corpus_df = pd.DataFrame({'Document': [i[0] for i in transactions],
                          'Category': [i[1] for i in transactions]})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
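
A quick sanity check that each document got the right label, for example:

# Count documents per category (uses the corpus_df built above).
print(corpus_df['Category'].value_counts())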

Creating custom corpus in NLTK using CSV file

If you are unpacking or reading data from a CSV file, you can use Python's csv module. The following code opens the file and appends everything to a list, which you can then feed into a classifier.

import csv

training_set = []

with open('path/to/text.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        training_set.append((row['Text'], row['Classification']))

print(training_set)

If your classifier has the ability to be updated incrementally, you can skip creating the training_set list and just call .update(row['Text'], row['Classification']) inside the loop.
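
As a sketch of feeding training_set into a classifier, assuming a simple bag-of-words feature extractor (the features helper below is hypothetical, not part of the original answer):

import nltk

# Hypothetical bag-of-words feature extractor.
def features(text):
    return {word: True for word in nltk.word_tokenize(text)}

# Pair each featureset with its label, then train a Naive Bayes classifier.
labeled = [(features(text), label) for text, label in training_set]
classifier = nltk.NaiveBayesClassifier.train(labeled)
print(classifier.classify(features("some new text")))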

How to add a custom corpus to a local machine in NLTK

While you could hack the nltk to make your corpus look like a built-in nltk corpus, that's the wrong way to go about it. The nltk provides a rich collection of "corpus readers" that you can use to read your corpora from wherever you keep them, without moving them to the nltk_data directory or hacking the nltk source. The nltk's own corpora use the same corpus readers behind the scenes, so your reader will have all the methods and behavior of equivalent built-in corpora.

Let's see how the movie_reviews corpus is defined in nltk/corpus/__init__.py:

movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*',
    encoding='ascii')

You can ignore the LazyCorpusLoader part; it just defers loading, so that corpora your program never uses are not read into memory. The rest shows that the movie review corpus is read with a CategorizedPlaintextCorpusReader, that its files all end in .txt, and that the reviews are sorted into categories by being placed in the subdirectories pos and neg. Finally, the corpus encoding is ascii. So define your own corpus like this (changing values as needed):

mycorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"/home/user/path/to/my_corpus",
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
    encoding="ascii")

That's it; you can now call mycorpus.words(), mycorpus.sents(categories="neg"), etc., just as if it were a corpus provided by the nltk.
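
Once the reader exists, it behaves like any built-in corpus. A few quick checks, using the assumed path and neg/pos layout from above:

import nltk

# Quick checks on the reader defined above; the path and the neg/pos
# layout are the assumptions already made there.
print(mycorpus.categories())       # ['neg', 'pos']
print(mycorpus.fileids()[:3])

fd = nltk.FreqDist(mycorpus.words(categories="neg"))
print(fd.most_common(10))          # ten most frequent tokens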


