Creating a new corpus with NLTK
I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.
PlaintextCorpusReader's constructor:
def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):
You can pass the reader a word and sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
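Concretely, a reader with a non-default word tokenizer might be built like this (the folder and file below are created on the fly purely for illustration):

```python
import os
import tempfile
import nltk
from nltk.tokenize import RegexpTokenizer

# Illustrative one-file corpus.
root = tempfile.mkdtemp()
with open(os.path.join(root, "doc.txt"), "w") as f:
    f.write("Dr. Smith arrived. He sat down.")

# Override the word tokenizer; the sentence tokenizer keeps its
# punkt default.
reader = nltk.corpus.reader.PlaintextCorpusReader(
    root, r".*\.txt",
    word_tokenizer=RegexpTokenizer(r"\w+"))

print(reader.words())  # punctuation dropped by the custom tokenizer
```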
For a single string, a tokenizer would be used as follows (see the punkt section of the nltk.tokenize documentation):
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
['Punkt knows that the periods in Mr. Smith and Johann S. Bach\ndo not mark sentence boundaries.', 'And sometimes sentences\ncan start with non-capitalized words.', 'i is a good variable\nname.']
How to create a corpus for sentiment analysis in NLTK?
The answer you refer to contains some very poor (or rather, inapplicable) advice. There is no reason to place your own corpus in nltk_data, or to hack nltk/corpus/__init__.py to load it like a native corpus. In fact, do not do these things.
You should use PlaintextCorpusReader. I don't understand your reluctance to do so, but if your files are plain text, it's the right tool to use. Supposing you have a folder NLP/bettertrainingdata, you can build a reader that will load all .txt files in this folder like this:
myreader = nltk.corpus.reader.PlaintextCorpusReader(r"NLP/bettertrainingdata", r".*\.txt")
If you add new files to the folder, the reader will find and use them. If what you want is to be able to use your script with other folders, then just do so; you don't need a different reader, you need to learn about sys.argv. If you are after a categorized corpus with pos.txt and neg.txt, then you need a CategorizedPlaintextCorpusReader (which see). If it's something else yet that you want, then please edit your question to explain what you are trying to do.
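For instance, the reader construction can be wrapped in a function and the folder name supplied from the command line via sys.argv (the demo folder and file below are created on the fly so the sketch runs standalone):

```python
import os
import nltk

def build_reader(folder):
    # A PlaintextCorpusReader over every .txt file in folder.
    return nltk.corpus.reader.PlaintextCorpusReader(folder, r".*\.txt")

# In a real script the folder would come from the command line:
#   folder = sys.argv[1]
# Here a small demo folder is created so the sketch runs standalone.
folder = "bettertrainingdata_demo"
os.makedirs(folder, exist_ok=True)
with open(os.path.join(folder, "example.txt"), "w") as f:
    f.write("Some training text.")

myreader = build_reader(folder)
print(myreader.fileids())
```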
Create Corpus using PlainTextCorpusReader and Analyzing It
It looks like what you want to do is tokenize the plain text documents in the folder. If so, ask the PlaintextCorpusReader for the tokens, rather than passing the PlaintextCorpusReader to the sentence tokenizer. So instead of
DNCtokens = sent_tokenize(DNClist)
please consider
DNCtokens = DNClist.sents()
to get the sentences, or
DNCtokens = DNClist.paras()
to get the paragraphs.
The source code for the reader shows that it holds a word tokenizer and a sentence tokenizer, and calls them to do exactly the tokenization you appear to want.
build custom corpus with labels from text documents using nltk
I figured out the answer. Basically, I created a list of (line, label) tuples and then categorized each line.
import numpy as np
import pandas as pd

transactions = []
with open('block_1.txt', 'r') as block1, open('block_2.txt', 'r') as block2, \
     open('block_3.txt', 'r') as block3, open('t_block_4.txt', 'r') as block4:
    transactions = ([(transaction, 'block_1') for transaction in block1.readlines()]
                    + [(transaction, 'block_2') for transaction in block2.readlines()]
                    + [(transaction, 'block_3') for transaction in block3.readlines()]
                    + [(transaction, 'block_4') for transaction in block4.readlines()])

corpus = np.array(transactions)
corpus_df = pd.DataFrame({'Document': [i[0] for i in transactions],
                          'Category': [i[1] for i in transactions]})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
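The same labeled frame can be built more compactly with a loop over the file names (a sketch; the demo files are created here so the example runs standalone, whereas in the answer above they already exist on disk):

```python
import os
import tempfile
import pandas as pd

# Create placeholder input files (contents are made up for illustration).
workdir = tempfile.mkdtemp()
files = ["block_1.txt", "block_2.txt", "block_3.txt", "t_block_4.txt"]
labels = ["block_1", "block_2", "block_3", "block_4"]
for name in files:
    with open(os.path.join(workdir, name), "w") as f:
        f.write("transaction a\ntransaction b\n")

# One pass per file: pair each line with its file's label.
transactions = []
for name, label in zip(files, labels):
    with open(os.path.join(workdir, name)) as f:
        transactions.extend((line, label) for line in f)

corpus_df = pd.DataFrame(transactions, columns=["Document", "Category"])
print(corpus_df.head())
```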
Creating custom corpus in NLTK using CSV file
If you are unpacking or reading data from a CSV file, you can use Python's csv module. The following code opens the file and appends every row to a list, which you can then feed into the classifier.
import csv

training_set = []
with open('path/to/text.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        training_set.append((row['Text'], row['Classification']))
print(training_set)
If your classifier has the ability to be updated, then you can skip creating the training_set list and just call .update(row['Text'], row['Classification']) inside the loop.
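With the list built above, feeding it to an NLTK classifier might look like this (the sample rows and the bag-of-words feature extractor are made-up assumptions; the question does not specify either):

```python
import nltk

# Hypothetical labeled rows standing in for the CSV contents.
training_set = [("great fun movie", "pos"),
                ("truly awful movie", "neg")]

def features(text):
    # Minimal bag-of-words feature extractor (an assumption, not
    # something prescribed by the answer above).
    return {word: True for word in text.lower().split()}

train_data = [(features(text), label) for text, label in training_set]
classifier = nltk.NaiveBayesClassifier.train(train_data)
print(classifier.classify(features("a great movie")))
```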
How to add a custom corpora to local machine in nltk
While you could hack the nltk to make your corpus look like a built-in nltk corpus, that's the wrong way to go about it. The nltk provides a rich collection of "corpus readers" that you can use to read your corpora from wherever you keep them, without moving them to the nltk_data directory or hacking the nltk source. The nltk's own corpora use the same corpus readers behind the scenes, so your reader will have all the methods and behavior of the equivalent built-in corpora.
Let's see how the movie_reviews corpus is defined in nltk/corpus/__init__.py:
movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*',
    encoding='ascii')
You can ignore the LazyCorpusLoader part; it is there so that the nltk's dozens of corpora are not actually loaded unless your program uses them. The rest shows that the movie review corpus is read with a CategorizedPlaintextCorpusReader, that its files all end in .txt, and that the reviews are sorted into categories by being placed in the subdirectories pos and neg. Finally, the corpus encoding is ascii. So define your own corpus like this (changing values as needed):
mycorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"/home/user/path/to/my_corpus",
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
    encoding="ascii")
That's it; you can now call mycorpus.words(), mycorpus.sents(categories="neg"), etc., just as if this were a corpus provided by the nltk.
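End to end, the whole recipe can be sketched like this (the directory layout and review texts are made up for illustration; the reader call mirrors the one above, minus the ascii encoding since the demo text is plain ASCII anyway):

```python
import os
import tempfile
import nltk

# Lay out a tiny pos/neg corpus on disk: one review file per category.
root = tempfile.mkdtemp()
for cat, text in [("pos", "Great film. Loved it."),
                  ("neg", "Terrible film. Hated it.")]:
    os.makedirs(os.path.join(root, cat))
    with open(os.path.join(root, cat, "review1.txt"), "w") as f:
        f.write(text)

# Read it back exactly like the movie_reviews corpus is read.
mycorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    root, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')

print(mycorpus.categories())          # ['neg', 'pos']
print(mycorpus.words(categories="pos"))
```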