How to Split a Text into Sentences

How can I split a text into sentences?

The Natural Language Toolkit (nltk.org) has what you need. According to a group posting, this should do it:

import nltk.data

# Requires the Punkt tokenizer data to be installed for NLTK.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

Splitting text into sentences using regex in Python

Any ideas on how to remove the extra '' at the end of my current
output?

You could drop it with a slice, which returns a new list without the last element:

sentences[:-1]

Or faster, deleting the last element in place (by ᴄᴏʟᴅsᴘᴇᴇᴅ):

del result[-1]

Output:

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
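
For context, here is a minimal sketch of where the trailing '' comes from and two ways to drop it. The input string and the split pattern are assumptions rebuilt from the output above, not the asker's actual code.

```python
import re

# Assumed input, rebuilt from the output shown above.
text = ("The first time you see The Second Renaissance it may look boring. "
        "Look at it at least twice and definitely watch part 2. "
        "It will change your view of the matrix. "
        "Are the human people the ones who started the war? "
        "Is AI a bad thing?")

# Assumed pattern. The text ends with a delimiter, so re.split() returns an
# empty string for the "sentence" after the final "?".
result = re.split(r"[.?]\s*", text)
has_trailing_empty = result[-1] == ""

# Option 1: drop the last element in place.
del result[-1]

# Option 2: filter out every empty piece, wherever it occurs.
cleaned = [s for s in re.split(r"[.?]\s*", text) if s]
```

The filter is the safer choice when it is not certain that the text ends with a delimiter, since del would then remove a real sentence.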

Split string into sentences

Parsing sentences is far from a trivial task, even for languages written in the Latin alphabet, such as English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.

import java.text.BreakIterator;
import java.util.Locale;

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
        end != BreakIterator.DONE;
        start = end, end = iterator.next()) {
    System.out.println(source.substring(start, end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.

How do I split a string into sentences whilst including the punctuation marks?

Try splitting on a lookbehind:

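import re
sentences = re.split(r'(?<=[.?!])\s+', x)
print(sentences)

['This is an example sentence.', 'I want to include punctuation!',
'What is wrong with my code?', 'It makes me want to yell, "PLEASE HELP ME!"']

This regex trick works by splitting when we see a punctuation symbol immediately behind us. We then also match and consume the whitespace in front of us before we continue down the input string. Requiring at least one whitespace character (\s+ rather than \s*) matters on Python 3.7 and later, where re.split also splits on empty matches and would otherwise break between the final "!" and the closing quote.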
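
To make the effect of the lookbehind concrete, the sketch below compares it with a pattern that consumes the punctuation. The sample text is a shortened, assumed version of the input, and the variable names are illustrative only.

```python
import re

# Assumed sample input (shortened from the example above).
text = ("This is an example sentence. I want to include punctuation! "
        "What is wrong with my code?")

# The punctuation is part of the match here, so it is removed from every
# piece that is followed by whitespace.
consumed = re.split(r"[.?!]\s+", text)

# The lookbehind only asserts that punctuation precedes the split point;
# just the whitespace is matched and removed.
kept = re.split(r"(?<=[.?!])\s+", text)

print(consumed)
print(kept)
```

Only the last sentence keeps its "?" in the first result, because nothing follows it for the pattern to match.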

Here is my mediocre attempt to deal with the double quote problem:

x = 'This is an example sentence. I want to include punctuation! "What is wrong with my code?"  It makes me want to yell, "PLEASE HELP ME!"'
sentences = re.split(r'((?<=[.?!]")|((?<=[.?!])(?!")))\s*', x)
print(list(filter(None, sentences)))

['This is an example sentence.', 'I want to include punctuation!',
'"What is wrong with my code?"', 'It makes me want to yell, "PLEASE HELP ME!"']

Note that it correctly splits even sentences which end in double quotes.
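
A possible simplification, offered as a sketch rather than a replacement for the attempt above: requiring at least one whitespace character after the punctuation (or after punctuation plus a closing quote) removes the need for capture groups and for filtering out empty pieces. The pattern below is an assumption and only handles straight double quotes.

```python
import re

x = ('This is an example sentence. I want to include punctuation! '
     '"What is wrong with my code?"  It makes me want to yell, '
     '"PLEASE HELP ME!"')

# Split on whitespace that follows either punctuation plus a closing quote
# or bare punctuation. Punctuation directly followed by a quote has no
# whitespace after it, so no split happens inside the quotation.
pattern = r'(?:(?<=[.?!]")|(?<=[.?!]))\s+'
sentences = re.split(pattern, x)

print(sentences)
```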

Python - RegEx for splitting text into sentences (sentence-tokenizing)

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

Try this: split your string on this pattern. You can also check the demo:

http://regex101.com/r/nG1gU7/27
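
As a rough illustration of how the pattern behaves with re.split, here is a sketch on a made-up sample text (the text is an assumption, not taken from the question). The two negative lookbehinds skip whitespace after abbreviations such as "i.e." and titles such as "Mr.", and the pattern as written only splits after "." and "?", not after "!".

```python
import re

pattern = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"

# Hypothetical sample text containing a title, a decimal and an abbreviation.
text = ("Mr. Smith paid 1.5 million, i.e. a lot. Did he mind? "
        "Adam Jones Jr. thinks so.")

sentences = re.split(pattern, text)

for sentence in sentences:
    print(sentence)
```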

Split text into smaller paragraphs of a minimal length without breaking the sentences given a threshold

IIUC, you want to split the text on periods, but keep the chunks above a minimal length to avoid having very short sentences.

What you can do is split on the periods and join the pieces again until you reach a threshold (here 200 characters):

out = []
threshold = 200
for chunk in text.split('. '):
    # split() strips the '. ' separator, so restore the period; the final
    # chunk usually still ends with one and must not get a second.
    sentence = chunk if chunk.endswith('.') else chunk + '.'
    if out and len(chunk) + len(out[-1]) < threshold:
        out[-1] += ' ' + sentence
    else:
        out.append(sentence)

output:

['Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business.',
'As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently.',
'Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster.',
'That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.']
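
The same idea can be wrapped in a function and tried on a tiny input to see the merging at work. This is a sketch: the function name chunk_sentences, the sample text and the small threshold are assumptions chosen for illustration.

```python
def chunk_sentences(text, threshold=200):
    """Split on '. ' and merge pieces while the running chunk stays short."""
    out = []
    for chunk in text.split('. '):
        # Restore the period removed by split(), without doubling the
        # one that the last chunk may still carry.
        sentence = chunk if chunk.endswith('.') else chunk + '.'
        if out and len(chunk) + len(out[-1]) < threshold:
            out[-1] += ' ' + sentence
        else:
            out.append(sentence)
    return out


# Hypothetical sample with a deliberately small threshold.
sample = "One. Two. A much longer third sentence here. Four."
chunks = chunk_sentences(sample, threshold=12)
print(chunks)
```

With a threshold of 12, only the first two short sentences are merged; the long third sentence pushes every later comparison over the limit.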

Splitting text file into sentences

It's actually very simple: you are adding \n (a newline character) on every iteration. So, for example, when splitting Kek. it adds Kek\n and then .\n to the string variable.
You need to do something like this:

with open("text.txt") as file:
    for line in file:
        for l in re.split(r"(\. |\? |\! )", line):
            string += l
        # add the newline once per line, not after every piece
        string += '\n'
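
If the goal is one sentence per line with its punctuation attached, one possible approach is to pair each piece of text with the delimiter that follows it. This is a sketch under that assumption; the function name sentences_per_line is hypothetical, and it works on a string rather than a file so it can be tried directly.

```python
import re


def sentences_per_line(text):
    # The capturing group makes re.split() return the delimiters too, so
    # the result alternates: text, delimiter, text, delimiter, ..., text.
    pieces = re.split(r"(\. |\? |! )", text)
    lines = []
    for i in range(0, len(pieces), 2):
        sentence = pieces[i]
        if i + 1 < len(pieces):
            # Re-attach the punctuation, dropping its trailing space.
            sentence += pieces[i + 1].strip()
        if sentence:
            lines.append(sentence)
    return "\n".join(lines)


result = sentences_per_line("Kek. Lol? Done!")
print(result)
```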

