Split a Text into Sentences

How can I split a text into sentences?

The Natural Language Toolkit (nltk.org) has what you need. According to a group posting, this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:  # close the file automatically
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

How to split string to substrings with given length but not breaking sentences?

The steps I'd take:

  • Initialize a list to store the lines and a current-line variable to hold the string of the current line.
  • Split the paragraph into sentences - this requires you to .split on '.', remove the trailing empty sentence (""), strip leading and trailing whitespace (.strip) and then add the full stops back.
  • Loop through these sentences and:

    • if the sentence can be added onto the current line, add it
    • otherwise add the current working line string to the list of lines and set the current line string to be the current sentence

So, in Python, something like:

para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
lines = []
line = ''
for sentence in (s.strip() + '.' for s in para.split('.')[:-1]):
    if line and len(line) + len(sentence) + 1 >= 80:  # can't fit on that line => start new one
        lines.append(line)
        line = sentence
    elif line:  # can fit => add a space then this sentence
        line += ' ' + sentence
    else:  # first sentence => no leading space
        line = sentence
if line:  # don't drop the last line
    lines.append(line)

giving lines as:

[
"Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
"Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla.",
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales.",
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
]

Splitting text into sentences using regex in Python

Any ideas on how to remove the extra '' at the end of my current
output?

You could remove it by slicing off the last element:

sentences[:-1]

Or, faster, delete it in place (suggested by ᴄᴏʟᴅsᴘᴇᴇᴅ):

del result[-1]

Output:

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
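To see where the trailing '' comes from, here is a minimal sketch. The question's original text and regex are not shown here, so the sample string and the pattern below are assumptions made for illustration:

```python
import re

# Hypothetical sample; it ends with a delimiter, like the text in the question.
text = "Is AI a bad thing? It will change your view. Watch part 2!"

# Splitting on the terminators leaves an empty string after the final one.
parts = re.split(r"[.!?]\s*", text)

# Filtering out empty pieces works whether or not the text ends with a delimiter.
sentences = [p for p in parts if p]
print(sentences)
```

Filtering is a little more forgiving than `del parts[-1]`, which would remove a real sentence if the text did not end with punctuation.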

Split text into smaller paragraphs of a minimal length without breaking the sentences given a threshold

IIUC, you want to split the text on dots, but merge pieces up to a minimum length so that you don't end up with very short chunks.

What you can do is to split on the dots and join again until you reach a threshold (here 200 characters):

out = []
threshold = 200
# text is the input string; drop its final '.' so the last chunk doesn't end in '..'
for chunk in text.rstrip('.').split('. '):
    if out and len(chunk) + len(out[-1]) < threshold:
        out[-1] += ' ' + chunk + '.'  # still under the threshold => extend the last chunk
    else:
        out.append(chunk + '.')  # start a new chunk

output:

['Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business.',
'As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently.',
'Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster.',
'That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.']
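A variation on the same idea, as a sketch rather than the answer's own method: splitting with a regex lookbehind keeps each sentence's original punctuation ('!' and '?' included), so nothing has to be added back. The function name merge_sentences and the sample text are made up for illustration:

```python
import re

def merge_sentences(text, threshold=200):
    """Merge each sentence into the previous chunk while the result stays under threshold."""
    out = []
    # Split on whitespace that follows '.', '!' or '?'; the punctuation stays attached.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if out and len(out[-1]) + 1 + len(sentence) < threshold:
            out[-1] += " " + sentence
        else:
            out.append(sentence)
    return out

sample = "One. Two is longer! Is three a question? Four."
chunks = merge_sentences(sample, threshold=20)
print(chunks)
```

With threshold=20, only the first two sentences are short enough to be merged into a single chunk.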

Split string into sentences

Parsing sentences is far from trivial, even for languages written in the Latin script, such as English. A naive approach like the one you outline in your question will fail often enough to be useless in practice.
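As a quick illustration of how a naive split goes wrong, here is a small Python sketch. The question's own approach is not shown here, so splitting on ". " is an assumption about what "naive" means:

```python
source = "This is a test. This is a T.L.A. test. Now with a Dr. in it."

# Naive approach: treat every ". " as a sentence boundary.
naive = [s for s in source.split(". ") if s]

for piece in naive:
    print(repr(piece))
```

The three sentences come out as five pieces, because the abbreviations "T.L.A." and "Dr." are also followed by a space.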

A better approach is to use a BreakIterator configured with the right Locale.

import java.text.BreakIterator;
import java.util.Locale;

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
// Walk the boundaries pairwise: [start, end) is one sentence.
for (int end = iterator.next();
        end != BreakIterator.DONE;
        start = end, end = iterator.next()) {
    System.out.println(source.substring(start, end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.

JS split text into sentences

Use $ to match the end of the string:

/[^\.!\?]+[\.!\?]+["']?|.+$/g

Or maybe you want to allow whitespace characters at the end:

/[^\.!\?]+[\.!\?]+["']?|\s*$/g
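For comparison, the first pattern carries over to Python's re module almost unchanged; only the escapes that are redundant inside a character class are dropped. This is a sketch, and the sample text is made up for illustration:

```python
import re

# Same idea as the first JS regex: a run of non-terminators, then one or more
# terminators and an optional closing quote; otherwise the rest of the string.
SENTENCE_RE = re.compile(r"""[^.!?]+[.!?]+["']?|.+$""")

text = "Hello world. How are you? No ending here"  # hypothetical sample
sentences = [m.strip() for m in SENTENCE_RE.findall(text)]
print(sentences)
```

The final piece has no terminator, so it is picked up by the `.+$` alternative.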

