How can I split a text into sentences?
The Natural Language Toolkit (nltk.org) has what you need. According to this group posting, the following does it:
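import nltk.data

# load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
# print the sentences, separated by a divider line
print('\n-----\n'.join(tokenizer.tokenize(data)))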
(I haven't tried it!)
How to split string to substrings with given length but not breaking sentences?
The steps I'd take:
- Initialise a list to store the lines and a current line variable to store the string of the current line.
- Split the paragraph into sentences: this requires you to .split on '.', remove the trailing empty sentence (""), strip leading and trailing whitespace (.strip) and then add the full stops back.
- Loop through these sentences and:
  - if the sentence can be added onto the current line, add it
  - otherwise add the current working line string to the list of lines and set the current line string to be the current sentence
- After the loop, add whatever is left in the current line to the list of lines.
So, in Python, something like:
para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
width = 80
lines = []
line = ''
for sentence in (s.strip() + '.' for s in para.split('.')[:-1]):
    if not line:  # nothing on the current line yet => start it
        line = sentence
    elif len(line) + 1 + len(sentence) > width:  # can't fit on that line => start new one
        lines.append(line)
        line = sentence
    else:  # can fit => add a space then this sentence
        line += ' ' + sentence
if line:  # the last line is still pending after the loop
    lines.append(line)
giving lines as:
[
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam.",
"Nam sit amet iaculis lacus, non sagittis nulla.",
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales.",
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
]
Splitting text into sentences using regex in Python
Any ideas on how to remove the extra '' at the end of my current output?
You could remove it by slicing off the last element, which returns a new list:
sentences[:-1]
Or faster (by ᴄᴏʟᴅsᴘᴇᴇᴅ), delete the last element in place:
del result[-1]
Output:
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
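Dropping the last element only works when the text actually ends with a delimiter. As a minimal sketch (the function name and the regex are assumptions, since the question's own pattern is not shown here), the empty strings can be filtered out instead:

```python
import re

def split_sentences(text):
    """Split on '.', '!' or '?' and drop empty pieces (hypothetical helper)."""
    # re.split leaves a trailing '' when the text ends with a delimiter
    parts = re.split(r'[.!?]\s*', text)
    # keep only the non-empty pieces, wherever the empty ones occur
    return [part for part in parts if part]

print(split_sentences("It will change your view of the matrix. Is AI a bad thing?"))
# ['It will change your view of the matrix', 'Is AI a bad thing']
```

This also behaves sensibly when the text has no final punctuation, where sentences[:-1] would throw away a real sentence.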
Split text into smaller paragraphs of a minimal length without breaking the sentences given a threshold
IIUC, you want to split the text on dots, but keep the chunks from getting too short so that you don't end up with very short sentences on their own.
What you can do is split on the dots and join the pieces again until you reach a threshold (here 200 characters):
out = []
threshold = 200
# strip the final full stop so it is not doubled when '.' is added back below
for chunk in text.rstrip('.').split('. '):
    if out and len(chunk) + len(out[-1]) < threshold:
        out[-1] += ' ' + chunk + '.'
    else:
        out.append(chunk + '.')
output:
['Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business.',
'As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently.',
'Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster.',
'That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.']
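Note that the loop above stops merging once a chunk would reach the threshold, so it caps the chunk length rather than guaranteeing a minimum. If a strict minimum is what you need, one possible variation (not the method above; the function name and the min_len parameter are made up for illustration) is to keep appending sentences until the current chunk is long enough:

```python
def chunk_min_length(text, min_len=200):
    """Merge '. '-separated sentences until each chunk has at least min_len characters."""
    chunks = []
    # drop the final full stop so it is not doubled when '.' is added back
    for sentence in text.strip().rstrip('.').split('. '):
        sentence += '.'
        if chunks and len(chunks[-1]) < min_len:
            # the current chunk is still too short, so extend it
            chunks[-1] += ' ' + sentence
        else:
            chunks.append(sentence)
    return chunks

sample = "One. Two is longer. Three is the longest of all. End."
print(chunk_min_length(sample, min_len=10))
# ['One. Two is longer.', 'Three is the longest of all.', 'End.']
```

The last chunk can still fall short of min_len, as "End." does here, because there is nothing left to merge into it.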
Split string into sentences
Parsing sentences is far from a trivial task, even for languages written in the Latin script, such as English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.
A better approach is to use a BreakIterator configured with the right Locale.
import java.text.BreakIterator;
import java.util.Locale;

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
// each (start, end) pair of boundaries delimits one sentence
for (int end = iterator.next();
     end != BreakIterator.DONE;
     start = end, end = iterator.next()) {
    System.out.println(source.substring(start, end));
}
Yields the following result:
- This is a test.
- This is a T.L.A. test.
- Now with a Dr. in it.
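For comparison, here is a small Python sketch of what a naive split does with the same string. The regex is only an assumed stand-in for "the naive approach" (the question's own code is not shown here): it breaks after every '.', '!' or '?' that is followed by whitespace.

```python
import re

def naive_split(text):
    """Split after sentence punctuation followed by whitespace (illustrative only)."""
    return re.split(r'(?<=[.!?])\s+', text)

source = "This is a test. This is a T.L.A. test. Now with a Dr. in it."
for piece in naive_split(source):
    print(piece)
# This is a test.
# This is a T.L.A.
# test.
# Now with a Dr.
# in it.
```

The abbreviations "T.L.A." and "Dr." are treated as sentence ends, so the three sentences come out as five pieces.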
JS split text into sentences
Use $ to match the end of the string:
/[^\.!\?]+[\.!\?]+["']?|.+$/g
Or maybe you want to allow whitespace characters at the end:
/[^\.!\?]+[\.!\?]+["']?|\s*$/g
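As a rough illustration, the first pattern can be tried out in Python, where re.findall plays the role of JavaScript's match with the g flag. The function name is hypothetical, and the strip() call is an addition to trim the whitespace the pattern leaves at the start of each match:

```python
import re

# same pattern as the first regex above, without the redundant escapes
SENTENCE_RE = re.compile(r"""[^.!?]+[.!?]+["']?|.+$""")

def find_sentences(text):
    """Return punctuated sentences plus any unterminated remainder at the end."""
    return [match.strip() for match in SENTENCE_RE.findall(text)]

print(find_sentences('He said "Stop!" Then he left. No end'))
# ['He said "Stop!"', 'Then he left.', 'No end']
```

Without the `.+$` alternative, the trailing "No end" would be dropped because it has no closing punctuation.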