Split Paragraphs into Sentences

How can I split a text into sentences?

The Natural Language Toolkit (nltk.org) has what you need. According to this group posting, the following does it:

import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

Splitting paragraph into sentences

This is not a regex for splitting directly, but a kind of workaround:

(?!Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.)(\b\S+[.?!]["']?)\s

You can replace each matched fragment with, for example, $1# (or another character that does not occur in the text, instead of #), and then split the result on #.
However, it is not a very elegant solution.
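
For anyone who wants to try the replace-then-split idea in Python, here is a minimal sketch. The helper name `split_sentences` and its `marker` parameter are made up for illustration, and it assumes the marker character never appears in the input text.

```python
import re

# Same pattern as above; the lookahead skips a few common abbreviations.
SENTENCE_END = re.compile(r"""(?!Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.)(\b\S+[.?!]["']?)\s""")

def split_sentences(text, marker="#"):
    # Hypothetical helper: append the marker after each sentence-ending word
    # (dropping the whitespace that followed it), then split on the marker.
    # Assumes `marker` does not occur anywhere in `text`.
    marked = SENTENCE_END.sub(lambda m: m.group(1) + marker, text)
    return marked.split(marker)

print(split_sentences("Mr. Smith went home. Dr. Jones stayed! Did he?"))
```

Abbreviations that are not listed in the lookahead will still be treated as sentence ends, so the list probably needs extending for real text.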

Split Paragraph Document into Sentences

For MySQL 8.0, you can use a recursive CTE, given its limitations.

with recursive r as (
    select
        1 id,
        cast(regexp_substr(
            Paragraph, '[^.!?]+(?:[.!?]+|$)'
        ) as char(256)) sentences,
        id doc_id, Title, Paragraph
    from master_data
    union all
    select
        id + 1,
        regexp_substr(
            Paragraph, '[^.!?]+(?:[.!?]+|$)',
            1, id + 1
        ),
        doc_id, Title, Paragraph
    from r
    where sentences is not null
)
select id, sentences, doc_id, Title
from r
where sentences is not null or id = 1
order by doc_id, id;

Output:

| id | sentences             | doc_id | Title  |
+----+-----------------------+--------+--------+
|  1 | I want.               |      1 | asds.. |
|  2 | Some.                 |      1 | asds.. |
|  3 | Coconut and Banana !! |      1 | asds.. |
|  1 | Milkshake?            |      2 | wad... |
|  2 | some Nice milk.       |      2 | wad... |
|  1 | bar                   |      3 | foo    |

Demo on DB Fiddle.
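
To see what the pattern does outside of MySQL, here is a rough Python sketch that applies the same regular expression with the standard `re` module. The function name `split_paragraph` is invented for this example, the sample string is borrowed from the output above, and Python's regex engine may differ from MySQL's in edge cases.

```python
import re

# Same pattern as in the query: a run of non-terminator characters,
# followed by one or more terminators or by the end of the text.
SENTENCE = re.compile(r"[^.!?]+(?:[.!?]+|$)")

def split_paragraph(paragraph):
    # Hypothetical helper: collect every match and trim surrounding spaces,
    # dropping pieces that are empty after trimming.
    pieces = (s.strip() for s in SENTENCE.findall(paragraph))
    return [s for s in pieces if s]

print(split_paragraph("I want. Some. Coconut and Banana !!"))
```

The `$` alternative is what lets a paragraph without any closing punctuation, such as `bar`, come back as a single sentence.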

Split text into smaller paragraphs of a minimal length without breaking the sentences given a threshold

IIUC, you want to split the text on dots, but keep the chunks above a minimal length so that you do not end up with very short ones.

What you can do is split on the dots and then join consecutive pieces again for as long as the combined length stays below a threshold (here 200 characters):

# text holds the input paragraph
out = []
threshold = 200
# Drop the trailing period so the last chunk does not end in '..'
for chunk in text.rstrip('. ').split('. '):
    if out and len(chunk) + len(out[-1]) < threshold:
        # Merge into the previous chunk while it stays under the threshold
        out[-1] += ' ' + chunk + '.'
    else:
        out.append(chunk + '.')

output:

['Marketing products and services is a demanding and tedious task in today’s overly saturated market. Especially if you’re in a B2B lead generation business.',
'As a business owner or part of the sales team, you really need to dive deep into understanding what strategies work best and how to appeal to your customers most efficiently.',
'Lead generation is something you need to master. Understanding different types of leads will help you sell your product or services and scale your business faster.',
'That’s why we’re explaining what warm leads are and how you can easily turn them into paying customers.']

Python - RegEx for splitting text into sentences (sentence-tokenizing)

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

Try this: split your string on this pattern. You can also check the demo.

http://regex101.com/r/nG1gU7/27
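
A minimal sketch of how the pattern might be used with `re.split` follows. The function name `sentence_tokenize` and the sample sentence are made up for illustration. Note that the last lookbehind only accepts `.` and `?`, so text ending in `!` is not split by this pattern.

```python
import re

# Split on whitespace that follows '.' or '?', unless the preceding text
# looks like "i.e." / "e.g." or a short title such as "Mr." / "Dr.".
SPLIT_ON = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

def sentence_tokenize(text):
    # Hypothetical helper wrapping re.split with the pattern above.
    return SPLIT_ON.split(text)

print(sentence_tokenize("Mr. Smith is here. Is he i.e. present? Yes."))
```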

Split paragraph into sentences with Python3

This is very likely better handled with nltk (having installed it correctly, that is):

from nltk.tokenize import sent_tokenize

string = "This is a sentence. This is another. And here one another, same line, starting with space. this sentence starts with lowercase letter. Here is a site you may know: google.com."

sent_tokenize_list = sent_tokenize(string)
print(sent_tokenize_list)
# ['This is a sentence.', 'This is another.', 'And here one another, same line, starting with space.', 'this sentence starts with lowercase letter.', 'Here is a site you may know: google.com.']

