What Is a Regular Expression for Parsing Out Individual Sentences

What is a regular expression for parsing out individual sentences?

Try this @"(\S.+?[.!?])(?=\s+|$)":

// Requires: using System; using System.Text.RegularExpressions;
string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str)) {
    int i = match.Index; // start offset of the sentence, if you need it
    Console.WriteLine(match.Value);
}

Results:

Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
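
For readers working in Python, the same pattern can be tried with the standard re module. This is only a sketch of the quick-and-dirty approach above; the name SENTENCE_RE is illustrative, and the sample text is the one from the C# snippet.

```python
import re

# Same pattern as the C# example: start at a non-space character, lazily
# consume text up to ., ! or ?, and require whitespace or end of input next.
SENTENCE_RE = re.compile(r"(\S.+?[.!?])(?=\s+|$)")

text = (
    "Hello world! How are you? I am fine. "
    "This is a difficult sentence because I use I.D.\n"
    "Newlines should also be accepted. "
    "Numbers should not cause sentence breaks, like 1.23."
)

sentences = [m.group(1) for m in SENTENCE_RE.finditer(text)]
for sentence in sentences:
    print(sentence)
```

Because . does not match a newline by default, a sentence that wraps across two lines would not be captured by this pattern.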

For more complicated text, of course, you will need a real parser like SharpNLP or NLTK. Mine is just a quick-and-dirty solution.

Here is the SharpNLP description and feature list:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

  • a sentence splitter
  • a tokenizer
  • a part-of-speech tagger
  • a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
  • a parser
  • a name finder
  • a coreference tool
  • an interface to the WordNet lexical database

What is a regular expression for parsing out individual Persian sentences?

How about this:

([^!؟.؛]+[؟.؛!])

It matches a run of characters that contains none of those punctuation marks, followed by exactly one of them.
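
A minimal Python sketch of how that pattern behaves. The sample text is an assumption made up for illustration, and the name PERSIAN_SENTENCE_RE is hypothetical.

```python
import re

# Pattern from the answer: a run of non-terminators, then one terminator.
PERSIAN_SENTENCE_RE = re.compile(r"([^!؟.؛]+[؟.؛!])")

# Illustrative sample text (an assumption, not from the original question).
text = "سلام. حال شما چطور است؟ من خوبم!"

# The pattern keeps the whitespace between sentences, so strip each match.
sentences = [s.strip() for s in PERSIAN_SENTENCE_RE.findall(text)]
for sentence in sentences:
    print(sentence)
```

Note that any trailing text without a closing punctuation mark is not matched at all.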

Regular Expression for parsing parts of a sentence

I used Splunk Interactive field extractor.

Use the following regexes in your search.

For Service type

| rex "(?i)^The\s(?P<ServiceType>[^ ]+)\sservice" 

For Service Status

| rex "(?i)sent\sa\s(?P<ServiceStatus>[^ ]+)"

Use the fields "ServiceType" and "ServiceStatus" for further results and charting.

\s matches any whitespace character; you can also use a literal space " " instead.
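
To check the two patterns outside Splunk, they can be run through Python's re module, which uses the same (?P<name>...) named-group syntax. The event text below is hypothetical, since the original log line is not shown here.

```python
import re

# Hypothetical event text; the original log line is not part of the answer.
event = "The Spooler service was successfully sent a stop control."

service_type = re.search(r"(?i)^The\s(?P<ServiceType>[^ ]+)\sservice", event)
service_status = re.search(r"(?i)sent\sa\s(?P<ServiceStatus>[^ ]+)", event)

# Collect the extracted fields, using None when a pattern does not match.
fields = {
    "ServiceType": service_type.group("ServiceType") if service_type else None,
    "ServiceStatus": service_status.group("ServiceStatus") if service_status else None,
}
print(fields)
```

Since [^ ]+ stops at the first space, a multi-word service name would not match the first pattern.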

Regexp for parsing words from sentence

$words = preg_split('#[\\s.]#', $string, -1, PREG_SPLIT_NO_EMPTY);

The \\s will match all whitespace characters (such as space, tab, newline, etc.). The . will match a literal period, because inside a character class it has no special meaning. If you want to split on more characters, just add them after the . (with the exceptions that a [, a ] and a # must be escaped with \\, and a - must be the first or last character in the list, or be escaped).

For your sentence above, it returns:

array(8) {
  [0]=>
  string(2) "My"
  [1]=>
  string(4) "name"
  [2]=>
  string(2) "is"
  [3]=>
  string(3) "Bob"
  [4]=>
  string(3) "I'm"
  [5]=>
  string(3) "104"
  [6]=>
  string(3) "yrs"
  [7]=>
  string(3) "old"
}
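
A rough Python counterpart, for comparison. The input sentence is reconstructed from the output above and is therefore an assumption; filtering out empty strings plays the role of PREG_SPLIT_NO_EMPTY.

```python
import re

# Assumed input, reconstructed from the dumped array above.
text = "My name is Bob. I'm 104 yrs old."

# Split on any whitespace character or a literal period, then drop the
# empty strings that appear between adjacent delimiters.
words = [w for w in re.split(r"[\s.]", text) if w]
print(words)
```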

Improve regex to Split large text into sentences

\p{Lu} indicates a Unicode uppercase letter (including accented letters etc.; note that \p{Lt} is the much smaller titlecase category), so

string[] sentences = Regex.Split(mytext, @"(?<=[.!?])\s+(?=\p{Lu})");

should do what you want.

(Note that . and ? do not need to be escaped inside a character class, so I've removed the backslashes too, but do check that this still works with those characters.)

However, note that this will still split after abbreviations, e.g. between "Mr." and "Jones".
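
Python's built-in re module does not support \p{...} property classes, so a direct port is not possible without a third-party library. The sketch below swaps in a different technique: it splits on whitespace after ., ! or ? and then re-joins any piece that does not start with an uppercase letter, using str.isupper() as the Unicode-aware check. The function name split_sentences is hypothetical.

```python
import re

def split_sentences(text):
    # Candidate boundaries: whitespace that follows ., ! or ?
    pieces = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    for piece in pieces:
        if sentences and not piece[:1].isupper():
            # Not followed by an uppercase letter: treat as a continuation.
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

result = split_sentences("It costs approx. ten euros. Élodie paid. Mr. Jones left.")
print(result)
```

Re-joining with a single space discards the original whitespace, and the "Mr." / "Jones" split is still there, as expected.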

Parse out single complete sentence containing a certain string pattern with regex

The solution is: [^。]*C[^。]*。

To explain why your original regex /。.+?C.+?。/ did not work:

  1. It first matches the leading 。.
  2. Then .+?C keeps consuming characters, including any 。, until it
    finds C, so it matches all of: xxxx。xxx。xxx。xxxxx。xxx
  3. Once C has been found, the last part of your regex, .+?。, takes
    over. It consumes everything up to the next 。.

    Therefore you get the result: 。xxxx。xxx。xxx。xxxxx。xxxCxxxx。

[^。]*C[^。]*。 works for you because:

  1. [^。]*C matches any run of characters other than 。 that is immediately
    followed by C, which gives xxxC
  2. [^。]*。 then matches any further characters other than 。, stops at
    the next 。, and matches it as well.
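
The difference between the two patterns can be checked with a short Python sketch. The sample text is made up for illustration, with Latin letters standing in for the sentence contents.

```python
import re

# Illustrative text: five sentences, only the fourth contains C.
text = "aaaa。bbb。ccc。ddCee。fff。"

# Original pattern: . also matches 。, so the match starts at the first 。
too_wide = re.search(r"。.+?C.+?。", text).group()

# Fixed pattern: [^。] cannot cross a sentence boundary.
exact = re.search(r"[^。]*C[^。]*。", text).group()

print(too_wide)
print(exact)
```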

Find out till where a regex satisfies a sentence

Assuming that your regex is rather simple, with no groups, backreferences, lookaheads, etc., e.g. as in your case, following the pattern \w[+*?]?, you can first split it up into parts, as you already do. But then instead of iteratively joining the parts and matching them against the entire string, you can test each part individually by slicing away the already matched parts.

import re

def match(pattern, string):
    res = pat = ""
    for p in re.findall(r"\w[+*?]?", pattern):
        m = re.match(p, string)  # try this part at the start of the remaining text
        if m:
            g = m.group()
            string = string[len(g):]  # slice away what was just matched
            res, pat = res + g, pat + p
        else:
            break
    return pat, res

Example:

>>> for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
...     print(*match("M+V?T*Z+", s))
...
M+V?T* MMMV
M+V?T* MMV
M+V?T*Z+ MTTZZZ
M+V?T*Z+ MVZZZ
M+V?T*Z+ MVTZ

Note, however, that in the worst case of a string of length n and a pattern of n parts, each matching just a single character, this will still take O(n²) time because of the repeated string slicing.

Also, this may fail if two consecutive parts match the same character: e.g. a?a+b (which should be equivalent to a+b) will match aab but not ab, because the single a is already "consumed" by the a?.

You could get the complexity down to O(n) by writing your own very simple regex matcher for that very reduced sort of regex, but in the average case that might not be worth it, or even slower.
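
A smaller step in that direction, short of writing a custom matcher, is to keep an offset instead of slicing: a compiled pattern's match(string, pos) method anchors the match at index pos, so the remainder of the string is not copied on every step. This is only a sketch under the same assumptions about the pattern format; the name match_prefix is hypothetical.

```python
import re

def match_prefix(pattern, string):
    """Return (matched pattern parts, matched text) without slicing the input."""
    pos = 0
    matched_parts = []
    for part in re.findall(r"\w[+*?]?", pattern):
        # Pattern.match(string, pos) only matches starting exactly at pos.
        m = re.compile(part).match(string, pos)
        if m is None:
            break
        matched_parts.append(part)
        pos = m.end()
    return "".join(matched_parts), string[:pos]

print(*match_prefix("M+V?T*Z+", "MVTZX"))
print(*match_prefix("a?a+b", "ab"))
```

The second call shows that the a?a+b limitation is unchanged: a? consumes the only a, so matching stops there.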


