What is a regular expression for parsing out individual sentences?
Try this: @"(\S.+?[.!?])(?=\s+|$)"
string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
// A sentence starts at a non-space character and ends at the first . ! or ?
// that is followed by whitespace or the end of the input.
Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str)) {
    Console.WriteLine(match.Value);
}
Results:
Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
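For comparison, the same pattern can be tried in Python. This is only a sketch: the helper name split_sentences is made up for illustration, and it assumes Python's re module handles this pattern the same way .NET does for this input.

```python
import re

# Same pattern as above: a sentence starts at a non-space character and ends
# at the first . ! or ? that is followed by whitespace or the end of input.
SENTENCE_RE = re.compile(r"(\S.+?[.!?])(?=\s+|$)")


def split_sentences(text):
    """Return the sentences found in text (hypothetical helper)."""
    return [m.group(1) for m in SENTENCE_RE.finditer(text)]


text = (
    "Hello world! How are you? I am fine. "
    "This is a difficult sentence because I use I.D.\n"
    "Newlines should also be accepted. "
    "Numbers should not cause sentence breaks, like 1.23."
)

for sentence in split_sentences(text):
    print(sentence)
```

Note that the lookahead only requires whitespace after the punctuation, so an abbreviation followed by a space would still end a sentence.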
For complicated ones, of course, you will need a real parser like SharpNLP or NLTK. Mine is just a quick and dirty one.
Here is the SharpNLP info, and features:
SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:
- a sentence splitter
- a tokenizer
- a part-of-speech tagger
- a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
- a parser
- a name finder
- a coreference tool
- an interface to the WordNet lexical database
What is a regular expression for parsing out individual Persian sentences?
How about this:
([^!؟.؛]+[؟.؛!])
It matches a run of characters that contains none of those punctuation marks, followed by a single one of them.
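A quick way to try the pattern is a short Python script. This is a sketch only: the sample text is made up, and the Persian question mark and semicolon are written as escapes (\u061f and \u061b) so that the character class is easy to read.

```python
import re

# Pattern from the answer: one or more non-terminator characters, then one
# terminator. \u061f is the Persian question mark, \u061b the semicolon.
PERSIAN_SENTENCE_RE = re.compile(r"([^!\u061f.\u061b]+[\u061f.\u061b!])")

# Made-up sample text, for illustration only.
text = "سلام." + " حال شما چطور است\u061f" + " من خوبم!"

# The pattern keeps the whitespace between sentences, so strip each match.
sentences = [s.strip() for s in PERSIAN_SENTENCE_RE.findall(text)]

for sentence in sentences:
    print(sentence)
```

Because the negated class also excludes the period, decimal numbers and abbreviations will be split as well.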
Regular Expression for parsing parts of a sentence
I used the Splunk interactive field extractor.
Use the following rex commands in your search.
For the service type:
| rex "(?i)^The\s(?P<ServiceType>[^ ]+)\sservice"
For the service status:
| rex "(?i)sent\sa\s(?P<ServiceStatus>[^ ]+)"
Use the fields "ServiceType" and "ServiceStatus" for further results and charting.
\s matches a whitespace character; you can also use a literal space " " instead.
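To check the two patterns outside Splunk, they can be run through Python's re module, which accepts the same (?P<name>...) group syntax. This is a sketch: the event text below is hypothetical, since the actual log line is not shown here.

```python
import re

# The two patterns from the rex commands above, unchanged.
SERVICE_TYPE_RE = re.compile(r"(?i)^The\s(?P<ServiceType>[^ ]+)\sservice")
SERVICE_STATUS_RE = re.compile(r"(?i)sent\sa\s(?P<ServiceStatus>[^ ]+)")

# Hypothetical event text; the real log format is an assumption.
event = "The Spooler service was successfully sent a start control."

fields = {}
for rx in (SERVICE_TYPE_RE, SERVICE_STATUS_RE):
    m = rx.search(event)
    if m:
        # groupdict() maps each named group to the text it captured
        fields.update(m.groupdict())

print(fields)
```

Since [^ ]+ stops at the first space, a service name made of several words would not match the first pattern.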
Regexp for parsing words from sentence
$words = preg_split('#[\\s.]#', $string, -1, PREG_SPLIT_NO_EMPTY);
The \\s matches any whitespace character (space, tab, newline, etc.), and the . matches a literal period. To split on more characters, just add them after the . inside the brackets. Two caveats: a [, a ] and a # must be escaped with \\ (the # because it is the pattern delimiter here), and a - must be the first or last character in the list, or be escaped, so that it is not read as a range.
For your sentence above, it returns:
array(9) {
[0]=>
string(2) "My"
[1]=>
string(4) "name"
[2]=>
string(2) "is"
[3]=>
string(3) "Bob"
[4]=>
string(3) "I'm"
[5]=>
string(3) "104"
[6]=>
string(3) "yrs"
[7]=>
string(3) "old"
}
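The same split can be sketched in Python for comparison. The input sentence is an assumption, reconstructed from the words in the output above, and the list comprehension stands in for PREG_SPLIT_NO_EMPTY.

```python
import re

# Assumed input, reconstructed from the words listed in the output above.
sentence = "My name is Bob. I'm 104 yrs old."

# Split on whitespace or a period, then drop the empty pieces
# (the job PREG_SPLIT_NO_EMPTY does in PHP).
words = [w for w in re.split(r"[\s.]", sentence) if w]

print(words)
```

The apostrophe is not in the character class, so "I'm" stays in one piece.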
Improve regex to Split large text into sentences
\p{Lu} indicates a Unicode uppercase letter (including accented ones), so
string[] sentences = Regex.Split(mytext, @"(?<=[.!?])\s+(?=\p{Lu})");
should do what you want.
(Note that I don't think . or ? need to be escaped in a character class, so I've removed the escapes too, but do check that this still works with those characters.)
However, note that this will still split on e.g. "Mr. Jones".
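The lookbehind/lookahead split, and the abbreviation problem, can be seen in a small Python sketch. One assumption to flag: Python's standard re module has no Unicode property classes, so [A-Z] is used here as an ASCII-only stand-in for the uppercase-letter class.

```python
import re

# Split on whitespace that follows . ! or ? and precedes an uppercase letter.
# [A-Z] is an ASCII-only stand-in: the standard re module does not support
# \p{...} property classes.
SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

text = "Hello there. How are you? Mr. Jones is here."
sentences = SPLIT_RE.split(text)

for s in sentences:
    print(repr(s))
```

The split between "Mr." and "Jones is here." is the unwanted break mentioned above; handling it needs an abbreviation list or a real sentence splitter.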
Parse out single complete sentence containing a certain string pattern with regex
The solution is: [^。]*C[^。]*。
As for why your original regex /。.+?C.+?。/ didn't work:
- It first matches 。
- Then .+?C keeps fetching characters until it finds C, so .+? ends up covering this entire run: xxxx。xxx。xxx。xxxxx。xxx
- Once C has been found, the last part of your regex, .+?。, comes into action. It fetches everything up to the next 。
Therefore you get the result: 。xxxx。xxx。xxx。xxxxx。xxxCxxxx。
This one, [^。]*C[^。]*。, works for you because:
- [^。]*C fetches anything but 。, and that run must be followed by C, which gives xxxC
- [^。]*。 again fetches anything but 。, stops when it finds 。, and matches it.
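The difference between the two patterns is easy to check with a short Python sketch. The sample text is made up; only the two patterns come from the discussion above.

```python
import re

# Made-up sample: four sentences ending in 。, one of which contains C.
text = "aaa。bbb。cccCddd。eee。"

lazy = re.search(r"。.+?C.+?。", text)          # the original attempt
negated = re.search(r"[^。]*C[^。]*。", text)   # the suggested fix

print(lazy.group())     # starts at the first 。 and runs across sentences
print(negated.group())  # only the sentence that contains C
```

The lazy quantifier only limits how far a match extends to the right; it does not move the starting point, which is why the first pattern begins at the earliest 。 it can find.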
Find out till where a regex satisfies a sentence
Assuming that your regex is rather simple, with no groups, backreferences, lookaheads, etc., e.g. as in your case, following the pattern \w[+*?]?, you can first split it up into parts, as you already do. But then instead of iteratively joining the parts and matching them against the entire string, you can test each part individually by slicing away the already matched parts.
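import re

def match(pattern, string):
    # res: text matched so far; pat: the pattern parts that produced it
    res = pat = ""
    for p in re.findall(r"\w[+*?]?", pattern):
        m = re.match(p, string)
        if m:
            g = m.group()
            string = string[len(g):]  # slice away the part already matched
            res, pat = res + g, pat + p
        else:
            break
    return pat, res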
Example:
>>> for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
...     print(*match("M+V?T*Z+", s))
...
M+V?T* MMMV
M+V?T* MMV
M+V?T*Z+ MTTZZZ
M+V?T*Z+ MVZZZ
M+V?T*Z+ MVTZ
Note, however, that in the worst case of having a string of length n and a pattern of n parts, each matching just a single character, this will still have O(n²) for repeatedly slicing the string.
Also, this may fail if two consecutive parts are about the same character, e.g. a?a+b (which should be equivalent to a+b) will not match ab but only aab, as the single a is already "consumed" by the a?.
You could get the complexity down to O(n) by writing your own very simple regex matcher for that very reduced sort of regex, but in the average case that might not be worth it, or even slower.
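Short of writing a full matcher, one cheap way to avoid copying the rest of the string at every step is to keep an offset and let the compiled pattern match from that position. A sketch, assuming the same restricted pattern syntax; match_prefix is a made-up name, and it has the same a?a+b limitation as the slicing version.

```python
import re


def match_prefix(pattern, string):
    """Return (matched pattern parts, matched text) without slicing the input."""
    pos = 0
    matched_parts = []
    for part in re.findall(r"\w[+*?]?", pattern):
        # Pattern.match(string, pos) anchors the match at offset pos
        m = re.compile(part).match(string, pos)
        if m is None:
            break
        pos = m.end()
        matched_parts.append(part)
    # A single slice at the end yields the matched prefix
    return "".join(matched_parts), string[:pos]


for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
    print(*match_prefix("M+V?T*Z+", s))
```

Whether this is actually faster than slicing for typical input lengths would need to be measured.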