Programmatically Derive a Regular Expression from a String

Programmatically find a string that matches a regular expression?

Assume you define regular expressions like this:

R :=
<literal string>
(RR) -- concatenation
(R*) -- kleene star
(R|R) -- choice

Then you can define a recursive function S(r) which finds a matching string:

S(<literal string>) = <literal string>
S(rs) = S(r) + S(s)
S(r*) = ""
S(r|s) = S(r)

For example: S(a*(b|c)) = S(a*) + S(b|c) = "" + S(b) = "" + "b" = "b".

If you have a more complex notion of regular expression, you can rewrite it in terms of the basic primitives and then apply the above. For example, R+ = RR* and [abc] = (a|b|c).

Note that if you've got a parsed regular expression (so you know its syntax tree), then the above algorithm takes at most time linear in the size of the regular expression (assuming you're careful to perform the string concatenations efficiently).
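As a rough illustration, the recursion above could be sketched in Java like this. The class and method names (RegexSample, Node, literal, concat, star, choice, plus) are made up for this example and do not come from any library; it also assumes the regular expression has already been parsed into a tree. A single StringBuilder is threaded through the recursion so that concatenation stays linear.

```java
// Hypothetical sketch: every name in this class is illustrative, not a library API.
final class RegexSample {
    // A syntax-tree node that appends one string it matches to the builder.
    interface Node { void emit(StringBuilder out); }

    // S(<literal string>) = <literal string>
    static Node literal(String text) { return out -> out.append(text); }

    // S(rs) = S(r) + S(s)
    static Node concat(Node r, Node s) { return out -> { r.emit(out); s.emit(out); }; }

    // S(r*) = "": zero repetitions always match, so nothing is appended
    static Node star(Node r) { return out -> { }; }

    // S(r|s) = S(r): always take the first alternative
    static Node choice(Node r, Node s) { return out -> r.emit(out); }

    // Derived form rewritten into the primitives: R+ = RR*
    static Node plus(Node r) { return concat(r, star(r)); }

    // Runs the recursion once, sharing one StringBuilder for all concatenations.
    static String sample(Node root) {
        StringBuilder out = new StringBuilder();
        root.emit(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // a*(b|c), the example from the text
        Node regex = concat(star(literal("a")), choice(literal("b"), literal("c")));
        System.out.println(sample(regex)); // prints "b"
    }
}
```

Each node is visited once and only appends to the shared builder, which is where the linear bound comes from.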

Finding regular expressions from a list of string examples

I implemented a solution myself. I made a small Python package out of it and put it in my GitHub repo at http://github.com/shivylp/RegexUtils

If anyone is looking for something similar, feel free to use it.

How to know if a string could match a regular expression by adding more characters

You can do it as simply as this:

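import java.util.regex.Matcher;
import java.util.regex.Pattern;

// True if the input already matches, or if the engine ran out of input while still matching.
boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
    Matcher m = pattern.matcher(charsSoFar);
    return m.matches() || m.hitEnd();
}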

If the sequence does not match and the engine did not reach the end of the input, it implies that there is a contradicting character before the end, which won’t go away when adding more characters at the end.

Or, as the documentation says:

Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher.

When this method returns true, then it is possible that more input would have changed the result of the last search.

This is also used by the Scanner class internally, to determine whether it should load more data from the source stream for a matching operation.

Using the method above with your sample data yields

Pattern fpNumber = Pattern.compile("[+-]?\\d*\\.?\\d*");
String[] positive = { "+", "-", "123", ".24", "-1.04" };
String[] negative = { "+A", "-B", "123z", ".24.", "-1.04+" };
for (String p : positive) {
    System.out.println("should accept more input: " + p
        + ", couldMatch: " + couldMatch(p, fpNumber));
}
for (String n : negative) {
    System.out.println("can never match at all: " + n
        + ", couldMatch: " + couldMatch(n, fpNumber));
}

should accept more input: +, couldMatch: true
should accept more input: -, couldMatch: true
should accept more input: 123, couldMatch: true
should accept more input: .24, couldMatch: true
should accept more input: -1.04, couldMatch: true
can never match at all: +A, couldMatch: false
can never match at all: -B, couldMatch: false
can never match at all: 123z, couldMatch: false
can never match at all: .24., couldMatch: false
can never match at all: -1.04+, couldMatch: false

Of course, this says nothing about how likely it is that non-matching content can still be turned into a match. You could still construct patterns for which the method returns true even though no additional characters could ever produce a match. However, for ordinary use cases like the floating point number format, it’s reasonable.
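To make that caveat concrete, here is a small sketch with a deliberately contrived pattern, a$b, which no string can match because $ demands the end of the input right before a literal b. The class name HitEndCaveat and the pattern are my own choices for illustration; couldMatch is the method from above, repeated so the example compiles on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only couldMatch comes from the answer above.
final class HitEndCaveat {
    static boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
        Matcher m = pattern.matcher(charsSoFar);
        return m.matches() || m.hitEnd();
    }

    public static void main(String[] args) {
        // "$" requires the end of input, yet a "b" must follow it: nothing can match.
        Pattern impossible = Pattern.compile("a$b");

        // The engine reaches the end of "a" while looking for "b", so hitEnd() is set
        // and couldMatch reports true, although appending characters will never help.
        System.out.println(couldMatch("a", impossible)); // prints "true"
    }
}
```

So a true result only means "not ruled out yet"; it is not a guarantee that some continuation exists.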

Regular expression to get a string between backticks (``) in a console application (C#)

Something like this (regular expression and LINQ):

using System;
using System.Linq;
using System.Text.RegularExpressions;

String test = "select t.`ProductID` AS `ProductID`, t.`AttributeID` ...";

// To keep the backticks in the result, use the pattern @"\bAS\s*(`[^`]*?`)"
String pattern = @"\bAS\s*`([^`]*?)`";

// Collect the text captured by group 1 of every match
var result = Regex
    .Matches(test, pattern, RegexOptions.IgnoreCase)
    .OfType<Match>()
    .Select(match => match.Groups[1].Value)
    .ToArray(); // if you want, say, an array representation

Console.Write(String.Join(", ", result));

And you'll get

  ProductID, AttributeID, ... , ModifiedBy

However, be careful: in general, regular expressions are not a good choice for parsing SQL. Here are some examples of the problems that come up:

-- commented AS ("abc" should not be returned)
select a /* AS `abc`*/
from tbl

-- commented value ("abc" should be returned, not "obsolete" or "proposed")
select a AS /*`obsolete`*/ `abc` /*`proposed`*/
from tbl

-- String ("abc" should not be returned)
select 'a AS `abc`'
from tbl

-- honest AS ("abc" should be returned)
select a /*'*/AS `abc`--'
from tbl

-- commented comment ("abc" should be returned)
select -- /*
a AS `abc`
--*/
from tbl

