Is There an Equivalent of java.util.regex for "Glob" Type Patterns

Is there an equivalent of java.util.regex for glob type patterns?

There's nothing built-in, but it's pretty simple to convert something glob-like to a regex:

public static String createRegexFromGlob(String glob)
{
    StringBuilder out = new StringBuilder("^");
    for (int i = 0; i < glob.length(); ++i)
    {
        final char c = glob.charAt(i);
        switch (c)
        {
            case '*': out.append(".*"); break;     // any run of characters
            case '?': out.append('.'); break;      // exactly one character
            case '.': out.append("\\."); break;    // literal dot
            case '\\': out.append("\\\\"); break;  // literal backslash
            default: out.append(c);
        }
    }
    out.append('$');
    return out.toString();
}

This works for me, but I'm not sure whether it covers the glob "standard", if there is one :)
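One gap in the conversion above is that other regex metacharacters, such as + or (, pass through unescaped. The following is a minimal sketch of a variant that escapes them, not a complete glob implementation: the class name GlobToRegexSketch and the method name toRegex are made up for illustration, and character classes ([abc]) and brace groups are deliberately treated as literals.

```java
import java.util.regex.Pattern;

class GlobToRegexSketch {
    // Characters that have a special meaning in java.util.regex and are
    // escaped here so that they match literally.
    private static final String REGEX_META = "\\.[]{}()+^$|";

    // Hypothetical helper: converts a simple glob (* and ? only) to an anchored regex.
    static String toRegex(String glob) {
        StringBuilder out = new StringBuilder("^");
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            switch (c) {
                case '*': out.append(".*"); break; // any run of characters
                case '?': out.append('.'); break;  // exactly one character
                default:
                    if (REGEX_META.indexOf(c) >= 0) {
                        out.append('\\');          // escape regex metacharacters
                    }
                    out.append(c);
            }
        }
        return out.append('$').toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(toRegex("a+b.*"));
        System.out.println(p.matcher("a+b.txt").matches());
        System.out.println(p.matcher("aab.txt").matches());
    }
}
```

Unlike a shell glob, the generated .* also matches path separators, so this sketch suits plain strings better than file paths.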

Update by Paul Tomblin: I found a Perl program that does glob conversion, and after adapting it to Java I ended up with:

// LOG is assumed to be a logger field of the enclosing class.
private String convertGlobToRegEx(String line)
{
    LOG.info("got line [" + line + "]");
    line = line.trim();
    int strLen = line.length();
    StringBuilder sb = new StringBuilder(strLen);
    // Remove beginning and ending * globs because they're useless
    // (the resulting regex is unanchored, so use it with Matcher.find())
    if (line.startsWith("*"))
    {
        line = line.substring(1);
        strLen--;
    }
    if (line.endsWith("*"))
    {
        line = line.substring(0, strLen-1);
        strLen--;
    }
    boolean escaping = false;
    int inCurlies = 0;
    for (char currentChar : line.toCharArray())
    {
        switch (currentChar)
        {
            case '*':
                if (escaping)
                    sb.append("\\*");
                else
                    sb.append(".*");
                escaping = false;
                break;
            case '?':
                if (escaping)
                    sb.append("\\?");
                else
                    sb.append('.');
                escaping = false;
                break;
            case '.':
            case '(':
            case ')':
            case '+':
            case '|':
            case '^':
            case '$':
            case '@':
            case '%':
                // Always literal in a glob, so escape for the regex
                sb.append('\\');
                sb.append(currentChar);
                escaping = false;
                break;
            case '\\':
                if (escaping)
                {
                    sb.append("\\\\");
                    escaping = false;
                }
                else
                    escaping = true;
                break;
            case '{':
                if (escaping)
                {
                    sb.append("\\{");
                }
                else
                {
                    sb.append('(');
                    inCurlies++;
                }
                escaping = false;
                break;
            case '}':
                if (inCurlies > 0 && !escaping)
                {
                    sb.append(')');
                    inCurlies--;
                }
                else if (escaping)
                    sb.append("\\}");
                else
                    sb.append("}");
                escaping = false;
                break;
            case ',':
                if (inCurlies > 0 && !escaping)
                {
                    sb.append('|');
                }
                else if (escaping)
                    sb.append("\\,");
                else
                    sb.append(",");
                // Reset the flag so the next character is not treated as escaped
                escaping = false;
                break;
            default:
                escaping = false;
                sb.append(currentChar);
        }
    }
    return sb.toString();
}

I'm editing into this answer rather than making my own because this answer put me on the right track.

Match path string using glob in Java

If you have Java 7 or later, you can use FileSystem.getPathMatcher:

final PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:**/*.txt");

This will require converting your strings into instances of Path:

final Path myPath = Paths.get("/foo/bar.txt");
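
Putting the two pieces together, a minimal sketch of the whole check might look like this. The class name PathMatcherExample and the helper isTextFile are made up for illustration; the NIO calls are the standard java.nio.file API.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class PathMatcherExample {
    // Hypothetical helper: true if the path matches the glob **/*.txt
    // on the default file system.
    static boolean isTextFile(String path) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:**/*.txt");
        Path candidate = Paths.get(path);
        return matcher.matches(candidate);
    }

    public static void main(String[] args) {
        System.out.println(isTextFile("/foo/bar.txt"));
        System.out.println(isTextFile("/foo/bar.png"));
    }
}
```

Note that **/*.txt requires at least one directory separator, so a bare file name such as bar.txt does not match; use glob:*.txt or match against path.getFileName() if that matters.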

For earlier versions of Java you might get some mileage out of Apache Commons IO's WildcardFileFilter. You could also try borrowing some code from Spring's AntPathMatcher, although that is very close to the glob-to-regex approach.

Find file with a pattern

String.matches() takes a regular expression, and not a glob pattern.

It so happens that ENV20120517*.*DAT is a valid regex. It does, however, have a different meaning to what you're expecting: it matches any string that starts with ENV2012051 and ends in DAT (the .* matches anything, and the 7* is effectively a no-op).

The following regex is equivalent to the glob pattern in your question: ENV20120517.*[.].*DAT
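
To see the difference, here is a small sketch that runs both expressions through String.matches(). The class name, constant names, and sample file names are invented for illustration.

```java
class GlobVersusRegex {
    // The glob from the question, interpreted (incorrectly) as a regex.
    static final String AS_WRITTEN = "ENV20120517*.*DAT";
    // The regex that expresses what the glob was meant to say.
    static final String INTENDED = "ENV20120517.*[.].*DAT";

    public static void main(String[] args) {
        // No trailing 7 and no dot, yet the glob-as-regex still matches.
        System.out.println("ENV2012051_DAT".matches(AS_WRITTEN));
        // The intended regex rejects it...
        System.out.println("ENV2012051_DAT".matches(INTENDED));
        // ...and accepts a name with the full prefix and a dot.
        System.out.println("ENV20120517_A.DAT".matches(INTENDED));
    }
}
```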

For some ideas on how to do glob matching in Java, see Is there an equivalent of java.util.regex for "glob" type patterns?

C equivalent to java.util.regex

I've had some luck using PCRE for complicated regexes from C or C++.
It's pretty widely used and compliant. It used to have some issues with Unicode data, but it looks like some of those have been resolved now.

PCRE supports named captures like those in your example; you can retrieve them with the pcre_copy_named_substring function.

glob patterns difference between {} and +()

{} implements something similar to Bash's brace expansion. Essentially src/**/*.{js,jsx,ts,tsx,json,css} will become:

[
'src/**/*.js',
'src/**/*.jsx',
'src/**/*.ts',
'src/**/*.tsx',
'src/**/*.json',
'src/**/*.css'
]

There is a time and place for this, but you can see this might be less efficient as now you are processing multiple patterns.
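
The expansion step can be sketched in a few lines. This is only an illustration of the idea, not how any particular glob library implements it: the class name BraceExpansionSketch and the method expand are hypothetical, and nested braces are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class BraceExpansionSketch {
    // Expands non-nested {a,b,c} groups into one pattern per alternative.
    static List<String> expand(String pattern) {
        List<String> result = new ArrayList<>();
        int open = pattern.indexOf('{');
        int close = pattern.indexOf('}', open + 1);
        if (open < 0 || close < 0) {
            result.add(pattern); // nothing (left) to expand
            return result;
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        for (String alt : pattern.substring(open + 1, close).split(",", -1)) {
            // Recurse so that a later group in the suffix is expanded too.
            result.addAll(expand(prefix + alt + suffix));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String p : expand("src/**/*.{js,jsx,ts,tsx,json,css}")) {
            System.out.println(p);
        }
    }
}
```

Each resulting pattern then has to be matched separately, which is where the extra cost comes from.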

You can think of +() as working like a regular expression, where +(js|jsx|ts|tsx|json|css) is roughly equivalent to (js|jsx|ts|tsx|json|css)+.
So it would match things like js or jsjsxtsjson, which is not really equivalent to {}.

If you are looking for a more efficient counterpart to {}, what you probably want is @(js|jsx|ts|tsx|json|css). It is equivalent to the regular expression (js|jsx|ts|tsx|json|css), which matches exactly one occurrence: it would match js but not jsjsxtsjson. The reason this may be more efficient is simply that you get a single pattern as opposed to multiple patterns.
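
The regular-expression analogy can be checked directly with java.util.regex. This sketch only tests the regex counterparts named above, not an actual extglob implementation; the class and constant names are invented.

```java
import java.util.regex.Pattern;

class ExtglobAsRegex {
    // Regex counterpart of +(js|jsx|ts|tsx|json|css): one or more occurrences.
    static final Pattern ONE_OR_MORE =
        Pattern.compile("(js|jsx|ts|tsx|json|css)+");
    // Regex counterpart of @(js|jsx|ts|tsx|json|css): exactly one occurrence.
    static final Pattern EXACTLY_ONE =
        Pattern.compile("(js|jsx|ts|tsx|json|css)");

    public static void main(String[] args) {
        String repeated = "jsjsxtsjson"; // js + jsx + ts + json
        System.out.println(ONE_OR_MORE.matcher("js").matches());
        System.out.println(ONE_OR_MORE.matcher(repeated).matches());
        System.out.println(EXACTLY_ONE.matcher("js").matches());
        System.out.println(EXACTLY_ONE.matcher(repeated).matches());
    }
}
```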

Pattern searching

Have a look at the regular expression package java.util.regex; it is a good starting point.
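
As a minimal sketch of what searching with that package looks like (the class name RegexSearchSketch, the helper findAll, and the sample input are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSearchSketch {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Search for runs of digits in a sample string.
        System.out.println(findAll("\\d+", "a1b22c333"));
    }
}
```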


