Is Regexp.Last_Match Thread Safe

Is Regexp.last_match thread safe?

The Ruby 1.9.2 platform docs state that calling Regexp.last_match is equivalent to reading the special $~ global variable.

From "The Ruby Programming Language", pg 318: "it is important to remember that $~ and the variables derived from it are all thread-local and method-local."

So Regexp.last_match is thread-safe. As for the other methods you are using in method_missing, I believe they are thread-safe as well. (If anybody knows differently, please edit this post.)

Bug with RegExp in JavaScript when do global search

The reason for this behavior is that a RegExp object created with the global (g) flag isn't stateless: it remembers where the last match ended in its lastIndex property. Your second test continues looking for the next match from that position, and reports that it doesn't find any more. Further searches start from the beginning again, as lastIndex is reset to 0 when no match is found:

var pattern = /te/gi;

pattern.test('test');
>> true
pattern.lastIndex;
>> 2

pattern.test('test');
>> false
pattern.lastIndex;
>> 0

You'll notice how this changes when there are two matches, for instance:

var pattern = /t/gi;

pattern.test('test');
>> true
pattern.lastIndex;
>> 1

pattern.test('test');
>> true
pattern.lastIndex;
>> 4

pattern.test('test');
>> false
pattern.lastIndex;
>> 0
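
The same bookkeeping can be modelled outside JavaScript. Below is a minimal Python sketch, assuming a hypothetical GlobalRegex class (not part of any library) that imitates how test() reads and updates lastIndex on a /g pattern. It ignores edge cases such as zero-length matches.

```python
import re


class GlobalRegex:
    """Hypothetical model of a JavaScript RegExp created with the g flag."""

    def __init__(self, pattern, flags=0):
        self._regex = re.compile(pattern, flags)
        self.last_index = 0  # plays the role of lastIndex

    def test(self, text):
        # Resume searching where the previous successful match ended.
        match = self._regex.search(text, self.last_index)
        # On success remember the end of the match; on failure reset to 0.
        self.last_index = match.end() if match else 0
        return match is not None


pattern = GlobalRegex("t", re.IGNORECASE)
results = [(pattern.test("test"), pattern.last_index) for _ in range(3)]
print(results)  # [(True, 1), (True, 4), (False, 0)]
```

In JavaScript itself, the carried-over state can be avoided by dropping the g flag or by creating a fresh RegExp for each independent test.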

How to get a complete (eager) match with module Str in OCaml

The OCaml code seems to be matching exactly the same substring as the tcl code, so I'm not sure what extra eagerness you're looking for.

You're right, this interface to string matching is not functional. It depends on hidden state, and is very likely not thread-safe. OCaml Batteries Included has Str.search, which is slightly nicer, and it suggests that the non-FP parts of the interface should be considered obsolescent.

What is the difference between .*? and .* regular expressions?

It is the difference between greedy and non-greedy quantifiers.

Consider the input 101000000000100.

Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.

Using 1.*?1, *? is non-greedy - it starts by matching nothing, then takes one extra character at a time until the following 1 can match, eventually matching 101.

All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.

In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).

See Quantifier Cheat Sheet.
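
The difference is easy to check in any regex engine. Here is a short Python sketch using the input from above; the "<a><b>" string is a made-up sample for the tag pattern.

```python
import re

text = "101000000000100"

# Greedy: .* runs to the end, then backtracks to the last 1.
greedy = re.search(r"1.*1", text).group()
# Non-greedy: .*? grows one character at a time and stops at the first 1.
lazy = re.search(r"1.*?1", text).group()
print(greedy)  # 1010000000001
print(lazy)    # 101

html = "<a><b>"  # hypothetical sample input
greedy_tags = re.findall(r"<(.*)>", html)      # one match spanning both tags
lazy_tags = re.findall(r"<(.*?)>", html)       # one match per tag
negated_tags = re.findall(r"<([^>]*)>", html)  # same result without laziness
print(greedy_tags, lazy_tags, negated_tags)
```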

Python's re module - saving state?

Trying out some ideas...

It looks like you would ideally want an expression with side effects. If this were allowed in Python:

if m = re.match(r'foo (\w+) bar (\d+)', line):
  # do stuff with m.group(1) and m.group(2)
elif m = re.match(r'baz whoo_(\d+)', line):
  # do stuff with m.group(1)
elif ...

... then you would clearly and cleanly be expressing your intent. But it's not. If side effects were allowed in nested functions, you could:

m = None
def assign_m(x):
  m = x
  return x

if assign_m(re.match(r'foo (\w+) bar (\d+)', line)):
  # do stuff with m.group(1) and m.group(2)
elif assign_m(re.match(r'baz whoo_(\d+)', line)):
  # do stuff with m.group(1)
elif ...

Now, not only is that getting ugly, but it still doesn't work as written -- the assignment inside the nested function 'assign_m' creates a new local m instead of rebinding the m in the outer scope (that would need a global or nonlocal declaration). The best I can come up with is really ugly, using a nested class which is allowed side effects:

# per Brian's suggestion, a wrapper that is stateful
class m_(object):
  def match(self, *args):
    self.inner_ = re.match(*args)
    return self.inner_
  def group(self, *args):
    return self.inner_.group(*args)
m = m_()

# now 'm' is a stateful regex
if m.match(r'foo (\w+) bar (\d+)', line):
  # do stuff with m.group(1) and m.group(2)
elif m.match(r'baz whoo_(\d+)', line):
  # do stuff with m.group(1)
elif ...

But that is clearly overkill.

You might consider using an inner function to allow local scope exits, which allows you to remove the else nesting:

def find_the_right_match():
  # 'm' is an ordinary local variable here; no stateful wrapper is needed
  m = re.match(r'foo (\w+) bar (\d+)', line)
  if m:
    # do stuff with m.group(1) and m.group(2)
    return # <== exit nested function only
  m = re.match(r'baz whoo_(\d+)', line)
  if m:
    # do stuff with m.group(1)
    return

find_the_right_match()

This lets you flatten nesting=(2*N-1) to nesting=1, but you may have just moved the side-effects problem around, and the nested functions are very likely to confuse most Python programmers.

Lastly, there are side-effect-free ways of dealing with this:

def cond_with(*phrases):
  """For each 2-tuple, invoke the first item. For the first pair whose
  first item returns a truthy value, pass that result to the second
  function in the pair. Like an if-elif-elif... chain."""
  for (cond_lambda, then_lambda) in phrases:
    c = cond_lambda()
    if c:
      return then_lambda(c)
  return None

cond_with(
  ((lambda: re.match(r'foo (\w+) bar (\d+)', line)),
      (lambda m:
          ... # do stuff with m.group(1) and m.group(2)
          )),
  ((lambda: re.match(r'baz whoo_(\d+)', line)),
      (lambda m:
          ... # do stuff with m.group(1)
          )),
  ...)

And now the code barely even looks like Python, let alone being understandable to Python programmers (is that Lisp?).

I think the moral of this story is that Python is not optimized for this sort of idiom. You really need to just be a little verbose and live with a large nesting factor of else conditions.
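
For readers on Python 3.8 or later, assignment expressions (PEP 572) cover the first wished-for form directly. A minimal sketch; the function name describe, the returned tuples and the sample lines are made up for illustration.

```python
import re


def describe(line):
    # Each branch binds m and tests it in the same expression (Python 3.8+).
    if m := re.match(r"foo (\w+) bar (\d+)", line):
        return ("foo-bar", m.group(1), int(m.group(2)))
    elif m := re.match(r"baz whoo_(\d+)", line):
        return ("baz", int(m.group(1)))
    return None


print(describe("foo apple bar 42"))  # ('foo-bar', 'apple', 42)
print(describe("baz whoo_7"))        # ('baz', 7)
print(describe("nothing to see"))    # None
```

On older interpreters the verbose chain or one of the workarounds above is still what you are left with.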

Regex gone wild: java.util.regex.Pattern matcher goes into high CPU loop

Note that pobrelkey's and David Wallace's answers are both correct but here's a bit more explanation...

The reason this regex is "going wild" (great title BTW) is because it is subject to catastrophic backtracking. It has the classic: /^(A*)*$/ form. Note that this runaway behavior only occurs when the pattern does NOT match the target string.

Given the runaway pattern: ^(A*|B*|C*|D*)*$ there are several options to fix it:

^(A|B|C|D)*$ - Remove the asterisk (the "zero or more" quantifier) from each of the four alternatives within the group.
^(A*+|B*+|C*+|D*+)*$ - Make each alternative asterisk quantifier possessive (i.e. Change each * to *+).
^(?>A*|B*|C*|D*)*$ - Make the group containing the alternatives atomic.

The last two should perform quite a bit faster than the first, but all three will fix the "regex gone wild" problem. (And yes, it's best NOT to parse HTML with regex.)
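
As a rough illustration of the first fix, here is a Python sketch that uses single letters as hypothetical stand-ins for A, B, C and D. It only checks that the rewritten pattern accepts the same short strings as the runaway one, and that it rejects a longer non-matching string; the runaway pattern is deliberately never run on long input. (Python's re module supports possessive quantifiers and atomic groups only from version 3.11, so the other two fixes are not shown.)

```python
import re

# Hypothetical single-character stand-ins for the alternatives A, B, C, D.
runaway = re.compile(r"^(?:a*|b*|c*|d*)*$")  # nested quantifiers
fixed = re.compile(r"^(?:a|b|c|d)*$")        # inner asterisks removed

# Short inputs only: both patterns should agree on match / no match.
samples = ["", "abcd", "aabbccdd", "dcba", "abx", "x"]
agreement = [bool(runaway.match(s)) == bool(fixed.match(s)) for s in samples]
print(agreement)

# A longer non-matching input, tried against the fixed pattern only.
long_bad = "abcd" * 50 + "!"
print(fixed.match(long_bad))
```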

Java: Replace RegEx with method outcome

Don't try to do it as a one-liner. Use a loop that checks each pattern that might match.

Here's some code that will do the trick for you (this should compile and run as-is):

package org.test.stackoverflow;

import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReplacer {
  private final Pattern keyPattern = Pattern.compile("%([^%]*)%");
  private final Properties properties;

  public PatternReplacer(Properties propertySeed) {
    properties = propertySeed;
  }

  public String replace(String input) {
    int start = 0;

    while(true) {
      Matcher match = keyPattern.matcher(input);

      if(!match.find(start)) break;

      String group = match.group(1);
      if(properties.containsKey(group)) {
        // String.replace substitutes literally, so regex metacharacters in the key or value are safe
        input = input.replace("%" + group + "%", properties.getProperty(group));
      } else {
        start = match.start() + group.length();
      }
    }

    return input;
  }

  public static void main(String... args) {
    Properties p = new Properties();
    p.put("animal1", "cat");
    p.put("animal2", "dog");

    PatternReplacer test = new PatternReplacer(p);
    String result = test.replace("foo %animal1% %bar% %animal2%baz %animal1% qu%ux");
    System.out.println(result);
  }
}

Output:

foo cat %bar% dogbaz cat qu%ux
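
For comparison, the same idea can be written with a replacement callback. This is a hedged Python sketch, not a translation of the Java class: replace_keys and lookup are illustrative names, and the behavior differs in edge cases because the text is scanned in a single pass (substituted values are not rescanned, and the closing % of an unknown key is not reused as an opening delimiter).

```python
import re

KEY_PATTERN = re.compile(r"%([^%]*)%")


def replace_keys(text, properties):
    """Replace each %key% with properties[key]; leave unknown keys alone."""
    def lookup(match):
        key = match.group(1)
        # Fall back to the full matched text, delimiters included.
        return properties.get(key, match.group(0))

    return KEY_PATTERN.sub(lookup, text)


props = {"animal1": "cat", "animal2": "dog"}
result = replace_keys("foo %animal1% %bar% %animal2%baz %animal1% qu%ux", props)
print(result)  # foo cat %bar% dogbaz cat qu%ux
```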

Is the order of array returned from C#'s Regex.Matches guaranteed to be in the order of the text?

While the MSDN doesn't specifically state it, it's pretty clear that the matches always will be in order. The MSDN describes how the MatchCollection object is lazy-loaded. Since regex patterns are always processed in a linear fashion (either left-to-right or right-to-left), it's hard to imagine that they would be lazy-loaded in any other order.

For instance, here is an excerpt from this MSDN article:

The MatchCollection object is populated as needed on a match-by-match basis. It is equivalent to the regular expression engine calling the Regex.Match method repeatedly and adding each match to the collection. This technique is used when the collection is accessed through its GetEnumerator method, or when it is accessed using the foreach statement (in C#) or the For Each...Next statement (in Visual Basic).

If it is the same as calling match repeatedly (passing the end position of the last match as the start position for the next one), then clearly that implies that they would be in order.

When you combine that with the presence of the RegexOptions.RightToLeft option, it becomes even more clear:

By default, the regular expression engine searches from left to right. You can reverse the search direction by using the RegexOptions.RightToLeft option. The search automatically begins at the last character position of the string. For pattern-matching methods that include a starting position parameter, such as Regex.Match(String, Int32), the starting position is the index of the rightmost character position at which the search is to begin.

Even so, if you don't trust it, and you must guarantee the order, you could sort them by the Match.Index property:

var matches = Regex.Matches(input, pattern).Cast<Match>().OrderBy(x => x.Index);

How to match line start and ending with specific text

You can use

Zahlbetrag.*?\K\d+,\d{2}(?=\s*CHF)

See the regex demo. Details:

Zahlbetrag - a literal string
.*? - zero or more chars other than line break chars, as few as possible
\K - omit the text matched so far
\d+ - one or more digits
, - a comma
\d{2} - two digits
(?=\s*CHF) - a positive lookahead that matches a location immediately followed by zero or more whitespace characters and then CHF.
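
Not every engine supports \K; Python's built-in re module, for one, does not. In that case a capturing group around the amount gives the same result. A minimal sketch, with a made-up sample line:

```python
import re

# Same pattern, with the amount in a capturing group instead of using \K.
pattern = re.compile(r"Zahlbetrag.*?(\d+,\d{2})(?=\s*CHF)")

line = "Zahlbetrag 123,45 CHF"  # hypothetical sample input
match = pattern.search(line)
amount = match.group(1) if match else None
print(amount)  # 123,45
```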
