How to Create a Stream of Regex Matches

How do I create a Stream of regex matches?

In Java 8, there is Pattern.splitAsStream, which provides a stream of items split by a delimiter pattern, but unfortunately there is no supporting method for getting a stream of matches.

If you are going to implement such a Stream, I recommend implementing Spliterator directly rather than implementing and wrapping an Iterator. You may be more familiar with Iterator, but implementing a simple Spliterator is straightforward:

final class MatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;
    MatchItr(Matcher m) {
        super(m.regionEnd()-m.regionStart(), ORDERED|NONNULL);
        matcher=m;
    }
    public boolean tryAdvance(Consumer<? super String> action) {
        if(!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }
}

You may consider overriding forEachRemaining with a straightforward loop, though.
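A minimal sketch of such an override, assuming the same approach as MatchItr above. The class name MatchSpliterator and the allMatches helper are illustrative and not part of the original answer:

```java
import java.util.List;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical variant of MatchItr that adds a bulk-traversal override.
final class MatchSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;

    MatchSpliterator(Matcher m) {
        // The region length is only an upper bound on the number of matches.
        super(m.regionEnd() - m.regionStart(), ORDERED | NONNULL);
        matcher = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }

    @Override
    public void forEachRemaining(Consumer<? super String> action) {
        // Plain loop over the remaining matches, without going through tryAdvance.
        while (matcher.find()) action.accept(matcher.group());
    }

    // Illustrative helper: collects all matches of p in input into a list.
    static List<String> allMatches(Pattern p, CharSequence input) {
        return StreamSupport.stream(new MatchSpliterator(p.matcher(input)), false)
                .collect(Collectors.toList());
    }
}
```

Terminal operations such as collect on a sequential stream typically traverse the source through forEachRemaining, so the loop above is the path that bulk operations would take.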


If I understand your attempt correctly, the solution should look more like:

Pattern pattern = Pattern.compile(
    "[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");

try(BufferedReader br=new BufferedReader(System.console().reader())) {

    br.lines()
      .flatMap(line -> StreamSupport.stream(new MatchItr(pattern.matcher(line)), false))
      .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
      .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}

Java 9 provides a method Stream<MatchResult> results() directly on the Matcher. But for finding matches within a stream, there’s an even more convenient method on Scanner. With that, the implementation simplifies to

try(Scanner s = new Scanner(System.console().reader())) {
    s.findAll(pattern)
     .collect(Collectors.groupingBy(MatchResult::group,TreeMap::new,Collectors.counting()))
     .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}

This answer contains a back-port of Scanner.findAll that can be used with Java 8.

Pattern matching using java streams

You can use the method Matcher.results() to get a stream of match results for each subsequence of the input sequence that matches the pattern. That also enables you to do the task in one go, without the intermediate steps you are doing right now, such as storing sub-results in a list.

List<String> catalogIds =
    IntStream.range(0, fragmentedHeader.length - 1)
        .filter(index -> fragmentedHeader[index+1].contains("ads_management"))
        .mapToObj(index -> fragmentedHeader[index])
        .flatMap(str -> Pattern.compile("[0-9]{15}").matcher(str).results())
        .map(MatchResult::group)
        .collect(Collectors.toList());

Match a pattern and write the stream to a file using Java 8 Stream

Unfortunately, the Java regular expression classes don't provide a stream for matched results, only a splitAsStream() method, but you don't want split.

Note: It has been added in Java 9 as Matcher.results().

You can however create a generic helper class for it yourself:

public final class PatternStreamer {
    private final Pattern pattern;
    public PatternStreamer(String regex) {
        this.pattern = Pattern.compile(regex);
    }
    public Stream<MatchResult> results(CharSequence input) {
        List<MatchResult> list = new ArrayList<>();
        for (Matcher m = this.pattern.matcher(input); m.find(); )
            list.add(m.toMatchResult());
        return list.stream();
    }
}

Then your code becomes easy by using flatMap():

private static final PatternStreamer quoteRegex = new PatternStreamer("\"([^\"]*)\"");
public static void main(String[] args) throws Exception {
    String inFileName = "c:\\exec.log";
    String outFileName = "c:\\exec_quoted.txt";
    try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
        Set<String> dataSet = stream.flatMap(quoteRegex::results)
                                    .map(r -> r.group(1))
                                    .collect(Collectors.toSet());
        Files.write(Paths.get(outFileName), dataSet);
    }
}

Since you only process a line at a time, the temporary List is fine. If the input string is very long and will have a lot of matches, then a Spliterator would be a better choice. See How do I create a Stream of regex matches?

Java Stream filter with regex not working

You're almost there.

Optional<Invoice> invoice = list.stream()
    .filter(line -> line.getOrderNum().matches("\\D+"))
    .findFirst();

What's happening here is that you create a custom Predicate used to filter the stream. It converts the current Invoice to a boolean result.


If you already have a compiled Pattern that you'd like to re-use:

Pattern p = …
Optional<Invoice> invoice = list.stream()
    .filter(line -> p.matcher(line.getOrderNum()).matches())
    .findFirst();
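Note that matches() only succeeds when the entire string matches the pattern, whereas Pattern.asPredicate() (available since Java 8) builds a predicate based on find(), which accepts a partial match. The sketch below illustrates the difference on plain strings, since the Invoice class is not shown here; the class and method names are made up for the example:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Illustrative helper operating on order numbers as plain strings.
final class OrderNumFilter {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // First order number that consists solely of non-digit characters.
    static Optional<String> firstNonNumeric(List<String> orderNums) {
        // matches() requires the whole string to match the pattern.
        return orderNums.stream()
                .filter(s -> NON_DIGITS.matcher(s).matches())
                .findFirst();
    }

    // First order number that contains at least one non-digit character.
    static Optional<String> firstContainingNonDigit(List<String> orderNums) {
        // asPredicate() tests with find(), so a partial match is enough.
        Predicate<String> containsNonDigit = NON_DIGITS.asPredicate();
        return orderNums.stream().filter(containsNonDigit).findFirst();
    }
}
```

For the list "123", "A45", "ABC", the first method selects "ABC" and the second selects "A45".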

Count regex matches with streams

To use Pattern::splitAsStream properly, you have to invert your regex. That means instead of \\d+ (which would split on every number) you should use \\D+. This gives you every number in your String.

final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();
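One caveat: when the input starts with a non-digit, the delimiter matches at position 0 and splitAsStream emits an empty leading string, which inflates the count. A small sketch that filters those out; the class and method names are illustrative:

```java
import java.util.regex.Pattern;

// Illustrative helper; not part of the original answer.
final class NumberCount {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Counts the digit runs in the input, ignoring empty tokens such as the
    // leading one produced when the input starts with a delimiter.
    static long countNumbers(String input) {
        return NON_DIGITS.splitAsStream(input)
                .filter(token -> !token.isEmpty())
                .count();
    }
}
```

With this, countNumbers("a1b2") is 2, whereas the unfiltered stream for the same input has three elements.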

Performing regex on a stream

You could use a Scanner and the findWithinHorizon method:

Scanner s = new Scanner(new File("thefile"));
String nextMatch = s.findWithinHorizon(yourPattern, 0);

From the API documentation of findWithinHorizon:

If horizon is 0, then the horizon is ignored and this method continues to search through the input looking for the specified pattern without bound. In this case it may buffer all of the input searching for the pattern.

A side note: When matching on multiple lines, you might want to look at the constants Pattern.MULTILINE and Pattern.DOTALL.
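To collect every match rather than just the next one, findWithinHorizon can be called in a loop until it returns null. A minimal sketch, assuming the pattern never matches the empty string; the class and method names are invented for the example:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Illustrative helper; not part of the original answer.
final class ScannerMatches {
    // Collects every match of the pattern from a character source.
    static List<String> findAll(Readable source, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            String match;
            // Horizon 0 searches without bound; null means no further match.
            while ((match = scanner.findWithinHorizon(pattern, 0)) != null) {
                matches.add(match);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Matches are found across line boundaries, ignoring delimiters.
        System.out.println(findAll(new StringReader("a1 b22\nc333"), Pattern.compile("\\d+")));
    }
}
```

Each successful call advances the scanner past the matched input, which is why the loop makes progress.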

Create array of regex matches

(4castle's answer is better than the below if you can assume Java >= 9)

You need to create a matcher and use that to iteratively find matches.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

...

List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("your regular expression here")
    .matcher(yourStringHere);
while (m.find()) {
    allMatches.add(m.group());
}

After this, allMatches contains the matches, and you can use allMatches.toArray(new String[0]) to get an array if you really need one.
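On Java 9 or later, the loop can be replaced with Matcher.results(), which is mentioned elsewhere on this page. The following is a sketch of that approach, not a reproduction of the answer referenced above; the class and method names are illustrative:

```java
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

// Illustrative helper; requires Java 9+, where Matcher.results() was added.
final class MatchArray {
    // Returns all matches of the regex in the input as an array.
    static String[] allMatches(String regex, CharSequence input) {
        return Pattern.compile(regex)
                .matcher(input)
                .results()
                .map(MatchResult::group)
                .toArray(String[]::new);
    }
}
```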


You can also use MatchResult to write helper functions to loop over matches, since Matcher.toMatchResult() returns a snapshot of the current group state.

For example you can write a lazy iterator to let you do

for (MatchResult match : allMatches(pattern, input)) {
    // Use match, and maybe break without doing the work to find all possible matches.
}

by doing something like this:

public static Iterable<MatchResult> allMatches(
        final Pattern p, final CharSequence input) {
    return new Iterable<MatchResult>() {
        public Iterator<MatchResult> iterator() {
            return new Iterator<MatchResult>() {
                // Use a matcher internally.
                final Matcher matcher = p.matcher(input);
                // Keep a match around that supports any interleaving of hasNext/next calls.
                MatchResult pending;

                public boolean hasNext() {
                    // Lazily fill pending, and avoid calling find() multiple times if the
                    // clients call hasNext() repeatedly before sampling via next().
                    if (pending == null && matcher.find()) {
                        pending = matcher.toMatchResult();
                    }
                    return pending != null;
                }

                public MatchResult next() {
                    // Fill pending if necessary (as when clients call next() without
                    // checking hasNext()), throw if not possible.
                    if (!hasNext()) { throw new NoSuchElementException(); }
                    // Consume pending so next call to hasNext() does a find().
                    MatchResult next = pending;
                    pending = null;
                    return next;
                }

                /** Required to satisfy the interface, but unsupported. */
                public void remove() { throw new UnsupportedOperationException(); }
            };
        }
    };
}

With this,

for (MatchResult match : allMatches(Pattern.compile("[abc]"), "abracadabra")) {
    System.out.println(match.group() + " at " + match.start());
}

yields

a at 0
b at 1
a at 3
c at 4
a at 5
a at 7
b at 8
a at 10

Having multiple Regex in Java 8 Stream to read text from Line

You can simply concatenate the streams:

String inFileName = "Sample.log";
String outFileName = "Sample_output.log";
try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
    List<String> timeStamp = stream
        .flatMap(s -> Stream.concat(quoteRegex1.results(s),
                        Stream.concat(quoteRegex2.results(s), quoteRegex3.results(s))))
        .map(r -> r.group(1))
        .collect(Collectors.toList());

    timeStamp.forEach(System.out::println);
    //Files.write(Paths.get(outFileName), dataSet);
}

but note that this will perform three individual searches through each line, which might not only imply lower performance, but also that the order of the matches within one line will not reflect their actual occurrence. It doesn’t seem to be an issue with your patterns, but individual searches even imply possible overlapping matches.
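The ordering and overlap effects can be seen in a small sketch that runs each pattern over the input independently, as the concatenated streams do. The class and method names are illustrative, and the "match@start" output format is chosen only for the demonstration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative demonstration; not part of the original answer.
final class SeparateSearches {
    // Runs each regex over the whole input, one pattern after the other,
    // and records every match together with its start offset.
    static List<String> matchEach(String input, String... regexes) {
        List<String> found = new ArrayList<>();
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {
                found.add(m.group() + "@" + m.start());
            }
        }
        return found;
    }
}
```

Calling matchEach("hello", "llo", "hell") yields llo@2 followed by hell@0: the results follow pattern order rather than position in the input, and the two matches overlap.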

The PatternStreamer of that linked answer also eagerly collects the matches of one string into an ArrayList before creating a stream. A Spliterator-based solution like in this answer is preferable.

Since numerical group references preclude simply combining the patterns in a (pattern1|pattern2|pattern3) manner, truly streaming over the matches of multiple different patterns is a bit more elaborate:

public final class MultiPatternSpliterator
        extends Spliterators.AbstractSpliterator<MatchResult> {
    public static Stream<MatchResult> matches(String input, String... patterns) {
        return matches(input, Arrays.stream(patterns)
                .map(Pattern::compile).toArray(Pattern[]::new));
    }
    public static Stream<MatchResult> matches(String input, Pattern... patterns) {
        return StreamSupport.stream(new MultiPatternSpliterator(patterns,input), false);
    }
    private Pattern[] pattern;
    private String input;
    private int pos;
    private PriorityQueue<Matcher> pendingMatches;

    MultiPatternSpliterator(Pattern[] p, String inputString) {
        super(inputString.length(), ORDERED|NONNULL);
        pattern = p;
        input = inputString;
    }

    @Override
    public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if(pendingMatches == null) {
            pendingMatches = new PriorityQueue<>(
                pattern.length, Comparator.comparingInt(MatchResult::start));
            for(Pattern p: pattern) {
                Matcher m = p.matcher(input);
                if(m.find()) pendingMatches.add(m);
            }
        }
        MatchResult mr = null;
        do {
            Matcher m = pendingMatches.poll();
            if(m == null) return false;
            if(m.start() >= pos) {
                mr = m.toMatchResult();
                pos = mr.end();
            }
            if(m.region(pos, m.regionEnd()).find()) pendingMatches.add(m);
        } while(mr == null);
        action.accept(mr);
        return true;
    }
}

This facility allows matching multiple patterns in a (pattern1|pattern2|pattern3) fashion while still keeping the original groups of each pattern. So when searching for hell and llo in hello, it will find hell and not llo. One difference is that there is no guaranteed order if more than one pattern matches at the same position.

This can be used like

Pattern[] p = Stream.of(reTimeStamp, reHostName, reServiceTime)
        .map(Pattern::compile)
        .toArray(Pattern[]::new);
try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
    List<String> timeStamp = stream
        .flatMap(s -> MultiPatternSpliterator.matches(s, p))
        .map(r -> r.group(1))
        .collect(Collectors.toList());

    timeStamp.forEach(System.out::println);
    //Files.write(Paths.get(outFileName), dataSet);
}

While the overloaded method would allow calling MultiPatternSpliterator.matches(s, reTimeStamp, reHostName, reServiceTime) with the pattern strings to create a stream, this should be avoided within a flatMap operation, because it would recompile every regex for every input line. That's why the code above compiles all patterns into an array first. This is what your original code also does by instantiating the PatternStreamers outside the stream operation.


