How do I create a Stream of regex matches?
Well, in Java 8, there is Pattern.splitAsStream, which will provide a stream of items split by a delimiter pattern, but unfortunately no support method for getting a stream of matches.
If you are going to implement such a Stream, I recommend implementing Spliterator directly rather than implementing and wrapping an Iterator. You may be more familiar with Iterator, but implementing a simple Spliterator is straightforward:
final class MatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;
    MatchItr(Matcher m) {
        super(m.regionEnd()-m.regionStart(), ORDERED|NONNULL);
        matcher=m;
    }
    public boolean tryAdvance(Consumer<? super String> action) {
        if(!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }
}
You may consider overriding forEachRemaining with a straightforward loop, though.
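A minimal sketch of what such an override could look like, assuming the same design as MatchItr above. The class name MatchSpliterator and the helper allMatches are made up for this illustration and are not part of the original answer:

```java
import java.util.List;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

final class MatchSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;

    MatchSpliterator(Matcher m) {
        // The region length is only an upper bound for the number of matches.
        super(m.regionEnd() - m.regionStart(), ORDERED | NONNULL);
        matcher = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }

    @Override
    public void forEachRemaining(Consumer<? super String> action) {
        // Plain loop over the remaining matches instead of repeated tryAdvance calls.
        while (matcher.find()) action.accept(matcher.group());
    }

    // Hypothetical convenience method for the demo below.
    static List<String> allMatches(Pattern p, CharSequence input) {
        return StreamSupport.stream(new MatchSpliterator(p.matcher(input)), false)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(allMatches(Pattern.compile("\\d+"), "a1 b22 c333")); // [1, 22, 333]
    }
}
```

Whether the override pays off depends on the terminal operation: it is only used when the stream consumes all remaining elements in one go.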
If I understand your attempt correctly, the solution should look more like:
Pattern pattern = Pattern.compile(
    "[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");
try(BufferedReader br=new BufferedReader(System.console().reader())) {
    br.lines()
      .flatMap(line -> StreamSupport.stream(new MatchItr(pattern.matcher(line)), false))
      .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
      .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}
Java 9 provides a method Stream<MatchResult> results() directly on the Matcher. But for finding matches within a stream, there’s an even more convenient method on Scanner. With that, the implementation simplifies to
try(Scanner s = new Scanner(System.console().reader())) {
    s.findAll(pattern)
     .collect(Collectors.groupingBy(MatchResult::group,TreeMap::new,Collectors.counting()))
     .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}
This answer contains a back-port of Scanner.findAll that can be used with Java 8.
Pattern matching using java streams
You can use the method Matcher.results() (available since Java 9) to get a stream of match results, one for each subsequence of the input sequence that matches the pattern. That also enables you to do the task in one go, without intermediate steps such as storing sub-results in a list.
List<String> catalogIds =
    IntStream.range(0, fragmentedHeader.length - 1)
        .filter(index -> fragmentedHeader[index+1].contains("ads_management"))
        .mapToObj(index -> fragmentedHeader[index])
        .flatMap(str -> Pattern.compile("[0-9]{15}").matcher(str).results())
        .map(MatchResult::group)
        .collect(Collectors.toList());
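One detail of the snippet above: Pattern.compile is called inside the flatMap lambda, so the regex is recompiled for every element. A small sketch that hoists the pattern into a constant follows. The class name ResultsDemo, the helper extractIds and the sample strings are assumptions for illustration only:

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ResultsDemo {
    // Compiled once instead of once per stream element.
    private static final Pattern ID = Pattern.compile("[0-9]{15}");

    // Hypothetical helper: collects every 15-digit run found in the given strings.
    static List<String> extractIds(Stream<String> fragments) {
        return fragments
                .flatMap(s -> ID.matcher(s).results()) // Matcher.results() needs Java 9+
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ids = extractIds(Stream.of(
                "a123456789012345b",
                "no id here",
                "543210987654321 and 111112222233333"));
        System.out.println(ids); // [123456789012345, 543210987654321, 111112222233333]
    }
}
```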
Match a pattern and write the stream to a file using Java 8 Stream
Unfortunately, the Java regular expression classes don't provide a stream for matched results, only a splitAsStream() method, but you don't want split.
Note: It has been added in Java 9 as Matcher.results().
You can however create a generic helper class for it yourself:
public final class PatternStreamer {
    private final Pattern pattern;
    public PatternStreamer(String regex) {
        this.pattern = Pattern.compile(regex);
    }
    public Stream<MatchResult> results(CharSequence input) {
        List<MatchResult> list = new ArrayList<>();
        for (Matcher m = this.pattern.matcher(input); m.find(); )
            list.add(m.toMatchResult());
        return list.stream();
    }
}
Then your code becomes easy by using flatMap():
private static final PatternStreamer quoteRegex = new PatternStreamer("\"([^\"]*)\"");
public static void main(String[] args) throws Exception {
    String inFileName = "c:\\exec.log";
    String outFileName = "c:\\exec_quoted.txt";
    try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
        Set<String> dataSet = stream.flatMap(quoteRegex::results)
                                    .map(r -> r.group(1))
                                    .collect(Collectors.toSet());
        Files.write(Paths.get(outFileName), dataSet);
    }
}
Since you only process a line at a time, the temporary List is fine. If the input string is very long and will have a lot of matches, then a Spliterator would be a better choice. See How do I create a Stream of regex matches?
Java Stream filter with regex not working
You're almost there.
Optional<Invoice> invoice = list.stream()
    .filter(line -> line.getOrderNum().matches("\\D+"))
    .findFirst();
What's happening here is that you create a custom Predicate used to filter the stream. It converts the current Invoice to a boolean result.
If you already have a compiled Pattern that you'd like to re-use:
Pattern p = …
Optional<Invoice> invoice = list.stream()
    .filter(line -> p.matcher(line.getOrderNum()).matches())
    .findFirst();
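A compiled Pattern can also hand you a ready-made Predicate<String>, but the semantics differ from matches(): Pattern.asPredicate() (Java 8+) tests whether the pattern is found anywhere in the string, while Pattern.asMatchPredicate() (Java 11+) requires the whole string to match. The sketch below shows the difference using plain strings instead of Invoice objects. The class name PredicateDemo and the sample values are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PredicateDemo {
    static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Whole-string semantics, like String.matches and Matcher.matches.
    static List<String> wholeMatch(List<String> in) {
        return in.stream()
                .filter(s -> NON_DIGITS.matcher(s).matches())
                .collect(Collectors.toList());
    }

    // Substring semantics: asPredicate() behaves like Matcher.find().
    static List<String> containsMatch(List<String> in) {
        Predicate<String> p = NON_DIGITS.asPredicate();
        return in.stream().filter(p).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> orderNums = Arrays.asList("ABC", "A12", "123");
        System.out.println(wholeMatch(orderNums));    // [ABC]
        System.out.println(containsMatch(orderNums)); // [ABC, A12]
    }
}
```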
Count regex matches with streams
To use Pattern::splitAsStream properly, you have to invert your regex. That means instead of \\d+ (which would split on every number) you should use \\D+. This gives you every number in your String.
final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();
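One caveat with this approach: if the input starts with non-digit characters, the delimiter matches at the very beginning and splitAsStream emits an empty leading string, which inflates the count. A hedged sketch that filters empty tokens follows. The class name SplitCountDemo, the helper countNumbers and the sample input are assumptions for illustration:

```java
import java.util.regex.Pattern;

final class SplitCountDemo {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Counts without any filtering, as in the snippet above.
    static long rawCount(CharSequence input) {
        return NON_DIGITS.splitAsStream(input).count();
    }

    // Drops empty tokens so that a leading delimiter does not count as a number.
    static long countNumbers(CharSequence input) {
        return NON_DIGITS.splitAsStream(input)
                .filter(s -> !s.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        System.out.println(rawCount("abc 12 de 7"));     // 3: "", "12", "7"
        System.out.println(countNumbers("abc 12 de 7")); // 2
        System.out.println(countNumbers("1,2,3,4"));     // 4
    }
}
```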
Performing regex on a stream
You could use a Scanner and the findWithinHorizon method:
Scanner s = new Scanner(new File("thefile"));
String nextMatch = s.findWithinHorizon(yourPattern, 0);
From the API documentation of findWithinHorizon:
If horizon is 0, then the horizon is ignored and this method continues to search through the input looking for the specified pattern without bound. In this case it may buffer all of the input searching for the pattern.
A side note: When matching on multiple lines, you might want to look at the constants Pattern.MULTILINE and Pattern.DOTALL.
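To collect every match rather than just the next one, findWithinHorizon can be called in a loop until it returns null. The following is a sketch under assumptions: the class name HorizonDemo, the helper findAll and the BEGIN/END sample pattern are invented for illustration, and a StringReader stands in for the file. It also shows Pattern.DOTALL letting a match span a line break:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class HorizonDemo {
    // Hypothetical helper: repeatedly asks the Scanner for the next match.
    static List<String> findAll(Readable source, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        try (Scanner s = new Scanner(source)) {
            String m;
            // Horizon 0 means "search without bound"; null means no further match.
            while ((m = s.findWithinHorizon(pattern, 0)) != null) {
                matches.add(m);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // DOTALL lets '.' match line terminators, so a block may span lines.
        Pattern block = Pattern.compile("BEGIN.*?END", Pattern.DOTALL);
        List<String> found = findAll(
                new StringReader("x BEGIN a\nb END y BEGIN c END"), block);
        found.forEach(System.out::println);
    }
}
```

Note that a pattern that can match the empty string would need extra care in such a loop, since the scanner position would not advance.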
Create array of regex matches
(4castle's answer is better than the below if you can assume Java >= 9)
You need to create a matcher and use that to iteratively find matches.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
...
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("your regular expression here")
    .matcher(yourStringHere);
while (m.find()) {
    allMatches.add(m.group());
}
After this, allMatches contains the matches, and you can use allMatches.toArray(new String[0]) to get an array if you really need one.
You can also use MatchResult to write helper functions to loop over matches, since Matcher.toMatchResult() returns a snapshot of the current group state.
For example, you can write a lazy iterator to let you do
for (MatchResult match : allMatches(pattern, input)) {
    // Use match, and maybe break without doing the work to find all possible matches.
}
by doing something like this:
public static Iterable<MatchResult> allMatches(
        final Pattern p, final CharSequence input) {
    return new Iterable<MatchResult>() {
        public Iterator<MatchResult> iterator() {
            return new Iterator<MatchResult>() {
                // Use a matcher internally.
                final Matcher matcher = p.matcher(input);
                // Keep a match around that supports any interleaving of hasNext/next calls.
                MatchResult pending;
                public boolean hasNext() {
                    // Lazily fill pending, and avoid calling find() multiple times if the
                    // clients call hasNext() repeatedly before sampling via next().
                    if (pending == null && matcher.find()) {
                        pending = matcher.toMatchResult();
                    }
                    return pending != null;
                }
                public MatchResult next() {
                    // Fill pending if necessary (as when clients call next() without
                    // checking hasNext()), throw if not possible.
                    if (!hasNext()) { throw new NoSuchElementException(); }
                    // Consume pending so next call to hasNext() does a find().
                    MatchResult next = pending;
                    pending = null;
                    return next;
                }
                /** Required to satisfy the interface, but unsupported. */
                public void remove() { throw new UnsupportedOperationException(); }
            };
        }
    };
}
With this,
for (MatchResult match : allMatches(Pattern.compile("[abc]"), "abracadabra")) {
    System.out.println(match.group() + " at " + match.start());
}
yields
a at 0
b at 1
a at 3
c at 4
a at 5
a at 7
b at 8
a at 10
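If you want the same laziness in a Stream pipeline on Java 8, such an iterator can be wrapped with Spliterators.spliteratorUnknownSize. This is only a bridging sketch: the class name LazyMatchStream and its stream method are hypothetical, and on Java 9+ Matcher.results() already covers this.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class LazyMatchStream {
    // Hypothetical helper: a lazy Stream over the matches of p in input.
    static Stream<MatchResult> stream(Pattern p, CharSequence input) {
        final Matcher matcher = p.matcher(input);
        Iterator<MatchResult> it = new Iterator<MatchResult>() {
            private MatchResult pending;

            public boolean hasNext() {
                // find() is only called when no unconsumed match is buffered.
                if (pending == null && matcher.find()) {
                    pending = matcher.toMatchResult();
                }
                return pending != null;
            }

            public MatchResult next() {
                if (!hasNext()) throw new NoSuchElementException();
                MatchResult result = pending;
                pending = null;
                return result;
            }
        };
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(
                        it, Spliterator.ORDERED | Spliterator.NONNULL),
                false);
    }

    public static void main(String[] args) {
        // limit(3) stops the search early; later matches are never computed.
        stream(Pattern.compile("[abc]"), "abracadabra")
                .limit(3)
                .map(m -> m.group() + " at " + m.start())
                .forEach(System.out::println);
        // prints: a at 0, b at 1, a at 3 (one per line)
    }
}
```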
Having multiple Regex in Java 8 Stream to read text from Line
You can simply concatenate the streams:
String inFileName = "Sample.log";
String outFileName = "Sample_output.log";
try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
    List<String> timeStamp = stream
        .flatMap(s -> Stream.concat(quoteRegex1.results(s),
                        Stream.concat(quoteRegex2.results(s), quoteRegex3.results(s))))
        .map(r -> r.group(1))
        .collect(Collectors.toList());
    timeStamp.forEach(System.out::println);
    //Files.write(Paths.get(outFileName), dataSet);
}
but note that this will perform three individual searches through each line, which might not only imply lower performance, but also that the order of the matches within one line will not reflect their actual occurrence. It doesn’t seem to be an issue with your patterns, but individual searches even imply possible overlapping matches.
The PatternStreamer of that linked answer also greedily collects the matches of one string into an ArrayList before creating a stream. A Spliterator based solution like in this answer is preferable.
Since numerical group references preclude just combining the patterns in a (pattern1|pattern2|pattern3) manner, a true streaming over matches of multiple different patterns will be a bit more elaborate:
public final class MultiPatternSpliterator
extends Spliterators.AbstractSpliterator<MatchResult> {
    public static Stream<MatchResult> matches(String input, String... patterns) {
        return matches(input, Arrays.stream(patterns)
                .map(Pattern::compile).toArray(Pattern[]::new));
    }
    public static Stream<MatchResult> matches(String input, Pattern... patterns) {
        return StreamSupport.stream(new MultiPatternSpliterator(patterns,input), false);
    }
    private Pattern[] pattern;
    private String input;
    private int pos;
    private PriorityQueue<Matcher> pendingMatches;

    MultiPatternSpliterator(Pattern[] p, String inputString) {
        super(inputString.length(), ORDERED|NONNULL);
        pattern = p;
        input = inputString;
    }

    @Override
    public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if(pendingMatches == null) {
            pendingMatches = new PriorityQueue<>(
                pattern.length, Comparator.comparingInt(MatchResult::start));
            for(Pattern p: pattern) {
                Matcher m = p.matcher(input);
                if(m.find()) pendingMatches.add(m);
            }
        }
        MatchResult mr = null;
        do {
            Matcher m = pendingMatches.poll();
            if(m == null) return false;
            if(m.start() >= pos) {
                mr = m.toMatchResult();
                pos = mr.end();
            }
            if(m.region(pos, m.regionEnd()).find()) pendingMatches.add(m);
        } while(mr == null);
        action.accept(mr);
        return true;
    }
}
This facility allows matching multiple patterns in a (pattern1|pattern2|pattern3) fashion while still having the original groups of each pattern. So when searching for hell and llo in hello, it will find hell and not llo. A difference is that there is no guaranteed order if more than one pattern matches at the same position.
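The hell/llo case can be reproduced with plain java.util.regex to see the behaviour being emulated. The sketch below contrasts independent per-pattern searches, which report overlapping matches in pattern order, with a single alternation. The class name OverlapDemo and its helpers are made up for illustration and do not use MultiPatternSpliterator itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class OverlapDemo {
    // All matches of one regex, formatted as "text@startIndex".
    static List<String> find(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) out.add(m.group() + "@" + m.start());
        return out;
    }

    // One separate search per regex, results concatenated in regex order.
    static List<String> independent(String input, String... regexes) {
        return Arrays.stream(regexes)
                .flatMap(r -> find(r, input).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Overlapping and not ordered by position in the input:
        System.out.println(independent("hello", "llo", "hell")); // [llo@2, hell@0]
        // A single alternation yields non-overlapping matches in input order:
        System.out.println(find("hell|llo", "hello"));           // [hell@0]
    }
}
```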
The MultiPatternSpliterator can be used like this:
Pattern[] p = Stream.of(reTimeStamp, reHostName, reServiceTime)
        .map(Pattern::compile)
        .toArray(Pattern[]::new);
try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
    List<String> timeStamp = stream
        .flatMap(s -> MultiPatternSpliterator.matches(s, p))
        .map(r -> r.group(1))
        .collect(Collectors.toList());
    timeStamp.forEach(System.out::println);
    //Files.write(Paths.get(outFileName), dataSet);
}
While the overloaded method would allow using MultiPatternSpliterator.matches(s, reTimeStamp, reHostName, reServiceTime), which creates the stream from the pattern strings, this should be avoided within a flatMap operation, as it would recompile every regex for every input line. That’s why the code above compiles all patterns into an array first. This is what your original code also does by instantiating the PatternStreamer objects outside the stream operation.
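As a side note on the numeric group problem mentioned earlier: when the patterns are under your control, named groups (supported since Java 7) are one possible way to combine them into a single alternation after all. This is a different technique from the MultiPatternSpliterator, shown only as a sketch. The patterns, group names and the class NamedGroupDemo are invented for illustration and are not the ones from the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupDemo {
    // Hypothetical patterns combined into one alternation with unique group names.
    private static final Pattern COMBINED =
            Pattern.compile("(?<time>\\d{2}:\\d{2})|host=(?<host>\\w+)");

    static List<String> scan(CharSequence line) {
        List<String> found = new ArrayList<>();
        Matcher m = COMBINED.matcher(line);
        while (m.find()) {
            // Only the alternative that matched has a non-null named group.
            if (m.group("time") != null) {
                found.add("time=" + m.group("time"));
            } else {
                found.add("host=" + m.group("host"));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scan("12:30 host=alpha 13:45"));
        // [time=12:30, host=alpha, time=13:45]
    }
}
```

This keeps a single pass over each line and preserves the order of occurrence, at the price of having to rewrite the individual patterns.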