Adding tokens to a Lucene TokenStream
1.
The way the Attribute-based API works is that every TokenStream in your analyzer chain modifies the state of some Attributes on every call of incrementToken(). The last element in your chain then produces the final tokens. Whenever the client of your analyzer chain calls incrementToken(), the last TokenStream sets the state of some Attributes to whatever is necessary to represent the next token. If it is unable to do so, it may call incrementToken() on its input, to let the previous TokenStream do its work. This goes on until a TokenStream returns false, indicating that no more tokens are available.
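This pull-based contract can be sketched without any Lucene classes. The following is a simplified analog, not Lucene code (the names MiniSource and UpperCaseStage are made up for illustration): each stage mutates a shared term slot, the way a real stage mutates its Attributes, and returns false when it is exhausted.

```java
import java.util.Iterator;
import java.util.List;

// Simplified analog of a TokenStream source (like a Tokenizer): it
// overwrites a shared "current term" slot and returns false when exhausted.
final class MiniSource {
    private final Iterator<String> it;
    String term; // shared mutable state, like a CharTermAttribute

    MiniSource(List<String> tokens) { this.it = tokens.iterator(); }

    boolean incrementToken() {
        if (!it.hasNext()) return false; // no more tokens
        term = it.next();
        return true;
    }
}

// A "filter" stage: pulls from its input, then modifies the state in place.
final class UpperCaseStage {
    private final MiniSource input;

    UpperCaseStage(MiniSource input) { this.input = input; }

    boolean incrementToken() {
        if (!input.incrementToken()) return false; // propagate exhaustion
        input.term = input.term.toUpperCase();     // rewrite the shared state
        return true;
    }

    String term() { return input.term; }
}
```

A client simply loops while incrementToken() returns true and reads the shared state after each call, which is exactly how consumers drive a real analyzer chain.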
captureState copies the state of all Attributes of the calling TokenStream into a State; restoreState overwrites every Attribute's state with whatever was captured before (passed in as an argument).
The way your token filter works is: it calls input.incrementToken(), so that the previous TokenStream sets the Attributes' state to what will be the next token. Then, if your defined condition holds (say, termAtt is "b"), it adds "bb" to a stack, saves this state somewhere, and returns true, so that the client may consume the token. On the next call of incrementToken(), it does not call input.incrementToken(). Whatever the current state is, it represents the previous, already-consumed token. The filter then restores the saved state, so that everything is exactly as it was before, produces "bb" as the current token, and returns true, so that the client may consume the token. Only on the call after that does it again consume the next token from the previous filter.
This won't actually produce the graph you displayed, but insert "bb" after "b", so it's really

(a) -> (b) -> (bb) -> (c)
So, why do you save the state in the first place? When producing tokens, you want to make sure that e.g. phrase queries or highlighting will work correctly. When you have the text "a b c" and "bb" is a synonym for "b", you'd expect the phrase query "b c" to work, as well as "bb c". You have to tell the index that both "b" and "bb" are at the same position. Lucene uses a position increment for that, and by default the position increment is 1, meaning that every new token (read: call of incrementToken()) comes 1 position after the previous one. So, with the final positions, the produced stream is

(a:1) -> (b:2) -> (bb:3) -> (c:4)

while you actually want
(a:1) -> (b:2) -> (c:3)
      \          /
       -> (bb:2)
So, for your filter to produce the graph, you have to set the position increment to 0 for the inserted "bb":
private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
// later in incrementToken
restoreState(savedState);
posIncAtt.setPositionIncrement(0);
termAtt.setEmpty().append(extraTokens.remove());
The restoreState makes sure that other attributes, like offsets, token types, etc., are preserved, and you only have to change the ones that are required for your use case. Yes, you are overwriting whatever state was there before restoreState, so it is your responsibility to use this in the right place. And as long as you don't call input.incrementToken(), you don't advance the input stream, so you can do whatever you want with the state.
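Putting the pieces together, a minimal sketch of such a filter could look like this (the class name SynonymInsertFilter and the hard-coded "b"/"bb" pair are illustrative, assuming a recent Lucene version):

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Whenever this filter sees "b", it queues the synonym "bb" and emits it on
// the following call, at the same position as "b".
public final class SynonymInsertFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private final Deque<String> extraTokens = new ArrayDeque<>();
    private State savedState; // State is inherited from AttributeSource

    public SynonymInsertFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!extraTokens.isEmpty()) {
            // Bring back offsets, type, etc. of the already-consumed token,
            // then overwrite only what differs: the term and the increment.
            restoreState(savedState);
            posIncAtt.setPositionIncrement(0); // same position as "b"
            termAtt.setEmpty().append(extraTokens.remove());
            return true;
        }
        if (!input.incrementToken()) {
            return false; // the input stream is exhausted
        }
        if ("b".contentEquals(termAtt)) {
            extraTokens.add("bb");
            savedState = captureState(); // remember the state of "b"
        }
        return true;
    }
}
```

Running "a b c" through a whitespace tokenizer and this filter yields a, b, bb, c with position increments 1, 1, 0, 1, i.e. the graph above.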
2.
A stemmer only changes the token; it typically doesn't produce new tokens, nor does it change the position increment or offsets. Also, since the position increment means that the current term comes positionIncrement positions after the previous token, you should have qux with an increment of 1, because it is the next token after of, and bar should have an increment of 0, because it is at the same position as qux. The table would rather look like
+--------+---------------+-----------+--------------+-----------+
| Term   | startOffset   | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
| fo | 0 | 3 | 1 | 1 |
| qux | 4 | 11 | 1 | 2 |
| bar | 4 | 7 | 0 | 1 |
| baz | 8 | 11 | 1 | 1 |
+--------+---------------+-----------+--------------+-----------+
As a basic rule, for multi-term synonyms, where "ABC" is a synonym for "a b c", you should see that:
- positionIncrement("ABC") > 0 (the increment of the first token)
- positionIncrement(*) >= 0 (positions must not go backwards)
- startOffset("ABC") == startOffset("a") and endOffset("ABC") == endOffset("c")
- in fact, tokens at the same (start|end) position must have the same (start|end) offset
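As an illustration of the qux row above, this is how those four attributes would be populated on a token (a standalone AttributeSource is used here for demonstration; in a real filter you would use the attributes registered on your TokenStream):

```java
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.AttributeSource;

final class QuxRow {
    // Populate the attributes exactly as the "qux" row of the table describes.
    static AttributeSource quxToken() {
        AttributeSource src = new AttributeSource();
        CharTermAttribute term = src.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = src.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posInc = src.addAttribute(PositionIncrementAttribute.class);
        PositionLengthAttribute posLen = src.addAttribute(PositionLengthAttribute.class);
        term.setEmpty().append("qux");
        offset.setOffset(4, 11);        // same span as "bar baz" in the original text
        posInc.setPositionIncrement(1); // comes right after the previous token
        posLen.setPositionLength(2);    // spans the two positions of "bar" and "baz"
        return src;
    }
}
```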
Hope this helps to shed some light.
Get field's tokens from Lucene index
The TokenSources class is a helper class to retrieve the tokens of a document for highlighting purposes. There are two ways to retrieve the terms for a given document:
- re-analyzing a stored field,
- reading the document's terms vector.
The method you want to use tries to read the document's terms vector, but fails because you didn't enable term vectors at indexing time.
So you can either enable term vectors at indexing time and keep using this method (see the Field constructor and the documentation of Field.TermVector), or re-analyze the content of your stored fields. The first method may provide better performance, especially for large fields, whereas the second one will save space (there is no additional information to store if your field is already stored).
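For the first option, recent Lucene versions enable term vectors through a FieldType rather than the Field.TermVector constant of the older constructor. A sketch, assuming a current Lucene release (the helper name makeField is made up):

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

final class TermVectorFields {
    // A stored, analyzed text field with term vectors (plus positions and
    // offsets, which highlighting benefits from) enabled at indexing time.
    static Field makeField(String name, String value) {
        FieldType type = new FieldType(TextField.TYPE_STORED);
        type.setStoreTermVectors(true);
        type.setStoreTermVectorPositions(true);
        type.setStoreTermVectorOffsets(true);
        type.freeze(); // prevent further changes
        return new Field(name, value, type);
    }
}
```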
How to use a Lucene Analyzer to tokenize a String?
As far as I know, you have to write the loop yourself. Something like this (taken straight from my source tree):
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class LuceneUtils {

    public static List<String> parseKeywords(Analyzer analyzer, String field, String keywords) {
        List<String> result = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));
        try {
            while (stream.incrementToken()) {
                result.add(stream.getAttribute(TermAttribute.class).term());
            }
        } catch (IOException e) {
            // not thrown b/c we're using a string reader...
        }
        return result;
    }
}
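That snippet targets an old Lucene (TermAttribute was removed in Lucene 4). On a newer version, the same loop would look roughly like this, a sketch assuming Lucene 4+: CharTermAttribute replaces TermAttribute, and the stream must be reset before and closed after iteration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class LuceneUtils4 {

    public static List<String> parseKeywords(Analyzer analyzer, String field, String keywords) {
        List<String> result = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream(field, keywords)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // mandatory since Lucene 4
            while (stream.incrementToken()) {
                result.add(termAtt.toString());
            }
            stream.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory string
        }
        return result;
    }
}
```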