Yup. It's calling my tokenizer. But now it's revealed that my tokenizer is in fact crap!
Caused by: java.lang.IndexOutOfBoundsException
    at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.append(CharTermAttributeImpl.java:131)
    at com.cameraforensics.elasticsearch.plugins.UrlTokenizer.incrementToken(UrlTokenizer.java:30)
Probably because - as there are no docs - I'm doing it wrong.
@Override
public boolean incrementToken() throws IOException {
    if (position >= tokens.size()) {
        return false;
    } else {
        termAtt.setEmpty().append(tokens.get(position), position, position);
        position++;
        return true;
    }
}
tokens is a list of all permutations of index segmentation (as per this: Performance of doc_values field vs analysed field).
I'm not really sure what the two int values on CharTermAttribute#append should be, so I'm guessing - incorrectly.
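For what it's worth, CharTermAttribute.append(CharSequence, int, int) follows the java.lang.Appendable contract: the two ints are start and end offsets into the CharSequence being appended, not positions in the token list. So append(token, position, position) appends the empty range [position, position) at best, and throws IndexOutOfBoundsException as soon as position exceeds the token's length. A minimal sketch of the difference, using StringBuilder as a stand-in since it implements the same Appendable contract (the token values here are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class AppendDemo {
    public static void main(String[] args) {
        // Hypothetical token list, like the permutations the tokenizer builds.
        List<String> tokens = Arrays.asList("http", "http://example", "http://example.com");

        int position = 2;                      // third token in the list
        String token = tokens.get(position);   // "http://example.com"

        StringBuilder termAtt = new StringBuilder();

        // Buggy call: the two ints index into the token itself, so this
        // appends the empty range [2, 2) - and would throw
        // IndexOutOfBoundsException once position > token.length().
        termAtt.setLength(0);
        termAtt.append(token, position, position);
        System.out.println("buggy append:  \"" + termAtt + "\"");  // empty

        // Fix: append the whole token, i.e. the single-arg overload
        // (equivalent to append(token, 0, token.length())).
        termAtt.setLength(0);
        termAtt.append(token);
        System.out.println("fixed append:  \"" + termAtt + "\"");
    }
}
```

In the real tokenizer that would mean `termAtt.setEmpty().append(tokens.get(position));`, which sidesteps the offsets entirely.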
Anyway, thanks for all of your help. I'll keep hacking!