Hi!
We've got an analyzer using a custom tokenizer:
"analyzer" : {
  "my_analyzer" : {
    "tokenizer" : "my_custom_tokenizer",
    "filter" : [ "lowercase" ]
  }
}
The custom tokenizer is as follows (in Scala):
import java.io.BufferedReader
import org.apache.lucene.analysis.Tokenizer
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, OffsetAttribute}

class MyCustomTokenizer extends Tokenizer {
  val termAtt = addAttribute(classOf[CharTermAttribute])
  val offsetAtt = addAttribute(classOf[OffsetAttribute])
  @volatile var tokens: Iterator[String] = null
  @volatile var startOffset = 0

  override def reset(): Unit = {
    super.reset()
    tokens = null
    startOffset = 0
  }

  override def incrementToken(): Boolean = {
    if (tokens == null) {
      // First call after reset(): read the whole input and tokenize it
      val in = new BufferedReader(input)
      val lines = Iterator.continually(in.readLine).takeWhile(_ != null)
      val inputStr = lines.mkString(" ") // join lines with a single space
      tokens = tokenizeIntoIterator(inputStr) // our function that turns text into an Iterator[String]
    }
    if (tokens.hasNext) {
      val token = tokens.next
      val endOffset = startOffset + token.length
      termAtt.setEmpty()
      termAtt.append(token)
      offsetAtt.setOffset(startOffset, endOffset)
      startOffset = endOffset
      true
    } else false
  }
}
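To make the offset bookkeeping easier to follow, here is a minimal standalone sketch of it with no Lucene dependency. The whitespace split is only a hypothetical stand-in for our real tokenizeIntoIterator, and OffsetSketch and offsets are names made up for this illustration:

```scala
// Standalone sketch of the offset bookkeeping done in incrementToken.
// No Lucene involved; names here are illustrative only.
object OffsetSketch {
  // Hypothetical stand-in for our tokenizeIntoIterator: split on whitespace.
  def tokenizeIntoIterator(text: String): Iterator[String] =
    text.split("\\s+").iterator.filter(_.nonEmpty)

  // Returns (token, startOffset, endOffset) triples, computed the same way
  // as in the tokenizer: each token starts where the previous one ended.
  def offsets(text: String): List[(String, Int, Int)] = {
    var startOffset = 0
    tokenizeIntoIterator(text).map { token =>
      val endOffset = startOffset + token.length
      val result = (token, startOffset, endOffset)
      startOffset = endOffset
      result
    }.toList
  }

  def main(args: Array[String]): Unit =
    offsets("foo bar").foreach(println)
}
```

Because only token lengths are accumulated, the separating whitespace is not counted, so under this assumption "bar" in 'foo bar' starts at offset 3 rather than 4.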
Now, the weird thing is that with this my_analyzer we start to get "position" : -1 after a while.
At first, it works as expected:
curl -XGET 'localhost:9200/example/_analyze?pretty=1&analyzer=my_analyzer' -d 'foo bar'
{
  "tokens" : [ {
    "token" : "foo",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bar",
    "start_offset" : 3,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  } ]
}
But pretty soon we start to get responses where position is always -1.
curl -XGET 'localhost:9200/example/_analyze?pretty=1&analyzer=my_analyzer' -d 'foo bar'
{
  "tokens" : [ {
    "token" : "foo",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : -1
  }, {
    "token" : "bar",
    "start_offset" : 3,
    "end_offset" : 6,
    "type" : "word",
    "position" : -1
  } ]
}
What is going on here? Do we have a bug in MyCustomTokenizer? What kind of effect does that -1 have on searches? We're using Elasticsearch 2.3.2.
Any insights would be greatly appreciated.