Analyzer producing tokens having position `-1`


#1

Hi!

We've got an analyzer using a custom tokenizer:

"analyzer" : {
  "my_analyzer" : {
    "tokenizer" : "my_custom_tokenizer",
    "filter" : [ "lowercase" ]
  }
}

The custom tokenizer is as follows (in Scala):

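import java.io.BufferedReader
import org.apache.lucene.analysis.Tokenizer
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, OffsetAttribute}

class MyCustomTokenizer extends Tokenizer {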

  val termAtt = addAttribute(classOf[CharTermAttribute])
  val offsetAtt = addAttribute(classOf[OffsetAttribute])

  @volatile var tokens: Iterator[String] = null
  @volatile var startOffset = 0

  override def reset(): Unit = {
    super.reset()
    tokens = null
    startOffset = 0
  }

  override def incrementToken(): Boolean = {
    if (tokens == null) {
      // set it up
      val in = new BufferedReader(input)
      val lines = Iterator.continually(in.readLine).takeWhile(_ != null)
      val inputStr = lines.mkString(" ") // turn lines to whitespace

      // tokenizeIntoIterator is our own function that turns text into an Iterator[String]
      tokens = tokenizeIntoIterator(inputStr)
    }

    if (tokens.hasNext) {
      val token = tokens.next
      val endOffset = startOffset + token.length

      termAtt.setEmpty()
      termAtt.append(token)
      offsetAtt.setOffset(startOffset, endOffset)
      startOffset = endOffset

      true
    } else false
  }

}

Now, the weird thing is that with my_analyzer we start to get "position" : -1 after a while.

At first, it works as expected:

curl -XGET 'localhost:9200/example/_analyze?pretty=1&analyzer=my_analyzer' -d 'foo bar'
{
  "tokens" : [ {
    "token" : "foo",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bar",
    "start_offset" : 3,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  } ]
}

But pretty soon we start to get responses where position is always -1.

curl -XGET 'localhost:9200/example/_analyze?pretty=1&analyzer=my_analyzer' -d 'foo bar'
{
  "tokens" : [ {
    "token" : "foo",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : -1
  }, {
    "token" : "bar",
    "start_offset" : 3,
    "end_offset" : 6,
    "type" : "word",
    "position" : -1
  } ]
}

What is going on here? Is there a bug in MyCustomTokenizer? What kind of effect does that -1 have on searches? We're using Elasticsearch 2.3.2.

Any insights would be greatly appreciated. :)


#2

Hmmm...
It seems that setting the PositionIncrementAttribute to 1 in reset() fixes the behavior:

override def reset(): Unit = {
  super.reset()
  tokens = null
  startOffset = 0
  Option(getAttribute(classOf[PositionIncrementAttribute])).
    foreach(_.setPositionIncrement(1))
}

Go figure... :)
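
A possible explanation, sketched below as a simplified model rather than real Lucene or Elasticsearch code. It rests on two assumptions. First, the tokenizer instance is reused between requests, and finishing a stream leaves the position increment at 0 until something sets it back to 1. Second, the consumer starts at position -1 and adds each token's increment. Every name in the sketch (PositionDemo, PosInc, ModelTokenizer, positions) is hypothetical and exists only for illustration.

```scala
// Simplified model of the suspected mechanism. This is NOT Lucene code:
// every class and method name here is hypothetical.
object PositionDemo {

  // Stands in for the position increment attribute:
  // clear() restores the default of 1, end() leaves it at 0.
  final class PosInc {
    var value: Int = 1
    def clear(): Unit = { value = 1 }
    def end(): Unit = { value = 0 }
  }

  // Stands in for a tokenizer instance that is reused across requests.
  // With clearPerToken = false it never restores the increment,
  // which mirrors the tokenizer in post #1.
  final class ModelTokenizer(clearPerToken: Boolean) {
    val posInc = new PosInc
    private var tokens: Iterator[String] = Iterator.empty

    def reset(text: String): Unit = {
      tokens = text.split("\\s+").iterator.filter(_.nonEmpty)
    }

    def incrementToken(): Option[String] = {
      if (clearPerToken) posInc.clear()
      if (tokens.hasNext) Some(tokens.next()) else None
    }

    def end(): Unit = posInc.end()
  }

  // Assumed consumer logic: start at -1 and add each positive increment.
  def positions(tokenizer: ModelTokenizer, text: String): List[Int] = {
    tokenizer.reset(text)
    var position = -1
    val out = List.newBuilder[Int]
    while (tokenizer.incrementToken().isDefined) {
      val inc = tokenizer.posInc.value
      if (inc > 0) position += inc
      out += position
    }
    tokenizer.end() // leaves the increment at 0 for the next use
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val buggy = new ModelTokenizer(clearPerToken = false)
    println(positions(buggy, "foo bar")) // List(0, 1)
    println(positions(buggy, "foo bar")) // List(-1, -1)

    val fixed = new ModelTokenizer(clearPerToken = true)
    println(positions(fixed, "foo bar")) // List(0, 1)
    println(positions(fixed, "foo bar")) // List(0, 1)
  }
}
```

If this model matches what happens, the change to reset() above works because it restores the increment before each reuse. In the real tokenizer, the more conventional equivalent would probably be to call clearAttributes() as the first statement of incrementToken(), which the Lucene TokenStream contract expects from tokenizers and which resets the increment to its default of 1 for every token.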

