Handling offsets in a custom Lucene tokenizer for multi-value fields


#1

Hi!

We have a custom Lucene tokenizer that, after upgrading from Elasticsearch 5.5.2 to 6.1.2, no longer seems to handle offsets correctly when the indexed field has multiple values.

For example, when indexing the document field children.name when there is more than one child:

  {
    "name": "Foo",
    "children": [
      {"name": "Foo Bar"},
      {"name": "Bar Xyzzy"}
    ]
  }

we receive an error saying:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=1,endOffset=3,lastStartOffset=3 for field 'children.name.our_custom_analyzer'

This is basically our Tokenizer (in Scala) which used to work in Elasticsearch 5.5.2:

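import java.io.BufferedReader
import org.apache.lucene.analysis.Tokenizer
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, OffsetAttribute, PositionIncrementAttribute}

class MyCustomTokenizer extends Tokenizer {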

  val termAtt = addAttribute(classOf[CharTermAttribute])
  val offsetAtt = addAttribute(classOf[OffsetAttribute])

  @volatile var tokens: Iterator[String] = null
  @volatile var startOffset = 0

  override def reset(): Unit = {
    super.reset()
    tokens = null
    startOffset = 0
    Option(getAttribute(classOf[PositionIncrementAttribute])).
      foreach(_.setPositionIncrement(1))
  }

  override def incrementToken(): Boolean = {
    if (tokens == null) {
      val in = new BufferedReader(input)
      val lines = Iterator.continually(in.readLine).takeWhile(_ != null)
      val inputStr = lines.mkString(" ") // turn lines to whitespace

      val tokenList: List[String] = OurCodeThatConvertsInputTextIntoListOfTokens.doIt(inputStr)

      tokens = tokenList.iterator
    }

    if (tokens.hasNext) {
      val token = tokens.next
      val endOffset = startOffset + token.length

      termAtt.setEmpty().append(token)
      offsetAtt.setOffset(startOffset, endOffset)
      startOffset = endOffset

      true
    } else false
  }

}

So what are we doing wrong here? If reset() is called between children (in this case Foo Bar and Bar Xyzzy), how are we supposed to track the "correct" offset?


#2

OK, I guess I found the reason. We should override end(): Unit to report the final offset:

  override def end(): Unit = {
    val ofs = correctOffset(startOffset - 1)
    offsetAtt.setOffset(ofs, ofs)
  }

Now I see the response I was aiming for:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'{"analyzer" : "our_custom_analyzer","text" : ["Foo Bar", "#xyzzy"]}'
{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "#",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "xyzzy",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 4
    }
  ]
}
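
For anyone wondering why overriding end() changes the result, here is a small, Lucene-free sketch of how I understand the offsets to be accumulated across the values of one field. It is a simplified model, not the actual indexing code: the names (OffsetAccumulation, index, finalOffset) are made up for illustration, and it assumes that after each value the final offset reported by end() plus an offset gap of 1 is added to a running base, and that a tokenizer without an end() override reports a final offset of 0.

```scala
// Simplified model (an assumption, not Lucene's real code) of how offsets are
// accumulated across the values of a multi-valued field.
object OffsetAccumulation {
  final case class Tok(start: Int, end: Int)

  // Mimics the tokenizer in the post: offsets are running sums of token lengths,
  // starting again from 0 for every value (because reset() is called per value).
  def localTokens(tokens: List[String]): List[Tok] =
    tokens.scanLeft(Tok(0, 0))((prev, t) => Tok(prev.end, prev.end + t.length)).tail

  // finalOffset models what end() reports, given the end offset of the last token.
  // Returns the tokens with field-wide offsets, or an error if a start offset goes backwards.
  def index(values: List[List[String]],
            finalOffset: Int => Int,
            offsetGap: Int = 1): Either[String, List[Tok]] = {
    var base = 0        // offset carried over from the previous values
    var lastStart = 0   // start offset of the previously accepted token
    var error: Option[String] = None
    val out = List.newBuilder[Tok]
    for (value <- values if error.isEmpty) {
      val toks = localTokens(value)
      for (t <- toks if error.isEmpty) {
        val start = base + t.start
        if (start < lastStart)
          error = Some(s"offsets must not go backwards startOffset=$start,lastStartOffset=$lastStart")
        else {
          lastStart = start
          out += Tok(start, base + t.end)
        }
      }
      val lastEnd = toks.lastOption.map(_.end).getOrElse(0)
      base += finalOffset(lastEnd) + offsetGap
    }
    error.toLeft(out.result())
  }
}
```

With finalOffset = _ => 0 (no end() override), the values List("foo", "bar") and List("bar", "xyzzy") give Left("offsets must not go backwards startOffset=1,lastStartOffset=3"), which has the same startOffset and lastStartOffset as the error in the first post. With finalOffset = _ - 1 (the end() above), List("foo", "bar") and List("#", "xyzzy") give offsets (0,3), (3,6), (6,7), (7,12), the same as the _analyze response.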

However, is there some rule of thumb about not using the same startOffset as the previous position's endOffset? Because for example the standard analyzer seems to always have nextStartOffset = previousEndOffset + 1:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'{"analyzer" : "standard","text" : ["Foo Bar"]}'
{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
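
As far as I can tell, this is not a rule about leaving a gap. The standard analyzer's offsets appear to be character positions in the original text, so the +1 is simply the space between "Foo" and "Bar". A minimal sketch of that idea, assuming plain whitespace splitting (CharOffsets and tokenize are hypothetical names, and this is not how the standard tokenizer is implemented):

```scala
// Sketch: offsets taken from the token's position in the input text,
// rather than from a running sum of token lengths.
object CharOffsets {
  final case class Tok(term: String, start: Int, end: Int)

  // Splits on whitespace and records where each token starts and ends in the input.
  def tokenize(text: String): List[Tok] =
    "\\S+".r.findAllMatchIn(text)
      .map(m => Tok(m.matched.toLowerCase, m.start, m.end))
      .toList
}
```

CharOffsets.tokenize("Foo Bar") yields foo at (0,3) and bar at (4,7), the same offsets as the standard analyzer response above. Our tokenizer sums token lengths instead, so the skipped whitespace never shows up in its offsets.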
