Handling offsets in a custom Lucene tokenizer for multi-value fields

Pyppe · January 29, 2018, 9:11am

Hi!

We have a custom Lucene tokenizer that, after upgrading from Elasticsearch 5.5.2 to 6.1.2, no longer seems to handle offsets correctly when the indexed field has multiple values.

For example, when we try to index the document field children.name and there is more than one child:

  {
    "name": "Foo",
    "children": [
      {"name": "Foo Bar"},
      {"name": "Bar Xyzzy"}
    ]
  }

we receive an error saying:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=1,endOffset=3,lastStartOffset=3 for field 'children.name.our_custom_analyzer'

This is basically our Tokenizer (in Scala) which used to work in Elasticsearch 5.5.2:

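import java.io.BufferedReader
import org.apache.lucene.analysis.Tokenizer
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, OffsetAttribute, PositionIncrementAttribute}

class MyCustomTokenizer extends Tokenizer {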

  val termAtt = addAttribute(classOf[CharTermAttribute])
  val offsetAtt = addAttribute(classOf[OffsetAttribute])

  @volatile var tokens: Iterator[String] = null
  @volatile var startOffset = 0

  override def reset(): Unit = {
    super.reset()
    tokens = null
    startOffset = 0
    Option(getAttribute(classOf[PositionIncrementAttribute])).
      foreach(_.setPositionIncrement(1))
  }

  override def incrementToken(): Boolean = {
    if (tokens == null) {
      val in = new BufferedReader(input)
      val lines = Iterator.continually(in.readLine).takeWhile(_ != null)
      val inputStr = lines.mkString(" ") // turn lines to whitespace

      val tokenList: List[String] = OurCodeThatConvertsInputTextIntoListOfTokens.doIt(inputStr)

      tokens = tokenList.iterator
    }

    if (tokens.hasNext) {
      val token = tokens.next
      val endOffset = startOffset + token.length

      termAtt.setEmpty().append(token)
      offsetAtt.setOffset(startOffset, endOffset)
      startOffset = endOffset

      true
    } else false
  }

}

So what are we doing wrong here? If reset() is called between children (in this case Foo Bar and Bar Xyzzy), how are we supposed to track the "correct" offset?

Pyppe · January 29, 2018, 9:42am

OK, I think I found the reason. We should override end(): Unit to set the final offset:

  override def end(): Unit = {
    val ofs = correctOffset(startOffset - 1)
    offsetAtt.setOffset(ofs, ofs)
  }
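
One way to see why the end() override helps is to model how offsets are combined across the values of a multi-valued field. As far as I understand it, a running base is added to every token offset, and after each value that base advances by the final offset reported in end() plus the analyzer's offset gap; without an override, end() leaves the final offset at 0. The sketch below is a simplified model of that bookkeeping, not Lucene code: ValueStream and firstBackwardsOffset are hypothetical names, and the offset gap of 1 is an assumption.

```scala
// Simplified model of offset accumulation across the values of one field.
// ValueStream and firstBackwardsOffset are hypothetical, not Lucene API.
final case class ValueStream(tokens: List[(Int, Int)], finalOffset: Int)

/** Returns the first absolute (startOffset, endOffset, lastStartOffset)
  * that goes backwards, or None if all offsets are monotonic. */
def firstBackwardsOffset(values: List[ValueStream],
                         offsetGap: Int = 1): Option[(Int, Int, Int)] = {
  var base = 0      // sum of final offsets and gaps of the previous values
  var lastStart = 0 // absolute start offset of the previous token
  var violation: Option[(Int, Int, Int)] = None
  for (value <- values if violation.isEmpty) {
    for ((start, end) <- value.tokens if violation.isEmpty) {
      val absStart = base + start
      val absEnd = base + end
      if (absStart < lastStart || absEnd < absStart)
        violation = Some((absStart, absEnd, lastStart))
      else
        lastStart = absStart
    }
    // The final offset set in end() decides where the next value begins.
    base += value.finalOffset + offsetGap
  }
  violation
}

// "Foo Bar" and "Bar Xyzzy" with the whitespace-free offsets of the tokenizer above.
val withoutEnd = List(
  ValueStream(List((0, 3), (3, 6)), finalOffset = 0),
  ValueStream(List((0, 3), (3, 8)), finalOffset = 0))
// Same tokens, but with the final offset from the override (startOffset - 1 = 5).
val withEnd = withoutEnd.map(_.copy(finalOffset = 5))
```

In this model, firstBackwardsOffset(withoutEnd) reports a token starting at 1 while the last start offset was 3, which is the same kind of "offsets must not go backwards" situation as in the error. firstBackwardsOffset(withEnd) finds no violation, because the second value starts at base 6.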

Now I see the response I was aiming for:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'{"analyzer" : "our_custom_analyzer","text" : ["Foo Bar", "#xyzzy"]}'
{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "#",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "xyzzy",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 4
    }
  ]
}

However, is there a rule of thumb against reusing the previous token's endOffset as the next token's startOffset? The standard analyzer, for example, always seems to produce nextStartOffset = previousEndOffset + 1:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'{"analyzer" : "standard","text" : ["Foo Bar"]}'
{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
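
As far as I can tell, the standard analyzer's offsets are character positions in the original text, so the + 1 here is simply the space between "Foo" and "Bar" and not a rule about adjacent tokens. A minimal sketch of deriving offsets that way, assuming every token still appears in the input up to letter case; locateTokens and indexOfIgnoreCase are hypothetical helpers, not part of Lucene:

```scala
/** Case-insensitive indexOf that never changes string lengths. */
def indexOfIgnoreCase(text: String, token: String, from: Int): Int =
  (from to text.length - token.length)
    .find(i => text.regionMatches(true, i, token, 0, token.length))
    .getOrElse(-1)

/** Locates each token in `text`, searching forward from the end of the
  * previous match, and returns (token, startOffset, endOffset). */
def locateTokens(text: String, tokens: List[String]): List[(String, Int, Int)] = {
  var searchFrom = 0
  tokens.flatMap { token =>
    val start = indexOfIgnoreCase(text, token, searchFrom)
    if (start < 0) None // token was rewritten; no reliable offset, so skip it
    else {
      val end = start + token.length
      searchFrom = end // the next token cannot start before this one ends
      Some((token, start, end))
    }
  }
}
```

With this approach locateTokens("Foo Bar", List("foo", "bar")) gives offsets 0-3 and 4-7, matching the standard analyzer output above, while locateTokens("#xyzzy", List("#", "xyzzy")) gives 0-1 and 1-6, so tokens that touch in the input do share a boundary.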

system · February 26, 2018, 9:42am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
