For example, I have the following settings:
PUT test_search_05
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle": {
          "tokenizer": "standard",
          "filter": [ "my_shingle_filter" ]
        }
      },
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": false
        }
      }
    }
  }
}
Next, I analyze some text:
GET test_search_05/_analyze
{
  "analyzer": "shingle",
  "text": "quick brown fox"
}
I see:
{
  "tokens" : [
    {
      "token" : "quick brown",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "quick brown fox",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "brown fox",
      "start_offset" : 6,
      "end_offset" : 15,
      "type" : "shingle",
      "position" : 1
    }
  ]
}
The positionLength is wrong in every token, one less than it should be: "quick brown" and "brown fox" each span two positions but carry no positionLength at all (so it defaults to 1), and "quick brown fox" spans three positions but reports positionLength 2.
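For reference, this is the output I would expect, with each shingle's positionLength equal to the number of source-token positions it covers (these values are hand-computed from the token spans, not actual Elasticsearch output):

{
  "tokens" : [
    {
      "token" : "quick brown",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "quick brown fox",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "brown fox",
      "start_offset" : 6,
      "end_offset" : 15,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 2
    }
  ]
}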
Without {"output_unigrams": false} it works correctly.
How can I resolve this problem?