Length filter: array index out of bounds exception

I encountered a strange issue where a query fails depending on the position of a term, but only when the length filter is active (see below):

  1. works:
GET test/_search?filter_path=**.productNumber
{
  "query": {
    "match": {
      "productNumber": {
        "query": "abc def ghij 3d"
      }
    }
  }
}
  2. fails:
GET test/_search?filter_path=**.productNumber
{
  "query": {
    "match": {
      "productNumber": {
        "query": "abc def 3d ghij"
      }
    }
  }
}

Exception:

          "caused_by" : {
            "type" : "array_index_out_of_bounds_exception",
            "reason" : "Index 0 out of bounds for length 0"
          }

ES versions tested: 7.5, 7.6

Index settings:

PUT test
{
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "0",
    "analysis": {
      "filter": {
        "length_min_2": {
          "type": "length",
          "min": 2
        },
        "word_split_product_number": {
          "type": "word_delimiter_graph",
          "split_on_numerics": true,
          "generate_number_parts": true,
          "catenate_words": true,
          "catenate_numbers": true,
          "catenate_all": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "word_split_product_number_analyzer": {
          "filter": [
            "lowercase",
            "word_split_product_number",
            "length_min_2"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "productNumber": {
        "type": "text",
        "analyzer": "word_split_product_number_analyzer"
      }
    }
  }
}
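
To see what the analyzer produces for the failing text, the token stream can be inspected with _analyze (using the analyzer defined above). The word_delimiter_graph filter splits "3d" into "3" and "d" (plus the catenated/preserved "3d"), and the length filter then drops the one-character tokens out of the middle of the token graph, which is presumably what trips up the query:

GET test/_analyze
{
  "analyzer": "word_split_product_number_analyzer",
  "text": "abc def 3d ghij"
}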

Test docs:

PUT test/_bulk
{"index":{}}
{"productNumber":"ABC-DEF-GHIJ-3A"}
{"index":{}}
{"productNumber":"ABC-DEF-GHIJ-3B"}
{"index":{}}
{"productNumber":"ABC-DEF-GHIJ-3C"}
{"index":{}}
{"productNumber":"ABC-DEF-GHIJ-3D"}

I found the following workaround, but it would be great to have the length filter working too.
Instead of:

        "length_min_2": {
          "type": "length",
          "min": 2
        },

use these:

        "stop_empty": {
          "type": "stop",
          "stopwords": [ "" ]
        },
        "pattern_length_min_2": {
          "type": "pattern_replace",
          "pattern": "^.$",
          "replacement": ""
        },

...
      "analyzer": {
        "word_split_product_number_analyzer": {
          "filter": [
            "lowercase"
            ,"word_split_product_number"
            ,"pattern_length_min_2"
            ,"stop_empty"
            ,"unique"
          ],
          "tokenizer": "whitespace"
        }
      }

The "workaround" above does not work after all :frowning:
I'm getting the same error as above. It seems I made some mistakes during my earlier testing...

UPDATE:
Today's workaround: use the combination word_delimiter (not the graph variant!) + flatten_graph.
So, is this a bug in the graph token stream handling?
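
Roughly, the analyzer now looks like this (same filter options as in the original settings, only the type changed to word_delimiter and flatten_graph added to the chain; the exact position of flatten_graph is just how I set it up, treat this as a sketch):

      "filter": {
        "length_min_2": {
          "type": "length",
          "min": 2
        },
        "word_split_product_number": {
          "type": "word_delimiter",
          "split_on_numerics": true,
          "generate_number_parts": true,
          "catenate_words": true,
          "catenate_numbers": true,
          "catenate_all": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "word_split_product_number_analyzer": {
          "filter": [
            "lowercase",
            "word_split_product_number",
            "flatten_graph",
            "length_min_2"
          ],
          "tokenizer": "standard"
        }
      }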

Any feedback from the ES engineers? Thanks!

I opened https://github.com/elastic/elasticsearch/issues/54434, as the least that should happen is either a proper exception or a fix :slight_smile:


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.