Position offsets incorrect on analysis - not getting results from a phrase query using query_string


(Rory) #1

Hi, I've got a weird situation: I expected a phrase query to return a result, but I'm getting nothing back.
This happens in both 6.2.3 and 6.3.
The index is set up as follows:

PUT testing
{
  "mappings": {
    "raw": {
      "_source": {
        "includes": [
          "*"
        ],
        "excludes": [
          "OntoAll"
        ]
      },
      "properties": {
        "OntoID": {
          "type": "keyword"
        },
        "OntoAll": {
          "type": "text"
        },
        "OntoFields": {
          "type": "nested",
          "properties": {
            "key": {
              "type": "keyword"
            },
            "value": {
              "type": "text",
              "copy_to": [
                "OntoAll"
              ]
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "OntoFilter": {
            "split_on_numerics": "true",
            "generate_word_parts": "true",
            "preserve_original": "true",
            "catenate_words": "false",
            "generate_number_parts": "true",
            "catenate_all": "false",
            "split_on_case_change": "true",
            "type": "word_delimiter_graph",
            "catenate_numbers": "false"
          }
        },
        "analyzer": {
          "default": {
            "filter": [
              "OntoFilter",
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          }
        }
      }
    }
  }
}

The document is added as follows:

PUT testing/raw/id1
{
  "OntoID": "S8371",
  "OntoFields": {
    "key": "prop",
    "value": "SAL_S8371 - SAL SUBTEND 0001"
  }
}

Running the query below returns 0 results:

GET testing/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "query_string": {
      "query": "\"SAL_S8371 - SAL SUBTEND 0001\"",
      "fields": [
        "OntoAll"
      ],
      "tie_breaker": 0,
      "default_operator": "and"
    }
  }
}

I did notice something strange when analysing the query text: right after the '-' token (position 3), the token positions jump straight to 5, so position 4 is skipped.

GET testing/_analyze
{
  "text": [
    "SAL_S8371 - SAL SUBTEND 0001"
  ]
}

{
  "tokens": [
    {
      "token": "sal_s8371",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0,
      "positionLength": 3
    },
    {
      "token": "sal",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "8371",
      "start_offset": 5,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "-",
      "start_offset": 10,
      "end_offset": 11,
      "type": "word",
      "position": 3
    },
    {
      "token": "sal",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 5
    },
    {
      "token": "subtend",
      "start_offset": 16,
      "end_offset": 23,
      "type": "word",
      "position": 6
    },
    {
      "token": "0001",
      "start_offset": 24,
      "end_offset": 28,
      "type": "word",
      "position": 7
    }
  ]
}

Without the '-' character in the query string, the positions increment as expected: 0 to 5 with no gaps.
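
That comparison can be reproduced with a request along these lines (a sketch: the same text as above with only the '-' token removed):

GET testing/_analyze
{
  "text": [
    "SAL_S8371 SAL SUBTEND 0001"
  ]
}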

After noticing this, I added "phrase_slop": 1 to the query and I do then get the result back, but surely this shouldn't be required?
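
For reference, the query with the slop added looks like this. The only change from the original search request is the phrase_slop line. Presumably a slop of 1 is enough because it absorbs the single skipped position, but that is my assumption rather than something I have confirmed:

GET testing/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "query_string": {
      "query": "\"SAL_S8371 - SAL SUBTEND 0001\"",
      "fields": [
        "OntoAll"
      ],
      "tie_breaker": 0,
      "default_operator": "and",
      "phrase_slop": 1
    }
  }
}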

For info, the _source excludes in the mapping are there because that field gets populated directly as well as via the copy_to in the nested document. Could this be the cause of the position skipping and the query failing?

Thanks


(Rory) #2

This turned out to be a bug in Lucene's word_delimiter_graph analysis; see https://issues.apache.org/jira/browse/LUCENE-8395

