Not getting results from a phrase query using query_string of the form 'X A1 ABC' in 6.6.0

(Dan) #1

Hi,

I have a document containing the following text

X A1 ABC

When searching for this text using phrase query

"X A1 ABC"

we fail to find the result.

PUT 6a4d8bd1a4d67152c0edd375c996b319
{
  "mappings": {
    "raw": {
      "_source": {
        "includes": [
          "*"
        ],
        "excludes": [
          "OntoAll"
        ]
      },
      "properties": {
        "OntoID": {
          "type": "keyword"
        },
        "OntoAll": {
          "type": "text"
        },
        "OntoFields": {
          "type": "nested",
          "properties": {
            "key": {
              "type": "keyword"
            },
            "value": {
              "type": "text",
              "copy_to": [
                "OntoAll"
              ]
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "OntoFilter": {
            "split_on_numerics": "true",
            "generate_word_parts": "true",
            "preserve_original": "true",
            "catenate_words": "false",
            "generate_number_parts": "true",
            "catenate_all": "false",
            "split_on_case_change": "true",
            "type": "word_delimiter_graph",
            "catenate_numbers": "false"
          }
        },
        "analyzer": {
          "default": {
            "filter": [
              "OntoFilter",
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          }
        }
      }
    }
  }
}

and the document is added as follows:

PUT 6a4d8bd1a4d67152c0edd375c996b319/raw/id1
{
  "OntoID": "S8371",
  "OntoFields": {
    "key": "prop",
    "value": "X A1 ABC"
  }
}

Running the query below gives 0 results

GET 6a4d8bd1a4d67152c0edd375c996b319/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "query_string": {
      "query": "\"X A1 ABC\"",
      "fields": [
        "OntoAll"
      ],
      "tie_breaker": 0,
      "default_operator": "and"
    }
  }
}

Analysis of the phrase doesn't show anything strange about the positions as far as I can see: there is a clear path through these tokens that could be satisfied to produce a result. We see 'A', '1', and 'A1' tokenized as expected:

GET 6a4d8bd1a4d67152c0edd375c996b319/_analyze
{
  "text" : "X A1 ABC",
  "explain" : true
}

<<See the second post below for the analyzer output, owing to the 7000-character limit of this post.>>

As an experiment, using a hyphen as a delimiter between 'A' and '1' in the query:
"X A-1 ABC"
seems to work fine, so I suspect something about Lucene's WordDelimiterGraphFilter may play a part in this issue.

Note "X A1" returns a result, as does "A1 ABC"

Kind regards

Dan

(Dan) #2

Analysis here:

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "whitespace",
      "tokens" : [
        {
          "token" : "X",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0,
          "bytes" : "[58]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "A1",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1,
          "bytes" : "[41 31]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "ABC",
          "start_offset" : 5,
          "end_offset" : 8,
          "type" : "word",
          "position" : 2,
          "bytes" : "[41 42 43]",
          "positionLength" : 1,
      "termFrequency" : 1
    }
  ]
},
"tokenfilters" : [
  {
    "name" : "OntoFilter",
    "tokens" : [
      {
        "token" : "X",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "word",
        "position" : 0,
        "bytes" : "[58]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "A1",
        "start_offset" : 2,
        "end_offset" : 4,
        "type" : "word",
        "position" : 1,
        "positionLength" : 2,
        "bytes" : "[41 31]",
        "keyword" : false,
        "positionLength" : 2,
        "termFrequency" : 1
      },
      {
        "token" : "A",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "word",
        "position" : 1,
        "bytes" : "[41]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "1",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "word",
        "position" : 2,
        "bytes" : "[31]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "ABC",
        "start_offset" : 5,
        "end_offset" : 8,
        "type" : "word",
        "position" : 3,
        "bytes" : "[41 42 43]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      }
    ]
  },
  {
    "name" : "lowercase",
    "tokens" : [
      {
        "token" : "x",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "word",
        "position" : 0,
        "bytes" : "[78]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "a1",
        "start_offset" : 2,
        "end_offset" : 4,
        "type" : "word",
        "position" : 1,
        "positionLength" : 2,
        "bytes" : "[61 31]",
        "keyword" : false,
        "positionLength" : 2,
        "termFrequency" : 1
      },
      {
        "token" : "a",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "word",
        "position" : 1,
        "bytes" : "[61]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "1",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "word",
        "position" : 2,
        "bytes" : "[31]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "abc",
        "start_offset" : 5,
        "end_offset" : 8,
        "type" : "word",
        "position" : 3,
        "bytes" : "[61 62 63]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      }
    ]
  }
]
  }
}
(Mark Harwood) #3

Hi Dan,
You can run the _explain API on the doc you expect to match, as follows:

GET test/raw/id1/_explain
{
  "query": {
    "query_string": {
      "query": "\"X A1 ABC\"",
      "fields": [
        "OntoAll"
      ]
    }
  }
}

Part of that response shows:

"no match on required clause (spanNear([OntoAll:x, spanOr([OntoAll:a1, spanNear([OntoAll:a, OntoAll:1], 0, true)]), OntoAll:abc], 0, true))"

The problem is that the distance between X and ABC is 3 positions, not 2. The splitting of A1 into A and 1 has introduced an extra position into the token stream, widening the gap between X and ABC. The solution is to allow a slop factor in the query, e.g.

GET test/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "query_string": {
      "query": "\"X A1 ABC\"~1",
      "fields": [
        "OntoAll"
      ],
      "tie_breaker": 0,
      "default_operator": "and"
    }
  }
}
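
For reference, the same slop can be expressed with a match_phrase query. This is a sketch of an equivalent request, assuming the same OntoAll field and index:

GET test/_search
{
  "query": {
    "match_phrase": {
      "OntoAll": {
        "query": "X A1 ABC",
        "slop": 1
      }
    }
  }
}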
(Dan) #4

We tried with a slop factor of 1 and noticed that it returns a result, but shouldn't there be a result still with a slop factor of 0 based on the discovery of the following token sequence:
x : position 0
a : position 1
1 : position 2
abc : position 3

This is a sequence of tokens at positions 0, 1, 2, 3 which covers the search term in full, so why does this not produce a result? (My understanding is that you need a chain of tokens covering the query, with positions no further apart than the slop factor, in order to return a result.)

(Mark Harwood) #5

I think it's to do with the logical grouping and how it is interpreted. The span description from the explain output I posted has this pseudo-code:

X near (A1 OR (A near 1)) near ABC

The start position of A1 is 1, whereas the alternative path of (A near 1) is also rooted at position 1. Once the arbitrarily complex set of alternative paths has been resolved, it looks like the logic assumes that this advances the match to the start point of the path, not to the end point of the longest-matching route.

(Dan) #6

Thank you for your prompt response. I'm still not clear on why the search query "X A-1 ABC" over the indexed text "X A1 ABC" works (with a hyphen, or any other delimiter, introduced between A and 1 in the query only), even though the index and query produce the same tokens. Also, the query plans for "X A1 ABC" and "X A-1 ABC" are exactly the same.

So for completeness:

Doesn't match:
Index: X A1 ABC
Search: "X A1 ABC"

Doesn't match:
Index: X A-1 ABC
Search: "X A-1 ABC"

Matches:
Index: X A1 ABC
Search: "X A-1 ABC"

Matches:
Index: X A-1 ABC
Search: "X A1 ABC"

From our perspective, these queries all go through the word delimiter filter, have the same plan, and are structurally the same in terms of their tokenization, so can you explain what logic governs the difference between the matching and non-matching cases?

Kind regards

Dan

(Mark Harwood) #7

Ugh. So the text description in the explain API is the same, but there must be some other discrepancy in the actual Lucene Query objects being used under the covers for those two requests. Maybe it just comes down to whether A1 or (A near 1) appears first in the list of alternative paths to be explored.

(Dan) #8

The problem is that the distance between X and ABC is 3 positions not 2. The splitting of A1 into A and 1 has introduced an extra position into the token stream, widening the gap between X and ABC. The solution is to allow a slop factor ...

Just on this: if we look at the _analyze output, our expectation was that the spanNear would make use of positionLength to bridge the position gap 0, 1, 3 without needing a different slop value.

From the output below, the positionLength of A1 is set to 2, so we would expect spanNear to use that to know it should jump from position 1 to position 3. Is this correct?

"tokens" : [
      {
        "token" : "X",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "word",
        "position" : 0,
        "bytes" : "[58]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "A1",
        "start_offset" : 2,
        "end_offset" : 4,
        "type" : "word",
        "position" : 1,
        "positionLength" : 2,
        "bytes" : "[41 31]",
        "keyword" : false,
        "positionLength" : 2,
        "termFrequency" : 1
      },
      {
        "token" : "A",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "word",
        "position" : 1,
        "bytes" : "[41]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "1",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "word",
        "position" : 2,
        "bytes" : "[31]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      },
      {
        "token" : "ABC",
        "start_offset" : 5,
        "end_offset" : 8,
        "type" : "word",
        "position" : 3,
        "bytes" : "[41 42 43]",
        "keyword" : false,
        "positionLength" : 1,
        "termFrequency" : 1
      }
    ]

Just for completeness, I have captured the original query in full, just in case any of the settings on the query affect it:

_search
{
  "query": {
    "query_string": {
      "query": "\"X A1 ABC\"",
      "fields": [
        "OntoAll^1.0"
      ],
      "type": "best_fields",
      "tie_breaker": 0.0,
      "default_operator": "and",
      "max_determinized_states": 10000,
      "enable_position_increments": true,
      "fuzziness": "AUTO",
      "fuzzy_prefix_length": 0,
      "fuzzy_max_expansions": 50,
      "phrase_slop": 0,
      "escape": false,
      "auto_generate_synonyms_phrase_query": true,
      "fuzzy_transpositions": true,
      "boost": 1.0
    }
  },
  "profile": "true", 
  "explain": true,
  "_source": {
    "includes": [],
    "excludes": []
  },
  "highlight": {
    "pre_tags": [
      "<B>"
    ],
    "post_tags": [
      "</B>"
    ],
    "require_field_match": false,
    "encoder": "html",
    "fields": {
      "OntoFields.value": {}
    }
  }
}
(Dan) #9

Ugh. So the text description in the explain API is the same but there must be some other discrepancy in the actual Lucene Query objects that are actually being used under the covers for those 2 requests. Maybe it just comes down to if A1 or A near 1 appear first in the list of alternative paths to be explored.

Do you think this is a question for Lucene? (Do you see this as a bug, or does the behaviour at least strike you as odd?)

(Mark Harwood) #10

I'm told this looks like a combination of a couple of Lucene issues:

https://issues.apache.org/jira/browse/LUCENE-7398
https://issues.apache.org/jira/browse/LUCENE-7848

(Dan) #11

OK, it's somewhat reassuring that this is a known issue. I'll wait on the integration of those fixes. As an aside, we ran the test against an old version of Lucene (3.6) and it worked fine, so this looks like some sort of regression.

Thanks very much for your help Mark.

(Michael Gibney) #12

In case it might be helpful, I thought I'd try to further explain some of the behavior observed here:

This worked in older versions because, prior to Lucene 6.5 (see LUCENE-7699), all possible paths through a given query were enumerated and a separate PhraseQuery was run for each path.

Starting with Lucene 6.5, SpanNearQuery was used instead for these cases, which is more efficient (it evaluates possible paths dynamically against the index). But SpanNearQuery also evaluates lazily, to the extent that it misses some valid matches.

The reason this works in your case for "X A-1 ABC" is that your word delimiter is configured mostly to split, but "preserve_original": true causes the missed match (and the differing behavior). The exact match on "A1" hits, which prevents the split tokens ("A" "1") from matching. When you introduce a dash (or another delimiter), "A-1" (the original token from the query) does not match, and thus does not prevent the split tokens ("A" "1") from matching.
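
If that is the mechanism, one possible workaround (a sketch only, untested against this exact mapping; the OntoSearchFilter name is made up) might be to keep preserve_original at index time but drop it at search time, by defining a default_search analyzer whose word delimiter does not preserve the original token, so the unsplit query token cannot pre-empt the split path:

PUT 6a4d8bd1a4d67152c0edd375c996b319
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "OntoSearchFilter": {
            "type": "word_delimiter_graph",
            "split_on_numerics": "true",
            "generate_word_parts": "true",
            "generate_number_parts": "true",
            "split_on_case_change": "true",
            "preserve_original": "false",
            "catenate_words": "false",
            "catenate_numbers": "false",
            "catenate_all": "false"
          }
        },
        "analyzer": {
          "default_search": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "OntoSearchFilter",
              "lowercase"
            ]
          }
        }
      }
    }
  }
}

With preserve_original off at query time, "X A1 ABC" should analyze only via the split path, to x / a / 1 / abc at positions 0, 1, 2, 3, which lines up with the indexed positions. The trade-off is that the query no longer contains the original unsplit token, so matching relies entirely on the split parts.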

There's another issue relevant to this case, which could be viewed as a sub-issue of LUCENE-7398:
https://issues.apache.org/jira/browse/LUCENE-4312

An analogous issue is also currently being discussed on the Solr mailing list.