Match_phrase_prefix acts erratically when there is a filter

(Pedro Lopes) #1

I'm trying to make sense of the behaviour of this query:

GET /categories/category/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "classification_system_id": 1
        }
      },
      "must": {
        "match_phrase_prefix": {
          "category_id": {
            "query": "13",
            "max_expansions": 20
          }
        }
      }
    }
  }
}

The term filter matches about 200 out of 26000 documents in the index.

The results of the match_phrase_prefix query are highly unpredictable: it returns fewer results than it should for some prefixes but not others. For example, a search for prefix "13" should find 10 documents but returns only 1, whereas prefix "19" returns all 11 results as it should. Other values are equally unpredictable. If the filter is removed the problem goes away, as far as I can tell.

Things improve only by bumping max_expansions way up. Getting the expected results for two-character prefixes requires max_expansions of around 200, and for one-character prefixes around 1000!

Is this normal? If so, what is the proper solution? Setting max_expansions to such high numbers appears to be a bad idea, if the documentation is to be believed.

(Isabel Drost-Fromm) #2

Also, posting the documents you indexed (or a minimal set of them) would help reproduce your issue.

In general I think the following issue might be relevant to your question:

(Pedro Lopes) #3

Yes, that appears to be the same problem. Thank you for the pointer.

(Nik Everett) #4

The trouble with max expansions is that they are done against the full
terms dictionary. It'd be super expensive to do them in a filtered context.
If you really want a prefix search you can analyze the string with edge
ngrams. That makes for a larger index but there is no expansion to do so
the querying is faster. You'd set the index_analyzer with edge_ngrams and
the search analyzer without.
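
As a rough illustration of the edge ngram approach, here is a minimal Python sketch. The names `prefix_grams`, `prefix_index`, `edge_ngrams` and the gram lengths are made up for this example, and the field mapping uses the `analyzer` / `search_analyzer` keys of more recent Elasticsearch versions (older versions spell the index-time setting `index_analyzer`, as above), so check it against the version you run. The helper only mimics what an edge ngram token filter would emit for one token.

```python
import json


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Mimic an edge ngram filter: every leading substring of the token."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Hypothetical analysis settings: ngrams are produced at index time only.
analysis_settings = {
    "analysis": {
        "filter": {
            "prefix_grams": {"type": "edge_ngram", "min_gram": 1, "max_gram": 10}
        },
        "analyzer": {
            "prefix_index": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "prefix_grams"],
            }
        },
    }
}

# Hypothetical field mapping: index with ngrams, search with the plain analyzer.
category_id_mapping = {
    "type": "text",
    "analyzer": "prefix_index",
    "search_analyzer": "standard",
}

if __name__ == "__main__":
    print(edge_ngrams("13019"))
    print(json.dumps(analysis_settings, indent=2))
    print(json.dumps(category_id_mapping, indent=2))
```

With a mapping along these lines, a plain match query for "13" becomes an exact lookup of the term "13", so no expansion is involved and combining it with a filter should behave as expected.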

(Pedro Lopes) #5

Yeah, it later occurred to me that the strange behaviour would be explained if the filter was being applied after the max_expansions, where I had assumed it was applied before. Thanks for clarifying that.
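
To make that ordering concrete, here is a small Python simulation. It is a simplified model, not how Elasticsearch is implemented: the data is invented, and `expand_prefix` and `search` are hypothetical helpers that take the first `max_expansions` matching terms from the whole index in sorted order and only then apply the filter.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix, max_expansions):
    """Return up to max_expansions terms that start with prefix, in sorted order."""
    start = bisect_left(sorted_terms, prefix)
    expanded = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(expanded) >= max_expansions:
            break
        expanded.append(term)
    return expanded


def search(docs, prefix, max_expansions, system_id):
    """Expand the prefix against ALL terms first, then apply the filter."""
    all_terms = sorted({d["category_id"] for d in docs})
    expanded = set(expand_prefix(all_terms, prefix, max_expansions))
    return [
        d for d in docs
        if d["category_id"] in expanded
        and d["classification_system_id"] == system_id
    ]


# Invented data: 100 ids under system 2, and 10 ids under system 1
# that are spread out across the "13" range.
docs = [
    {"category_id": f"13{i:03d}", "classification_system_id": 2}
    for i in range(100)
]
docs += [
    {"category_id": f"13{i:03d}", "classification_system_id": 1}
    for i in range(19, 1000, 100)
]

if __name__ == "__main__":
    print(len(search(docs, "13", 20, 1)))   # expansion budget used up early
    print(len(search(docs, "13", 200, 1)))  # budget covers every "13" term
```

In this toy index the first 20 expansions are consumed almost entirely by terms that belong to the other classification system, which is why only a large max_expansions recovers all 10 filtered documents.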

This should probably be mentioned in the documentation of match_phrase_prefix, because the behaviour is unexpected if you don't know the inner workings of ES.
