Why does Elasticsearch return inconsistent results on version 7.8.1 but correct results on 5.6.7 for the same match_phrase_prefix query? How can I get consistent results on Elasticsearch 7.8.1?

Hi there,
I have observed inconsistent results for the match_phrase_prefix query below on Elasticsearch 7.8.1.

QUERY:

{
    "sort": [{
            "created": {
                "order": "desc"
            }
        }
    ],
    "size": 10000,
    "query": {
        "bool": {
            "should": [{
                    "match_phrase_prefix": {
                        "name": {
                            "query": "Trust",
                            "slop": 100,
                            "max_expansions": 50,
                            "boost": 1.0
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
        }
    }
}

RESULTS

ELS v5.6.7
8910 - should -starts with Trust | max_expansion:50 | CORRECT RESULT
991 - must_not -Trust | max_expansion:50 | CORRECT RESULT
9901 - total doc count

ELS v7.8.1
250 - should -Trust | max_expansion:50 | WRONG RESULT
9651 - must_not -Trust | max_expansion:50 | WRONG RESULT
9901 - total doc count

250 - should -Trust | max_expansion:500 | WRONG RESULT
7401 - must_not -Trust | max_expansion:500 | WRONG RESULT
9901 - total doc count

8910 - should -Trust | max_expansion:5000 | CORRECT RESULT
991 - must_not -Trust | max_expansion:5000 | CORRECT RESULT
9901 - total doc count

8910 - should -Trust | max_expansion:10000 | CORRECT RESULT
991 - must_not -Trust | max_expansion:10000 | CORRECT RESULT
9901 - total doc count

8910 - should -starts with Trust | max_expansion:50 | CORRECT RESULT | after removing suffix analyzer
991 - must_not -Trust | max_expansion:50 | CORRECT RESULT | after removing suffix analyzer
9901 - total doc count
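
As a quick arithmetic check on the counts above: every document is matched either by the should search or by the must_not search, so the two counts ought to add up to the total for each run. A minimal Python sketch, with the figures copied from the results listed above (the tuple layout and variable names are only for illustration):

```python
# (version, max_expansions, should_count, must_not_count), copied from the results above
runs = [
    ("5.6.7", 50, 8910, 991),
    ("7.8.1", 50, 250, 9651),
    ("7.8.1", 500, 250, 7401),
    ("7.8.1", 5000, 8910, 991),
    ("7.8.1", 10000, 8910, 991),
]
TOTAL_DOCS = 9901

# Collect the runs whose two counts do not add up to the total doc count.
mismatches = [
    (version, max_expansions, should + must_not)
    for version, max_expansions, should, must_not in runs
    if should + must_not != TOTAL_DOCS
]

for version, max_expansions, combined in mismatches:
    print(f"v{version} max_expansions={max_expansions}: {combined} != {TOTAL_DOCS}")
```

Only the max_expansion:500 run fails this check (250 + 7401 = 7651), so one of the two figures reported for that run may need to be re-checked.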

MAPPINGS on both versions are the same, as shown below:

mappings {
                    Test {
                        "properties" {
                            "fullText" {
                                "type" "text"
                            }
                            "name" {
                                "type" "text"
                                "store" true
                                "fielddata" true
                                "copy_to" "fullText"
                                "fields" {
                                    "raw" {
                                        "type" "keyword"
                                    }
                                    "suffix" {
                                        "type" "text"
                                        "analyzer" "suffix_match_analyzer"
                                    }
                                }
                            }
                            "created" { "type" "date" }
                            "dbId" {
                                "type" "long"
                                "copy_to" "fullText"
                            }
                            "empId" {
                                "type" "long"
                                "copy_to" "fullText"
                            }
                            "active" { "type" "boolean" }
                            "salary" { "type" "float" }
                            "category" { "type" "keyword" }
                            "sub-category" { "type" "keyword" }
                            "address" {
                                "type" "keyword"
                                "copy_to" "fullText"
                                "fields" {
                                    "raw" {
                                        "type" "ip"
                                        "ignore_malformed" "true"
                                    }
                                }
                            }
                        }
                    }
                }

NOTE: On v7.8.1, increasing max_expansions to a large enough value returns the correct result.

For the same query and mappings, I'm getting the correct result on v5.6.7.
How do I get this result on v7.8.1? Do I need to update the mappings or write a different query for v7.8.1? If so, why does it work on v5.6.7 and not on v7.8.1?

Thanks,
Animesh

Match phrase prefix is designed to help autocomplete someone typing "Trust f", perhaps by suggesting "Trust fund".
It does this by taking the last word in the string, finding up to N words in the index that begin with those characters, and combining each of those N words with the preceding words in the string. The first 5 words in the index that begin with f might be "fab", "fad", "far", "fast", "fat". This would produce searches for the phrases "Trust fab", "Trust fad", etc.
This is not very clever, because these word pairings may never have been used together. This is why the docs recommend looking at alternatives.
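
The expansion step described above can be sketched in a few lines of Python. This is a toy model, not Elasticsearch's implementation: the term dictionary is a made-up list, the function names are hypothetical, and the input is simply lowercased and split on whitespace.

```python
def expand_last_word(term_dictionary, prefix, max_expansions):
    """Return the first max_expansions terms (in sorted order) starting with prefix."""
    candidates = sorted(t for t in term_dictionary if t.startswith(prefix))
    return candidates[:max_expansions]


def candidate_phrases(text, term_dictionary, max_expansions):
    """Combine the leading words with each expansion of the last, partial word."""
    *head, last = text.lower().split()
    return [
        " ".join(head + [term])
        for term in expand_last_word(term_dictionary, last, max_expansions)
    ]


# Hypothetical terms present in the index.
terms = ["fund", "fat", "fast", "far", "fad", "fab", "trust"]

phrases = candidate_phrases("Trust f", terms, max_expansions=5)
print(phrases)
# ['trust fab', 'trust fad', 'trust far', 'trust fast', 'trust fat']
```

With a limit of 5, "fund" sorts after the first five "f" words in this toy dictionary, so "trust fund" is never tried. That is the kind of miss a low max_expansions can cause.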

As for why things have changed in your example:
When you search for "Trust" using this query, it is expanded into words matching "Trust*", so words like "trustworthy" and "trustworthiness".

The expansion happens separately on each shard, using only the words found in that shard. If you are using 5.6 with the default setting of 5 shards, each shard holds roughly one fifth of all the docs, so a max_expansions setting of 10 may be enough to discover all the "Trust*" words in a small shard.
In 7.x the default is a single shard, so all docs and all words are in the same shard. A max_expansions setting of 10 may not be enough to list all the "Trust*" words in this larger index.
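
To make the shard effect concrete, here is a small simulation in Python. It rests on simplifying assumptions that are not from the thread: every document holds exactly one unique term beginning with "trust", documents are spread round-robin across shards, and each shard expands the prefix to the first max_expansions terms in sorted order. All names are hypothetical.

```python
def expand_prefix(sorted_terms, prefix, max_expansions):
    """First max_expansions terms with the given prefix, in sorted order."""
    return [t for t in sorted_terms if t.startswith(prefix)][:max_expansions]


def matched_docs(doc_terms, shard_count, prefix, max_expansions):
    """Count docs matched when each shard expands the prefix independently."""
    shards = [[] for _ in range(shard_count)]
    for doc_id, term in enumerate(doc_terms):
        shards[doc_id % shard_count].append(term)  # round-robin routing (assumption)

    total = 0
    for shard in shards:
        expanded = set(expand_prefix(sorted(set(shard)), prefix, max_expansions))
        total += sum(1 for term in shard if term in expanded)
    return total


# 1000 docs, each with a unique single-token name such as "trust_0042".
names = [f"trust_{i:04d}" for i in range(1000)]

five_shards = matched_docs(names, shard_count=5, prefix="trust", max_expansions=50)
one_shard = matched_docs(names, shard_count=1, prefix="trust", max_expansions=50)
print(five_shards, one_shard)  # 250 50
```

In this model the same max_expansions value finds five times as many documents on five shards as on one, and either layout only returns all 1000 documents once max_expansions is at least the number of distinct "trust*" terms per shard.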

It's probably easier to start by describing what business problem you're trying to solve and then we can recommend how best to implement it.

Hi @Mark_Harwood,
Thanks for the explanation.

Let me explain our business Scenario/Issue.

We currently have an index on v7.8.1 with millions of documents in it, where the field "name" is of type "text" (please refer to the mappings given in the thread description).
We want to search on this field and get correct results every time.
For example, if a user searches (through the UI) for

case 1: "not name: Trust" (internally this is converted into a match_phrase_prefix query with must_not)

   It should return all the docs in which name does not start with Trust.

case 2: "name: Trust" (internally this is converted into a match_phrase_prefix query with should)

   It should return all the docs in which name starts with Trust.
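
For reference, the two internal queries described above can be sketched as request bodies like this. The function name and its defaults are hypothetical; the field name, the bool clauses, and the max_expansions parameter are the ones used in the query at the top of the thread.

```python
import json


def build_name_query(prefix, negate=False, max_expansions=50):
    """Build a bool query wrapping match_phrase_prefix on the "name" field.

    negate=False -> "name: Trust"     (should clause)
    negate=True  -> "not name: Trust" (must_not clause)
    """
    clause = {
        "match_phrase_prefix": {
            "name": {"query": prefix, "max_expansions": max_expansions}
        }
    }
    occur = "must_not" if negate else "should"
    return {"query": {"bool": {occur: [clause]}}}


starts_with = build_name_query("Trust")
not_starts_with = build_name_query("Trust", negate=True)

print(json.dumps(starts_with, indent=2))
print(json.dumps(not_starts_with, indent=2))
```

Both bodies depend on the same prefix expansion, so a max_expansions value that is too low makes the should search return too few documents and the must_not search return too many.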

Applied Solutions:

Case 1: PASSED
  Created Index: match-phase-prefix-test-v7-search-as-you-type
  Type of field "name": search_as_you_type
  Newly Inserted Doc Count: 10K to 1M

  Result: both searches, i.e. "name: Trust" and "not name: Trust", returned CORRECT doc counts with **max_expansions: 50**

Case 2: FAILED
  Created Index: match-phase-prefix-test-v7-text
  Type of field "name": text (with new data)
  Newly Inserted Doc Count: 10K to 0.1M

  Result: both searches, i.e. "name: Trust" and "not name: Trust", returned INCORRECT doc counts with **max_expansions: 50**

Case 3: FAILED (reindexing existing data from an index with field type text into a new index with field type search_as_you_type)
  Reindexed:
    {
        "source": {
            "index": "match-phase-prefix-test-v7-text-10K"
        },
        "dest": {
            "index": "match-phase-prefix-test-v7-reindex-search-as-you-type-10k"
        }
    }
  Reindexed Doc Count: 10K

  Result: both searches, i.e. "name: Trust" and "not name: Trust", returned INCORRECT doc counts with **max_expansions: 50**

Case 4: PASSED (reindexing existing data from an index with field type text into a new index with field type search_as_you_type)
  Reindexed:
    {
        "source": {
            "index": "match-phase-prefix-test-v7-text-10K"
        },
        "dest": {
            "index": "match-phase-prefix-test-v7-reindex-search-as-you-type-10k"
        }
    }
  Reindexed Doc Count: 10K

  Result: both searches, i.e. "name: Trust" and "not name: Trust", returned CORRECT doc counts with **max_expansions: 5000**

Conclusion:

The above searches return CORRECT results for a newly created index with newly inserted documents (~1M, with field "name" of type "search_as_you_type" and max_expansions=50).
But the searches return INCORRECT results with max_expansions=50 when I reindex the existing index (with field type "text") into a new index (with field type "search_as_you_type").

Please share your views on this.

Thanks,
Animesh

Ok. One point of clarification - can you give an example of the full name in a document you don’t want to match with the “not trust” example?
I just want to check you’re not trying to match whole words, in which case prefix matching is irrelevant and you should just be using plain “match” queries.

{
    "_index": "match-phase-prefix-test-v7-search-as-you-type-10",
    "_type": "_doc",
    "_id": "1",
    "_score": null,
    "_source": {
        "empId": 1,
        "subCategory": "PRODUCT-2",
        "address": "206.119.194.10",
        "created": "2021-12-14T06:09:37+0000",
        "dbId": 1,
        "name": "Trust_11_29_2021_DDDD_EEEE_FFFF-1",
        "active": false,
        "category": "ENG",
        "salary": 1.3774075508117676
    },
    "sort": [
        1639462177000
    ]
}

Please refer to the example above. I am just incrementing the last number in the field value, from
Trust_11_29_2021_DDDD_EEEE_FFFF-1 to Trust_11_29_2021_DDDD_EEEE_FFFF-N.

In our actual data, we have similar values, e.g. "Trust_Event", "Trust_Status", etc.
Could the issue be specific to a certain set of data?

Thanks for the examples. The text you provided doesn’t look like typical text, so the default analyzer, which is designed for everyday English, will likely not have tokenised the string into useful words (it normally relies on whitespace and punctuation to break strings into words).
Maybe a different analyzer configuration would help break this string into words that are easier and quicker to search. For instance, if you also tokenised on the _ character, the word “trust” would be something you could match directly without needing prefix searches. Much depends on how you want to search in general. The _analyze API should help you debug and select the most appropriate analyzer for this particular content.
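
As a starting point for that debugging, the sketch below builds request bodies for the _analyze API: one naming an analyzer explicitly, and two naming a field so that the field's configured analyzer is applied. The index name and sample value come from the example document earlier in the thread; the helper names are made up for illustration, and sending the requests is left to whichever client you use.

```python
import json

SAMPLE_NAME = "Trust_11_29_2021_DDDD_EEEE_FFFF-1"
INDEX = "match-phase-prefix-test-v7-search-as-you-type-10"


def analyze_with_analyzer(text, analyzer="standard"):
    """Body for POST /_analyze using a named analyzer."""
    return {"analyzer": analyzer, "text": text}


def analyze_with_field(text, field):
    """Body for POST /<index>/_analyze using the analyzer configured on a field."""
    return {"field": field, "text": text}


analyze_requests = {
    "POST /_analyze": analyze_with_analyzer(SAMPLE_NAME),
    f"POST /{INDEX}/_analyze (name)": analyze_with_field(SAMPLE_NAME, "name"),
    f"POST /{INDEX}/_analyze (name.suffix)": analyze_with_field(SAMPLE_NAME, "name.suffix"),
}

for endpoint, body in analyze_requests.items():
    print(endpoint)
    print(json.dumps(body, indent=2))
```

Comparing the tokens returned for "name" and "name.suffix" shows exactly what a prefix expansion has to work with on each field.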

Hi @Mark_Harwood,
We are not using the default analyzer; we are using a custom analyzer, "suffix_match_analyzer", as mentioned in the schema for the name field.

             "analysis" :{
                        "analyzer" :{
                            "suffix_match_analyzer" :{
                                "filter" :["reverse", "lowercase"],
                                "type" :"custom",
                                "tokenizer" :"keyword"
                            }
                        }
                    }

Could you please suggest whether we need to update this analyzer? We have both types of values in the name field:

  1. Values with an underscore (_)
  2. Values without an underscore

Only you can determine if the tokens placed in your index are useful to the range of searches you want to do.
Generally speaking, if the tokens are too short you might need to combine multiple search tokens using phrase or interval queries.
If the tokens are too long you may need to run expensive prefix or infix matches.
I don’t know what queries you want to run other than the “trust” example which is why this is an exercise for you to consider.
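
To see what kind of tokens the suffix_match_analyzer posted earlier in the thread produces, here is a rough Python emulation of its three steps (keyword tokenizer, reverse filter, lowercase filter). It is only an approximation for plain ASCII values, not Elasticsearch code, and the function names are hypothetical.

```python
def suffix_match_tokens(value):
    """Emulate: keyword tokenizer -> reverse filter -> lowercase filter."""
    token = value          # keyword tokenizer: the whole value is one token
    token = token[::-1]    # reverse filter
    token = token.lower()  # lowercase filter
    return [token]


def reversed_prefix_for_suffix(suffix):
    """An 'ends with <suffix>' search becomes a prefix search on the reversed token."""
    return suffix[::-1].lower()


tokens = suffix_match_tokens("Trust_Event")
print(tokens)  # ['tneve_tsurt']

prefix = reversed_prefix_for_suffix("_Event")
print(prefix, tokens[0].startswith(prefix))  # tneve_ True
```

Each value becomes one long token, which suits "ends with" lookups through a prefix match on the reversed text. A "starts with Trust" search gets no help from this token, because "trust" ends up at the tail of it.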
