Match_phrase not matching all terms

Reading from the Elasticsearch documentation:

"The match_phrase query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same positions relative to each other."

I have configured my analyzer to use edge_ngram with the keyword tokenizer:

{
  "index": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

Here is the Java class that is used for indexing:

@Document(indexName = "myindex", type = "program")
@Getter
@Setter
@Setting(settingPath = "/elasticsearch/settings.json")
public class Program {


    @org.springframework.data.annotation.Id
    private Long instanceId;

    @Field(analyzer = "autocomplete", searchAnalyzer = "autocomplete", type = FieldType.String)
    private String name;
}

If I have the phrase "hello world" in a document, the following query will match it:

{
  "match" : {
    "name" : {
      "query" : "ho",
      "type" : "phrase"
    }
  }
}
Result: "hello world"

That's not what I expect, because not all of the search terms are in the document.

My questions:

1- Shouldn't I get two search terms from the edge_ngram/autocomplete analyzer for the query "ho"? (The terms should be "h" and "ho" respectively.)

2- Why does "ho" match "hello world" when, according to the definition of the phrase query, not all of the terms matched? (The term "ho" shouldn't have matched.)


Update:
Just in case the question is not clear: the match_phrase query should analyze the query string into a list of terms, here "ho". Since this is an edge_ngram with min_gram 1, we get two terms: "h" and "ho". According to Elasticsearch, the document must contain all of the search terms. However, "hello world" contains "h" only and does not contain "ho", so why did I get a match here?

Version: Elasticsearch 2.x

You are using the edge_ngram token filter. Let's see how your analyzer treats your query string "ho". Assuming your index is called my_index:

GET my_index/_analyze
{
  "text": "ho",
  "analyzer": "autocomplete"
}

The response shows you that the output of your analyzer would be two tokens at position 0:

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "ho",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}

What does Elasticsearch do with a query for two tokens at the same position? It treats the query as an "OR", even if you use type "phrase". You can see that from the output of the validate API (which shows you the Lucene query that your query is rewritten into):

GET my_index/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}
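
The exact string depends on the Elasticsearch/Lucene version, but the explanation in the response should look roughly like the following: a phrase query in which h and ho are alternatives at the same position, so either token can satisfy that slot.

name:"(h ho)"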

Because both your query and your document have an h at position 0, the document is going to be a hit.
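
You can check the document side of this as well. Running the same analyzer over the indexed value "hello world" (the same _analyze call as above, just with the document text) should return the prefixes h, he, hel and so on, all the way up to the full string, and all of them at position 0, because the edge_ngram token filter does not increment positions:

GET my_index/_analyze
{
  "text": "hello world",
  "analyzer": "autocomplete"
}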

Now, how to solve this? Instead of the edge_ngram token filter, you could use the edge_ngram tokenizer. This tokenizer increments the position of every token it outputs.

So, if you create your index like this instead:

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
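
If you repeat the earlier _analyze request against this new index, you should see the same two tokens h and ho, but now at positions 0 and 1 instead of both at position 0, which is what changes the behaviour of the phrase query:

GET my_index/_analyze
{
  "text": "ho",
  "analyzer": "autocomplete"
}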

You will see that this query is no longer a hit:

GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}

But for example this one is:

GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "he",
        "type": "phrase"
      }
    }
  }
}
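
That makes sense if you look at how the new analyzer tokenizes the document text: "hello world" should now produce h at position 0, he at position 1, hel at position 2, and so on. The query "he" analyzes to h at position 0 followed by he at position 1, which lines up with the document, while "ho" analyzes to h followed by ho, and there is no ho token at all:

GET my_index/_analyze
{
  "text": "hello world",
  "analyzer": "autocomplete"
}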

@abdon
That was the perfect answer I was looking for, but is there any reason why the position doesn't get incremented? Also, shouldn't the end offset be 1 in this token?

{
  "token": "h",
  "start_offset": 0,
  "end_offset": 2,  // here it should be 1, right?
  "type": "word",
  "position": 0
}

I'm not a Lucene developer, but I think of it this way: a tokenizer creates a stream of tokens, and in this stream every token has its own position. A token filter, on the other hand, works on a specific token, and that token already has a position.

The end_offset represents the end of the original token in the original text. The original token ho has a length of 2.

@abdon

I was looking at examples of the end_offset and they are not the same as what you posted; can you please let me know what I'm missing?

The link shows the end offset increasing with the word length, and there are many examples there, so that's why I thought "h" should end at 1.

It again comes down to whether you use the edge_ngram tokenizer or edge_ngram token filter.

If you use the token filter, you will get the start and end offset of the original token:

GET _analyze
{
  "text": "ho",
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20
    }
  ]
}

If you use the tokenizer you will get a stream of tokens that each have their own start and end offset:

GET _analyze
{
  "text": "ho",
  "tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20
  }
}
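
For the filter version you should get the same output as earlier in the thread: h and ho, both with start_offset 0, end_offset 2 and position 0. For the tokenizer version, each gram should get its own offsets and position, roughly like this:

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "ho",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}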
