Edge ngram with phrase matching

I need to autocomplete phrases. For example, when I search "dementia in alz", I want to get "dementia in alzheimer's".

For this, I configured an edge ngram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body, but either way I get no results when trying to match a phrase.

What am I doing wrong?

My query:
{ "query":{ "multi_match":{ "query":"dementia in alz", "type":"phrase", "analyzer":"edge_ngram_analyzer", "fields":["_all"] } } }

My mappings:

...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
  ...

Analysis of the "dementia in alzheimer" phrase:
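This output can be reproduced with the _analyze API along these lines (my_index is a placeholder for the actual index name; on older Elasticsearch versions the analyzer and text are passed as query-string parameters rather than a JSON body):

GET /my_index/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "dementia in alzheimer"
}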

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "de", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 4, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 0, 
      "position": 2
    }, 
    {
      "end_offset": 5, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 0, 
      "position": 3
    }, 
    {
      "end_offset": 6, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 0, 
      "position": 4
    }, 
    {
      "end_offset": 7, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 0, 
      "position": 5
    }, 
    {
      "end_offset": 8, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 0, 
      "position": 6
    }, 
    {
      "end_offset": 9, 
      "token": "dementia ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 7
    }, 
    {
      "end_offset": 10, 
      "token": "dementia i", 
      "type": "word", 
      "start_offset": 0, 
      "position": 8
    }, 
    {
      "end_offset": 11, 
      "token": "dementia in", 
      "type": "word", 
      "start_offset": 0, 
      "position": 9
    }, 
    {
      "end_offset": 12, 
      "token": "dementia in ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 10
    }, 
    {
      "end_offset": 13, 
      "token": "dementia in a", 
      "type": "word", 
      "start_offset": 0, 
      "position": 11
    }, 
    {
      "end_offset": 14, 
      "token": "dementia in al", 
      "type": "word", 
      "start_offset": 0, 
      "position": 12
    }, 
    {
      "end_offset": 15, 
      "token": "dementia in alz", 
      "type": "word", 
      "start_offset": 0, 
      "position": 13
    }, 
    {
      "end_offset": 16, 
      "token": "dementia in alzh", 
      "type": "word", 
      "start_offset": 0, 
      "position": 14
    }, 
    {
      "end_offset": 17, 
      "token": "dementia in alzhe", 
      "type": "word", 
      "start_offset": 0, 
      "position": 15
    }, 
    {
      "end_offset": 18, 
      "token": "dementia in alzhei", 
      "type": "word", 
      "start_offset": 0, 
      "position": 16
    }, 
    {
      "end_offset": 19, 
      "token": "dementia in alzheim", 
      "type": "word", 
      "start_offset": 0, 
      "position": 17
    }, 
    {
      "end_offset": 20, 
      "token": "dementia in alzheime", 
      "type": "word", 
      "start_offset": 0, 
      "position": 18
    }, 
    {
      "end_offset": 21, 
      "token": "dementia in alzheimer", 
      "type": "word", 
      "start_offset": 0, 
      "position": 19
    }
  ]
}

Phrase match works by matching a sequence of tokens at consecutive positions. For example: "dementia" at position 0, "in" at position 1, and "alz" at position 2.
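You can see what the query side produces by running the query text through your search_analyzer (standard, per your mapping):

GET /_analyze
{
  "analyzer": "standard",
  "text": "dementia in alz"
}

This yields "dementia" at position 0, "in" at position 1, and "alz" at position 2.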

If you look at the results of your analysis, edge ngram tokenising has produced tokens that all have a start offset of zero, each sitting at its own position. One of those tokens is "dementia in alz", but it is a single token, not the three-token sequence the phrase match is looking for.

Try a query analyser consisting of the keyword tokenizer and lowercase token filter, and use a simple match query instead of a phrase match. Then "Dementia in Alz" will hopefully be transformed into a single token "dementia in alz", which in turn will match the corresponding edge ngram.


@davidbkemp I added the following to the existing analysis settings:

     "analyzer": {
        ...
        "keyword_analyzer": {
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
     }

And changed the query to:

{
  "query": {
    "multi_match": {
      "query": "dementia in alz",
      "analyzer": "keyword_analyzer",
      "fields": ["_all"]
    }
  }
}

No results were returned.

At this point I'd poke around with the _analyze API and compare the tokens produced by the analyzer your query uses against those produced by the one you've put on _all.
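For example (my_index stands in for your index name):

GET /my_index/_analyze
{
  "analyzer": "keyword_analyzer",
  "text": "dementia in alz"
}

GET /my_index/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "dementia in alzheimer's"
}

If the single token the first call produces doesn't show up among the tokens the second call produces for your indexed text, the query can't match.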

BTW, if you are only querying one field, I'd go with match instead of multi_match. multi_match is a fairly large thing and match is much easier to read and debug.
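For example, a plain match against the concrete field from your mapping (reusing your keyword_analyzer):

{
  "query": {
    "match": {
      "field": {
        "query": "dementia in alz",
        "analyzer": "keyword_analyzer"
      }
    }
  }
}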

I just noticed that you are querying the "_all" field. This probably won't work, as _all contains all the fields concatenated together into "one big string", so the edge ngrams are built from prefixes of that combined string rather than of each individual field value.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html

Guys, thank you very much for your responses. @davidbkemp, you were right in your first response.
You can see the solution on Stack Overflow.

Guys, the correct solution was finally found yesterday. Please look at my answer on Stack Overflow.