Edge ngram with phrase matching

trex · August 9, 2016, 10:12am

I need to autocomplete phrases. For example, when I search "dementia in alz", I want to get "dementia in alzheimer's".

For this, I configured an Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body. Nevertheless, I can't get any results when I try to match a phrase.

What am I doing wrong?

My query:
{ "query":{ "multi_match":{ "query":"dementia in alz", "type":"phrase", "analyzer":"edge_ngram_analyzer", "fields":["_all"] } } }

My mappings:

...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
  ...

Analysis of the "dementia in alzheimer" phrase:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "de", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 4, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 0, 
      "position": 2
    }, 
    {
      "end_offset": 5, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 0, 
      "position": 3
    }, 
    {
      "end_offset": 6, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 0, 
      "position": 4
    }, 
    {
      "end_offset": 7, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 0, 
      "position": 5
    }, 
    {
      "end_offset": 8, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 0, 
      "position": 6
    }, 
    {
      "end_offset": 9, 
      "token": "dementia ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 7
    }, 
    {
      "end_offset": 10, 
      "token": "dementia i", 
      "type": "word", 
      "start_offset": 0, 
      "position": 8
    }, 
    {
      "end_offset": 11, 
      "token": "dementia in", 
      "type": "word", 
      "start_offset": 0, 
      "position": 9
    }, 
    {
      "end_offset": 12, 
      "token": "dementia in ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 10
    }, 
    {
      "end_offset": 13, 
      "token": "dementia in a", 
      "type": "word", 
      "start_offset": 0, 
      "position": 11
    }, 
    {
      "end_offset": 14, 
      "token": "dementia in al", 
      "type": "word", 
      "start_offset": 0, 
      "position": 12
    }, 
    {
      "end_offset": 15, 
      "token": "dementia in alz", 
      "type": "word", 
      "start_offset": 0, 
      "position": 13
    }, 
    {
      "end_offset": 16, 
      "token": "dementia in alzh", 
      "type": "word", 
      "start_offset": 0, 
      "position": 14
    }, 
    {
      "end_offset": 17, 
      "token": "dementia in alzhe", 
      "type": "word", 
      "start_offset": 0, 
      "position": 15
    }, 
    {
      "end_offset": 18, 
      "token": "dementia in alzhei", 
      "type": "word", 
      "start_offset": 0, 
      "position": 16
    }, 
    {
      "end_offset": 19, 
      "token": "dementia in alzheim", 
      "type": "word", 
      "start_offset": 0, 
      "position": 17
    }, 
    {
      "end_offset": 20, 
      "token": "dementia in alzheime", 
      "type": "word", 
      "start_offset": 0, 
      "position": 18
    }, 
    {
      "end_offset": 21, 
      "token": "dementia in alzheimer", 
      "type": "word", 
      "start_offset": 0, 
      "position": 19
    }
  ]
}

davidbkemp · August 9, 2016, 12:41pm

Phrase match is for matching a sequence of tokens at consecutive positions. For example: "dementia" at position 0, "in" at 1, and "alz" at 2.

If you look at the results of your analysis, edge ngram tokenising has not produced separate word tokens: every token has a start offset of zero and is a prefix of the whole string. One of those tokens is "dementia in alz", but it is a single token.

Try a query analyser consisting of the keyword tokenizer and lowercase token filter, and use a simple match query instead of a phrase match. Then "Dementia in Alz" will hopefully be transformed into a single token "dementia in alz", which in turn will match the corresponding edge ngram.
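
To make the point above concrete, here is a rough pure-Python approximation of the three analyzers involved. It is only a sketch: the regexes stand in for the real tokenizers, the function names are made up for illustration, and nothing here talks to Elasticsearch. The min_gram/max_gram values and the token_chars classes are the ones from the posted settings.

```python
import re

MIN_GRAM, MAX_GRAM = 2, 25  # values from the posted tokenizer settings


def edge_ngram_analyze(text):
    """Approximate the posted edge_ngram_analyzer: split on characters that
    are not letters, digits or whitespace, emit prefixes, then lowercase."""
    tokens = []
    # token_chars = letter, digit, whitespace: any other character ends a run.
    for run in re.findall(r"(?:[^\W_]|\s)+", text):
        for size in range(MIN_GRAM, min(MAX_GRAM, len(run)) + 1):
            tokens.append(run[:size].lower())
    return tokens


def keyword_lowercase_analyze(text):
    """keyword tokenizer + lowercase filter: the whole input is one token."""
    return [text.lower()]


def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


indexed = set(edge_ngram_analyze("Dementia in Alzheimer's"))

# The single keyword token is one of the indexed prefixes.
print(keyword_lowercase_analyze("Dementia in Alz")[0] in indexed)

# Word-by-word tokens mostly are not: "in" and "alz" were never indexed.
print([t in indexed for t in standard_like_analyze("dementia in alz")])
```

Under these assumptions the keyword-style query token lines up with an indexed edge ngram, while the word tokens produced by a standard-style analyzer do not, which is why a plain match with a keyword + lowercase search analyzer is the suggestion here.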

trex · August 9, 2016, 2:22pm

@davidbkemp I added the following to the existing analysis settings:

     "analyzer": {
        ...
        "keyword_analyzer": {
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
     }

And changed the query to:
{'query': {'multi_match': {'query': 'dementia in alz', 'analyzer': 'keyword_analyzer', 'fields': ['_all']}}}

No results were returned.

nik9000 · August 9, 2016, 5:11pm

At this point I'd go poke around the _analyze API and compare the tokens that the analyzer your query uses vs the one you've put on _all.
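
A small sketch of that comparison in Python, assuming you have already fetched the two _analyze responses yourself. The response shape is the one shown in the analysis output earlier in the thread; the sample dictionaries are abbreviated by hand rather than real server output, and the helper names are hypothetical.

```python
def token_strings(analyze_response):
    """Pull the token texts out of an _analyze response body."""
    return [entry["token"] for entry in analyze_response["tokens"]]


def unmatched_query_tokens(query_response, index_response):
    """Return query-side tokens that the index-side analyzer never produced."""
    indexed = set(token_strings(index_response))
    return [t for t in token_strings(query_response) if t not in indexed]


# Hand-abbreviated samples in the shape shown earlier in the thread.
index_side = {"tokens": [{"token": "dementia"}, {"token": "dementia in alz"}]}
query_standard = {"tokens": [{"token": "dementia"}, {"token": "in"}, {"token": "alz"}]}
query_keyword = {"tokens": [{"token": "dementia in alz"}]}

print(unmatched_query_tokens(query_standard, index_side))  # ['in', 'alz']
print(unmatched_query_tokens(query_keyword, index_side))   # []
```

Any token left in the first list is one the query asks for but the index never stored, which is usually where to start looking.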

BTW, if you are only querying one field, I'd go with match instead of multi_match. multi_match is a fairly large thing and match is much easier to read and debug.
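
For reference, a single-field match body might look like the sketch below, written as a Python dict in the same style as the query posted above. The helper name is made up; "field" is the property name from the posted mappings and "keyword_analyzer" is the analyzer added earlier in the thread, so adjust both to whatever you actually query.

```python
import json


def build_match_query(field, text, analyzer):
    """Build a match query body for one field with an explicit search analyzer."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "analyzer": analyzer,
                }
            }
        }
    }


body = build_match_query("field", "dementia in alz", "keyword_analyzer")
print(json.dumps(body, indent=2))
```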

davidbkemp · August 9, 2016, 10:14pm

I just noticed that you are querying the "_all" field. This probably won't work, as it will contain all the fields concatenated together into "one big string":
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
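
A rough illustration of why that matters for edge ngrams, in plain Python. It assumes that "_all" is analyzed as the field values joined by spaces (my reading of the page above, not something verified against your index), and the document, field names and values are made up. The tokenizer is only approximated with a regex, using the min_gram/max_gram and token_chars from the posted settings.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=25):
    """Prefixes of each run of letters, digits and whitespace, lowercased."""
    grams = set()
    for run in re.findall(r"(?:[^\W_]|\s)+", text):
        for size in range(min_gram, min(max_gram, len(run)) + 1):
            grams.add(run[:size].lower())
    return grams


# Hypothetical document: the field names and values are made up.
doc = {"id": "42", "title": "Dementia in Alzheimer disease"}

# Assumption: _all sees the values joined by spaces as one string.
all_text = " ".join(doc.values())

# On the field itself, the phrase prefix is one of the indexed grams.
print("dementia in alz" in edge_ngrams(doc["title"]))

# On the concatenated string, grams start at "42 ..." instead.
print("dementia in alz" in edge_ngrams(all_text))
```

If that assumption holds, the prefixes on "_all" begin wherever the concatenated string begins, so a phrase prefix only matches when its field happens to come first.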

trex · August 10, 2016, 7:29am

Guys, thank you very much for your responses. @davidbkemp you were right in your first response.
You can see the solution on Stack Overflow.

trex · August 11, 2016, 10:38am

Guys, the correct solution was finally found yesterday. Please look at my answer on Stack Overflow.
