How to prioritize exact match using nGram tokenizer? Solved, see solution

buka_bidzina · June 4, 2020, 5:30pm

Hello all,
I am wondering how to rank an exact match higher than the ngram matches.

For instance:
If I search for asus, casual scores higher than asus. If I search for app, the first hit is laptop and apple only comes after it.

Settings

{
  "settings": {
    "index":{
      "max_ngram_diff": 20
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type":"custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      }
    }
  }
}

Mapping

{
  "properties": {
    "analyzedName": {
      "type": "nested",
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

Query

{
    "query":  {
        "nested" : {
            "path" : "analyzedName",
            "query" : {
                "bool" : {
                    "must" : [
                    { "match" : {"analyzedName.text" : "app"} }
                    ]
                }
            }
        }
    }
}
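
One way to see why partial matches score at all is to inspect the tokens the analyzer emits. A minimal sketch, assuming the settings above were applied to an index named index (a placeholder; this post does not give the index name):

```console
GET index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "casual"
}
```

With min_gram set to 2, the output should include as, su and asu. These are also grams of the query asus, so a document containing casual matches that query on several terms.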

dadoonet · June 4, 2020, 7:30pm

Have a look at this gist. It demonstrates this:

https://gist.github.com/dadoonet/5179ee72ecbf08f12f53d4bda1b76bab

search_kibana_console.txt

### REINIT
DELETE user
PUT user
{
  "settings": {
    "number_of_shards": 1
  }, 
  "mappings": {
    "_doc": {
      "properties": {

(The preview is truncated here; the full file is in the gist.)

buka_bidzina · June 4, 2020, 8:45pm

Hello, can you tell me exactly which query you mean? The gist contains many queries.

dadoonet · June 4, 2020, 9:48pm

Sure: https://gist.github.com/dadoonet/5179ee72ecbf08f12f53d4bda1b76bab#file-search_kibana_console-txt-L362-L457

buka_bidzina · June 5, 2020, 10:33am

Query

GET index/_search
{
   "from":0,
   "query":{
      "bool":{
         "should":[
            {
               "nested":{
                  "path":"analyzedName",
                  "query":{
                     "match":{
                        "analyzedName.text":{
                           "query":"top"
                        }
                     }
                  }
               }
            }
         ]
      }
   },
   "size":10
}

So when I search for the word top, the first hit is Laptop and the second is Top basic young ...
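
To check where each hit's score comes from, the same request can be run with explain enabled. This is only a debugging sketch; it reuses the index and field names from the query above and limits the output to two hits:

```console
GET index/_search
{
  "explain": true,
  "size": 2,
  "query": {
    "nested": {
      "path": "analyzedName",
      "query": {
        "match": {
          "analyzedName.text": {
            "query": "top"
          }
        }
      }
    }
  }
}
```

Each hit should then carry an _explanation section that breaks its score down, which may help show why Laptop ends up ahead of Top basic young ...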

Mark_Harwood · June 5, 2020, 11:07am

You can use multi-fields: map one field with ngrams and the other as whole terms, then search across both.
Scoring of matching clauses is normally a the-more-the-merrier approach, so a document that matches on both fields should rank above one that matches only on the ngram field.
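
A minimal sketch of that suggestion, assuming a hypothetical index products with a non-nested name field (both names are made up for illustration) and an ngram analyzer like the one in the first post, with a lowercase filter added so both fields treat case the same way. The name field is analyzed as whole terms and its name.ngram sub-field as ngrams:

```console
PUT products
{
  "settings": {
    "index": {
      "max_ngram_diff": 20
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
  }
}

GET products/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "top" } },
        { "match": { "name.ngram": "top" } }
      ]
    }
  }
}
```

A document whose name contains the whole word top matches both should clauses, so it tends to score above a document such as Laptop that matches only on name.ngram. Adding a boost to the name clause can widen that gap further.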

buka_bidzina · June 5, 2020, 11:18am

This is what I am going to do.

The settings will look like this:

PUT index
{
  "settings": {
    "index": {
      "max_ngram_diff": 50
    },
    "analysis": {
      "filter": {
        "custom_shingle": {
          "max_shingle_size": "2",
          "min_shingle_size": "2",
          "output_unigrams": true,
          "type": "shingle"
        },
        "my_char_filter": {
          "pattern": " ",
          "type": "pattern_replace",
          "replacement": ""
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "nGram_tokenizer"
        },
        "bigram_analyzer": {
          "filter": [
            "lowercase",
            "custom_shingle",
            "my_char_filter"
          ],
          "tokenizer": "standard"
        }
      },
      "tokenizer": {
        "nGram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

and mapping

PUT index/_mapping
{
  "properties": {
    "analyzedName": {
      "type": "nested",
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "bigram_analyzer"
        }
      }
    },
    "ngramAnalyzedName": {
      "type": "nested",
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "nGram_analyzer"
        }
      }
    }
  }
}

Search
When searching, I am going to boost the analyzedName field over the ngramAnalyzedName field. What would you say?

GET index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "analyzedName",
            "query": {
              "match": {
                "analyzedName.text": {
                  "query": "iphone",
                  "boost": 3
                }
              }
            }
          }
        },
        {
          "nested": {
            "path": "ngramAnalyzedName",
            "query": {
              "match": {
                "ngramAnalyzedName.text": {
                  "query": "iphone",
                  "boost": 1
                }
              }
            }
          }
        }
      ]
    }
  }
}
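
To check what the whole-term field actually indexes, the _analyze API can be pointed at bigram_analyzer. A sketch, assuming the settings above are in place on index; the sample text is made up:

```console
GET index/_analyze
{
  "analyzer": "bigram_analyzer",
  "text": "Top basic"
}
```

If the filters behave as configured, the tokens should be along the lines of top, topbasic and basic: the shingle filter adds the two-word shingle next to the single words, and my_char_filter strips the space from it. Because top is indexed as a whole token here, a search for top matches analyzedName.text exactly, while Laptop can only match through ngramAnalyzedName.text.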

buka_bidzina · June 5, 2020, 11:56am

For anyone looking for a solution: this worked for me.

Mark_Harwood · June 5, 2020, 11:58am

Is the nested type necessary here?

buka_bidzina · June 5, 2020, 1:03pm

I have multiple languages, so yes.

buka_bidzina · June 11, 2020, 12:16pm

PUT index
{
  "settings": {
    "index": {
      "max_ngram_diff": 50
    },
    "analysis": {
      "filter": {
        "custom_shingle": {
          "max_shingle_size": "2",
          "min_shingle_size": "2",
          "output_unigrams": true,
          "type": "shingle"
        },
        "my_char_filter": {
          "pattern": " ",
          "type": "pattern_replace",
          "replacement": ""
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "nGram_tokenizer"
        },
        "bigram_analyzer": {
          "filter": [
            "lowercase",
            "custom_shingle",
            "my_char_filter"
          ],
          "tokenizer": "standard"
        }
      },
      "tokenizer": {
        "nGram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

system · July 9, 2020, 12:16pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
