Partial Match vs Exact Match Scoring with Ngrams


(Michael La Colla) #1

I have a straightforward question. I have incorporated ngrams for partial matching in 2.x. The implementation works well, but the scores aren't what I hoped for. I would like my score results to look something like this:

Ke: .1
Kev: .2
Kevi: .3
Kevin: .4

Instead I am getting the following results, where the score is the same whenever the field matches:

Ke: .4
Kev: .4
Kevi: .4
Kevin: .4

Note: I originally asked this question on StackOverflow. The answer there was that switching from an ngram filter to an ngram tokenizer is a solution for version 1.7.x, because the scores of partial matches are then compounded. I am looking for a solution for ES 2.x. Thanks!
link: http://stackoverflow.com/questions/34618680/elasticsearch-scoring-with-ngrams/34625846#34625846

Settings:

 settings: {
    analysis: {
      filter: {
        ngram_filter: {
          type: 'edge_ngram',
          min_gram: 2,
          max_gram: 15
        }
      },
      analyzer: {
        ngram_analyzer: {
          type: 'custom',
          tokenizer: 'standard',
          filter: [
            'lowercase',
            'ngram_filter'
          ]
        }
      }
    }
  }

Mappings:

mappings: [{
  name: 'voter',
  _all: {
    type: 'string',
    analyzer: 'ngram_analyzer',
    search_analyzer: 'standard'
  },
  properties: {
    last: {
      type: 'string',
      required: true,
      include_in_all: true,
      analyzer: 'ngram_analyzer',
      search_analyzer: 'standard'
    },
    first: {
      type: 'string',
      required: true,
      include_in_all: true,
      analyzer: 'ngram_analyzer',
      search_analyzer: 'standard'
    }
  }
}]

Query:

GET /user/_search
{
    "query": {
        "match": {
           "_all": {
               "query": "Ke",
               "operator": "and"

           }
        }
    }
}

In addition, as a Part B question: when I add fuzziness to the query, the partial-match scores increase as I originally hoped, but the exact match actually returns a lower score than a partial match. Why is this?

GET /user/_search
{
    "query": {
        "match": {
           "_all": {
               "query": "Ke",
               "operator": "and",
               "fuzziness": 1

           }
        }
    }
}

--Results--
Ke: 0.20584314
Kev: 0.25537512
Kevi: 0.26172742
Kevin: 0.21479698 <-- WHY?!


(Igor Motov) #2

You are using the standard analyzer for searching, so you are always searching for a single term. Let's say, for simplicity's sake, that we have a single record with the first name "Kevin" and no last name. This record is indexed with the terms "ke", "kev", "kevi" and "kevin". Let's also assume that you are searching for "Kev". After going through the analysis process, it gets translated into a search for the single term "kev". It will find your record because this single term exists in the record. Because only one term matches, the score is calculated from TF=1 and some IDF X. Exactly the same calculation applies to all the other searches. It doesn't matter whether you are searching for "Ke" or "Kevin": you will always get a single matching term, and because you don't have any other records in the index, all your terms will have the same IDF and therefore the same score.

Now, if you start adding more names, the IDF values of some terms will start to change. Basically, it will penalize matches on more frequently occurring prefixes and boost matches on less frequently occurring prefixes. This is what you are observing in the example at the end of the page.
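
To make the single-term argument concrete, here is a minimal JavaScript sketch. It only emulates what the `ngram_filter` from the settings above would emit for one token; it is not Elasticsearch code, and `edgeNgrams` and `matchingTerms` are hypothetical helper names made up for this illustration.

```javascript
// Emulates an edge_ngram token filter (min_gram 2, max_gram 15)
// applied after the lowercase filter. Illustration only.
function edgeNgrams(token, minGram = 2, maxGram = 15) {
  const lower = token.toLowerCase();
  const grams = [];
  for (let n = minGram; n <= Math.min(maxGram, lower.length); n++) {
    grams.push(lower.slice(0, n)); // prefix of length n
  }
  return grams;
}

// With the standard search analyzer, a one-word query becomes a single
// lowercased term, so at most one indexed term can match it.
function matchingTerms(indexedTerms, queryText) {
  const queryTerm = queryText.toLowerCase();
  return indexedTerms.filter((term) => term === queryTerm);
}

const indexed = edgeNgrams("Kevin"); // ["ke", "kev", "kevi", "kevin"]
for (const query of ["Ke", "Kev", "Kevi", "Kevin"]) {
  // Each query matches exactly one indexed term, whatever its length.
  console.log(query, matchingTerms(indexed, query).length);
}
```

Every query hits exactly one term with TF=1, which is why the length of the query has no influence on the score by itself.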

What you expected to see is a longer match producing a higher score. But this is only possible if you somehow add the length of the term you are looking for into the scoring formula, which you can easily do through custom scoring. You know the length of the term, so you can just add a constant boost to your match query based on the length of the prefix and ignore the score of the match query completely. Alternatively, you can make a longer term produce more matches by using the ngram analyzer as your search analyzer as well. The latter solution is not very practical, though, since for large queries it will produce a lot of unnecessary terms and slow down searches.
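
One possible reading of the constant-boost suggestion, sketched in JavaScript on the client side: wrap the match query in a `constant_score` query and set its boost from the length of the search text. The helper name `buildLengthBoostedQuery` and the `boostPerChar` parameter are assumptions made for this sketch, and I have not checked how the boost shows up in `_score` on a 2.x cluster, since query normalization may rescale it.

```javascript
// Hypothetical helper: builds a search body in which the match query only
// selects documents, and the boost depends on the length of the query text.
function buildLengthBoostedQuery(text, boostPerChar = 1) {
  const term = text.trim();
  return {
    query: {
      constant_score: {
        // The match query's own TF/IDF score is discarded here.
        filter: {
          match: { _all: { query: term, operator: "and" } }
        },
        // Assumption: a boost that grows linearly with the prefix length.
        boost: term.length * boostPerChar
      }
    }
  };
}

// Print the request body that would be sent to GET /user/_search.
console.log(JSON.stringify(buildLengthBoostedQuery("Kev"), null, 2));
```

With the default `boostPerChar`, "Ke" gets a boost of 2 and "Kevin" a boost of 5, so the boost rises with each extra character typed.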

