'Intelligently' cutting out results

Hi,
let's say I have products with serial numbers, and the format is FOO-XXXXX, where FOO is always there and XXXXX are digits.

When I search for FOO-12345, the results are showing all products, for example (score in brackets):

FOO-12345 (3)
FOO-12344 (2.5)
FOO-42353 (0.01)
FOO-XXXX (0.01)

Basically, all serial numbers where only FOO is matched return a low score.

So basically, this data can be grouped into 2 distinct clusters by score. Sadly, I don't know the proper terminology, but one cluster is wider and close to 3, and the second has very similar scores close to 0.01. Is there a way to instruct Elasticsearch to return only the first cluster in such a case?
I am happy to do all the reading, as well as learn/relearn the required math, so all I am asking for are good reads you can point me to.

Thanks

Hi Sławosz

Scores are computed from a number of factors, some of which vary over time as more content is added to the index. For this reason we don't suggest reading too much into what the scores mean (i.e. a score of 1 doesn't mean "perfection").

That said, if you want to consider the entire range of scores produced by a query and look at their distribution, the percentiles aggregation can be used to help draw that curve:

GET /MY_INDEX/_search
{
  "query": {
     ... MY QUERY HERE ...
  },
  "aggs": {
    "scoreDistribution": {
      "percentiles": {
        "script": "_score"        
      }
    }
  }
}

The results in my test query here look like this:

  "aggregations" : {
    "scoreDistribution" : {
      "values" : {
        "1.0" : 5.383362350463867,
        "5.0" : 6.5666823387146,
        "25.0" : 7.337974548339844,
        "50.0" : 8.046831130981445,
        "75.0" : 9.65035629272461,
        "95.0" : 11.354881286621094,
        "99.0" : 14.382296962738014
      }
    }
  }

Thanks Mark,
it is indeed very helpful. It's a step in a very good direction towards a potential solution.
I got something like this:

"aggregations" : {
    "scoreDistribution" : {
      "values" : {
        "1.0" : 0.015267470851540565,
        "5.0" : 0.015267470851540565,
        "25.0" : 0.015267470851540565,
        "50.0" : 0.015267470851540565,
        "75.0" : 0.015267470851540565,
        "95.0" : 2.449463472701605,
        "99.0" : 3.1063098907470703
      }
    }
  }

As you can see, most of the results have a poor score, and there is a huge gap. Could you recommend a method to detect this gap (I am not afraid of math)?
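
For illustration only, one simple way such a gap could be detected on the client side is to sort the returned scores and cut at the largest relative drop between neighbours. This is a minimal sketch and not an Elasticsearch feature: the function name cut_at_largest_gap and the min_ratio threshold are made up for this example, and the threshold would need tuning against real data.

```python
def cut_at_largest_gap(scores, min_ratio=2.0):
    """Keep only the scores above the largest relative drop.

    Hypothetical helper: `min_ratio` is an assumed threshold; if no
    drop between neighbouring scores exceeds it, nothing is cut.
    """
    ranked = sorted(scores, reverse=True)
    best_index, best_ratio = None, min_ratio
    for i in range(len(ranked) - 1):
        upper, lower = ranked[i], ranked[i + 1]
        if lower <= 0:
            # Avoid dividing by zero; a drop to zero counts as an infinite gap.
            ratio = float("inf") if upper > 0 else 1.0
        else:
            ratio = upper / lower
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    if best_index is None:
        return ranked  # no clear gap, keep everything
    return ranked[: best_index + 1]


# Scores from the example in the first post.
top_cluster = cut_at_largest_gap([3, 2.5, 0.01, 0.01])
print(top_cluster)
```

Because scores are not comparable between queries, a relative drop is used here rather than a fixed score cut-off.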

Perhaps a much simpler approach is to make all query terms required using the AND operator. In your query example, that would turn the search into FOO AND 12345 as opposed to the default FOO OR 12345.
The details can depend on which query type and index mapping you are using.
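
As a rough sketch of the AND approach, the request body for a match query with the operator set to "and" could be built like this. The field name serial_number and the helper build_and_match_query are assumptions made for the example, and it only behaves as described if the field's analyzer splits FOO-12345 into the two terms FOO and 12345.

```python
import json


def build_and_match_query(field, text):
    """Build a search body where every analyzed term of `text` must match.

    Hypothetical helper; `field` is whatever field holds the serial number.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "operator": "and",  # the default is "or"
                }
            }
        }
    }


# "serial_number" is an assumed field name, not taken from the thread.
body = build_and_match_query("serial_number", "FOO-12345")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of GET /MY_INDEX/_search in place of the query placeholder used earlier.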
