Limited score precision for a big score hierarchy

Grisha · August 14, 2023, 2:06pm

Hi, I'm trying to implement a kind of score hierarchy using different boosts for different fields (there are multiple fields and multiple types of search: full match, fuzzy, etc.).

Simplified example of boosts:

field_1 fuzzy boost - 1
field_1 full match boost - 10
field_2 fuzzy boost - 100
field_2 full match boost - 1000
...

Each level of the score hierarchy is 10 times greater than the previous one.
The factor of 10 is an artificial limit (10 matched tokens of level 2 will have the same score as 1 matched token of level 1).

Since scores are positive 32-bit floating point numbers, such a score hierarchy cannot be implemented because of the loss of precision.

Is there any way to implement such a score hierarchy?

Thank you in advance.

UPD: Scores from different levels should be summed correctly (without losing precision).

dadoonet · August 14, 2023, 2:32pm

Welcome!

Maybe not exactly what you are asking for, but this could give you an idea:

https://gist.github.com/dadoonet/5179ee72ecbf08f12f53d4bda1b76bab#file-search_kibana_console-txt-L472-L567

search_kibana_console.txt

### REINIT
DELETE user
PUT user
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "comments": {

(Preview truncated; see the gist linked above for the full file.)

HTH

Grisha · August 14, 2023, 2:47pm

If I understood correctly, the idea is to use multiple search requests.
In that case, it will not be possible to get a sum of all boosts from the different fields and search types.

Sorry, I should have mentioned this in the description.

dadoonet · August 14, 2023, 4:18pm

I think this is wrong.
It will add up all the scores.

Grisha · August 15, 2023, 7:27am

Maybe I didn't get the idea.

As far as I understand:

With a single request, precision is lost. There are more than 50 score hierarchy levels, and each level is 10 times greater than the previous one. For instance, a score of 100 000 000 plus a score of 1 is still 100 000 000.
With multiple requests, it is not feasible to sum the boosts from all the requests for the same documents, because there are too many documents (from a performance point of view).

@dadoonet Could you please correct me if I am wrong, and explain how to implement such a score hierarchy?

dadoonet · August 15, 2023, 9:10am

Why would you want such big numbers as scores?

0.1, 1, 2, 5, 10 are already a good way to have ordered results, no?

What's wrong with the example I gave?

Grisha · August 16, 2023, 6:29am

> Why would you want such big numbers as scores?

To keep a strict hierarchy: one token on level 1 (matched by field_1) should have a greater score than any number of tokens on level 2 (matched by field_2).

Since I don't know how to implement the strict hierarchy, I use such big numbers as scores.

dadoonet · August 16, 2023, 7:21am

So I think you need to use some custom scripts here. See Function score query | Elasticsearch Guide [8.11] | Elastic

And with a Script, you will probably be able to do what you need: Function score query | Elasticsearch Guide [8.11] | Elastic

Not sure if that will be possible and fast enough though.

What is the use case? I think I have never heard about such a use case and I'm wondering if you are trying to solve a problem the right way...

Grisha · August 16, 2023, 8:24am

> And with a Script, you will probably be able to do what you need: Function score query | Elasticsearch Guide [8.9] | Elastic
>
> Not sure if that will be possible and fast enough though.

I thought about a custom script to normalize scores on each level of the hierarchy, but I didn't do that for performance reasons. Normalization would have some precision problems too.

> What is the use case? I think I have never heard about such a use case and I'm wondering if you are trying to solve a problem the right way...

I'm trying to implement address autocomplete.
Each address has multiple fields: country, region, city, street, etc.
Using the score hierarchy, I'm trying to achieve more relevant results.

For instance, one token matched in the street field should have a greater score than 10 (ideally any number of) tokens matched in the country, city or region fields (in one of these fields or any combination of them).

A correct sum of scores from different hierarchy levels is also important.
For instance, a document with one token matched in the street field and one token matched in the city field should have a greater score than a document with one token matched in the street field only.
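
The ordering described here is a lexicographic comparison: compare the street matches first, and look at the lower levels only to break a tie. The sketch below illustrates that target ordering in plain Python, outside Elasticsearch. The field names, the per-field match counts and the `hierarchy_key` helper are all hypothetical.

```python
# Hierarchy levels from most to least important (hypothetical field names).
LEVELS = ["street", "city", "region", "country"]

def hierarchy_key(match_counts: dict) -> tuple:
    """Build a tuple that Python compares level by level, left to right."""
    return tuple(match_counts.get(level, 0) for level in LEVELS)

docs = {
    "street_only": {"street": 1},
    "street_and_city": {"street": 1, "city": 1},
    "many_lower_levels": {"city": 10, "region": 10, "country": 10},
}

# Highest key first: street matches dominate, lower levels break ties.
ranked = sorted(docs, key=lambda name: hierarchy_key(docs[name]), reverse=True)
print(ranked)
```

The comparison never folds the levels into one number, so it is not affected by floating-point rounding. It only shows the desired result order; it is not a way to compute a single `_score`.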

dadoonet · August 16, 2023, 9:01am

For one similar use case (a demo), I did:

https://github.com/dadoonet/bano-elastic/blob/master/script.txt#L86C1-L100C2

GET bano-*/_search?track_total_hits=true
{
  "size": 1,
  "query": {
    "multi_match": {
      "query": "6 allée des myrtilles cergy",
      "fields": [
        "address.city",
        "address.street_name",
        "address.number"
      ],
      "type": "most_fields"
    }
  }
}

I'd go for something like this first before trying something more complex.
Maybe I'd boost the street, but in any case you should not end up with a score of 100 000 000...

Grisha · August 16, 2023, 3:06pm

If I understand this demo correctly, it will not work as expected from the hierarchy point of view:
a document with multiple tokens matched in the city field will have a greater score than a document with one token matched in the street_name field.

dadoonet · August 16, 2023, 4:18pm

Could you reproduce the problem you think can happen with some actual data and share that as a Kibana Dev Tools script?

Grisha · September 7, 2023, 7:39am

Hi @dadoonet, sorry for the delay.
I've reproduced the problem.
In the example below, both addresses get the same score (1.0) for a search by StreetToken1.
I want the address with StreetToken1 in the street name to come first, with a greater score than the second address, which has StreetToken1 in the city name.

P.S. I use boolean similarity to get more relevant results.

Example

# create index
PUT /adress_test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "street_name": { 
        "type": "text",
        "similarity": "boolean"
      },
      "city_name": { 
        "type": "text",
        "similarity": "boolean"
      }
    }
  }
}

# index addresses
POST /adress_test/_doc
{
  "street_name": "StreetToken1",
  "city_name": "CityToken1"
}

POST /adress_test/_doc
{
  "street_name": "StreetToken2",
  "city_name": "CityToken2 StreetToken1"
}

# search addresses
GET /adress_test/_search?track_total_hits=false
{
  "explain": false,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "StreetToken1",
      "fields": [
        "street_name",
        "city_name"
      ],
      "type": "most_fields"
    }
  }
}
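
Why both documents end up with 1.0 can be seen with a small model in plain Python. It is a sketch under assumptions, not Elasticsearch code: it assumes that with `boolean` similarity each matching term contributes a score equal to its boost (1.0 by default), that `most_fields` adds up the per-field scores, and that analysis amounts to lowercasing and splitting on whitespace. The `tokens` and `score` helpers are hypothetical.

```python
def tokens(text: str) -> set:
    """Very rough stand-in for the standard analyzer: lowercase, split on spaces."""
    return set(text.lower().split())

def score(doc: dict, query: str, fields: list) -> float:
    """Add 1.0 for every (field, query term) pair that matches."""
    total = 0.0
    for field in fields:
        total += len(tokens(query) & tokens(doc[field]))
    return total

doc1 = {"street_name": "StreetToken1", "city_name": "CityToken1"}
doc2 = {"street_name": "StreetToken2", "city_name": "CityToken2 StreetToken1"}

# One match in street_name and one match in city_name score the same.
for doc in (doc1, doc2):
    print(score(doc, "StreetToken1", ["street_name", "city_name"]))
```

In this model the field that matched plays no role unless the fields carry different boosts.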

dadoonet · September 7, 2023, 10:19am

I think that with much more volume this will work automatically, because StreetToken1 as a city_name will be much more frequent and thus less relevant for the city_name field than for the street_name field.

Anyway, if street_name matters more to you, you can do something like this:

GET /adress_test/_search
{
  "query": {
    "multi_match": {
      "query": "StreetToken1",
      "fields": [
        "street_name^3.0",
        "city_name"
      ],
      "type": "most_fields"
    }
  }
}

Does this work for you?
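
What `street_name^3.0` changes can be sketched in plain Python under a simplified model. The model is an assumption, not Elasticsearch code: with `boolean` similarity each matching term scores its field boost, and `most_fields` sums the fields. The `boosted_score` helper and the last two documents are hypothetical.

```python
BOOSTS = {"street_name": 3.0, "city_name": 1.0}

def boosted_score(doc: dict, query: str) -> float:
    """Each matching (field, query term) pair contributes that field's boost."""
    terms = set(query.lower().split())
    return sum(
        boost * len(terms & set(doc[field].lower().split()))
        for field, boost in BOOSTS.items()
    )

doc1 = {"street_name": "StreetToken1", "city_name": "CityToken1"}
doc2 = {"street_name": "StreetToken2", "city_name": "CityToken2 StreetToken1"}

# Single-term query from the thread: the street match now wins.
print(boosted_score(doc1, "StreetToken1"), boosted_score(doc2, "StreetToken1"))

# Hypothetical four-term query: four city matches outweigh one boosted street match.
street_hit = {"street_name": "a", "city_name": "x"}
city_hits = {"street_name": "y", "city_name": "a b c d"}
print(boosted_score(street_hit, "a b c d"), boosted_score(city_hits, "a b c d"))
```

In this model a small boost settles the single-token case, but it does not give a strict hierarchy once several lower-level tokens match.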

Grisha · September 7, 2023, 3:02pm

> I think that with much more volume this will work automatically because StreetToken1 as a city_name will be much more frequent and thus less relevant for the city_name field than for the street_name.

To get more relevant results, I've decided not to use token frequency (that's why I use boolean similarity).

> "street_name^3.0",

Unfortunately, this approach is not enough, because there are a lot of fields (~20) and types of search, so it leads to a loss of score precision. As I mentioned earlier, to keep the score hierarchy, each level should be 10 times greater than the previous one:

{
"fields": [
 "city_synonym",
 "city_name^10.0",
 "region_synonym^100.0",
 "region_name^1000.0",
 "street_synonym^10000.0",
 "street_name^100000.0",
...
]
}
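
How far a factor-of-10 hierarchy can go before float32 rounding breaks the sums can be estimated in plain Python. This is a sketch, assuming that scores behave like IEEE 754 single-precision numbers and that every level contributes exactly one matching token. `to_float32` and `max_exact_levels` are hypothetical helpers, not Elasticsearch APIs.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def max_exact_levels(limit: int = 20) -> int:
    """Largest number of 10x levels whose combined score survives float32 rounding."""
    exact = 0
    for levels in range(1, limit + 1):
        total = sum(10**i for i in range(levels))  # one match on every level
        if to_float32(float(total)) != float(total):
            break
        exact = levels
    return exact

print(max_exact_levels())
```

Under these assumptions the six boosts listed above still sum exactly, but the sum stops being exact well before 20 levels, and several matches per level would lower the limit further.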

system · October 5, 2023, 3:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.