Hi, I'm trying to implement a kind of score hierarchy using different boosts for different fields and search types (there are multiple fields and several types of search: full match, fuzzy, etc.).
Simplified example of boosts:
field_1 fuzzy boost - 1
field_1 full match boost - 10
field_2 fuzzy boost - 100
field_2 full match boost - 1000
...
Each level of the score hierarchy is 10 times greater than the previous one.
It's an artificial limit (10 tokens of level 2 will have the same score as 1 token of level 1).
Since scores are positive 32-bit floating point numbers, such a score hierarchy cannot be implemented because of the loss of precision.
Is there any way to implement such a score hierarchy?
Thank you in advance.
UPD. Scores from different levels should be summed correctly (without loss of precision).
If I understood correctly, the idea is to use multiple search requests.
In this case, it will not be possible to get the sum of all boosts across the different fields and search types.
Sorry, I should have mentioned this in the description.
With a single request, precision is lost. There are more than 50 score hierarchy levels, and each level is 10 times greater than the previous one. For instance: a score of 100 000 000 plus a score of 1 is still 100 000 000.
With multiple requests, it will not be possible to sum the boosts from all the requests for the same documents, because there are a lot of documents (it would be too slow).
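The precision loss described above can be reproduced outside Elasticsearch. This is a minimal sketch that rounds Python floats to 32 bits with the standard `struct` module; the helper names `to_float32` and `count_exact_levels` are made up for this illustration, and it only models the rounding of a single stored value, not how Lucene combines scores internally.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def count_exact_levels(base: int = 10) -> int:
    """Count hierarchy levels (boosts base**0, base**1, ...) for which
    adding a score of 1 to the boost still changes the 32-bit result."""
    levels = 0
    while True:
        boost = float(base ** levels)
        if to_float32(boost + 1.0) != boost + 1.0:
            return levels
        levels += 1


big = to_float32(100_000_000.0)
print(to_float32(big + 1.0) == big)  # True: the +1 is lost

# A 32-bit float has a 24-bit significand, so integers are only
# guaranteed to be exact up to 2**24 = 16 777 216.
print(count_exact_levels(10))  # 8 (boosts 1 through 10 000 000)
```

With a factor of 10 per level, only the first 8 levels survive this check, far fewer than the 50+ levels mentioned above.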
@dadoonet Could you please correct me if I am wrong, and explain how to implement such a score hierarchy?
To keep a strict hierarchy: one token on level 1 (matched by field_1) should have a greater score than any number of tokens on level 2 (matched by field_2).
Since I don't know how to implement the strict hierarchy, I use such large numbers as scores.
Not sure if that will be possible and fast enough though.
I thought about a custom script to normalize scores on each level of the hierarchy, but I didn't do that for performance reasons. Normalization would have some precision problems too.
What is the use case? I think I have never heard about such a use case and I'm wondering if you are trying to solve a problem the right way...
I'm trying to implement address autocomplete.
Each address has multiple fields: country, region, city, street, etc.
Using the score hierarchy, I'm trying to achieve more relevant results.
For instance: one token matched in the street field should score higher than 10 (ideally any number of) tokens matched in the country/city/region fields (in one of these fields or any combination of them).
A correct sum of scores from different hierarchy levels is also important.
For instance: a document with one token matched in the street field and one token matched in the city field should score higher than a document with one token matched in the street field only.
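The ordering described here amounts to comparing per-level match counts lexicographically rather than adding them into one number. The sketch below only expresses that requirement in plain Python; it is not an Elasticsearch feature. The names `LEVELS` and `rank_key` and the sample documents are hypothetical, and the relative order of city, region and country is an assumption.

```python
from typing import Dict, Tuple

# Hypothetical priority order, highest level first.
# Only "street above the rest" comes from the description above.
LEVELS = ["street", "city", "region", "country"]


def rank_key(matches: Dict[str, int]) -> Tuple[int, ...]:
    """Turn per-field matched-token counts into a tuple that Python
    compares level by level, so a higher level always wins."""
    return tuple(matches.get(level, 0) for level in LEVELS)


docs = {
    "street_only": {"street": 1},
    "street_and_city": {"street": 1, "city": 1},
    "many_city_tokens": {"city": 10, "country": 5},
}

ranked = sorted(docs, key=lambda name: rank_key(docs[name]), reverse=True)
print(ranked)  # ['street_and_city', 'street_only', 'many_city_tokens']
```

Because tuples are compared element by element, no boost has to grow by a factor of 10 per level, so the 32-bit precision limit does not come into play in this formulation.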
I'd go for something like this first before trying something more complex.
Maybe I'd boost the street, but in any case you should not end up with a score of 100 000 000...
If I understand this demo correctly, it will not work as expected from the hierarchy point of view:
a document with multiple tokens matched in the city field will have a greater score than a document with one token matched in the street_name field.
Hi @dadoonet, sorry for the delay.
I've reproduced the problem.
In the example below, both addresses have the same score (1.0) when searching for StreetToken1.
I want the address with StreetToken1 in the street name to come first; it should have a greater score than the second address, which has StreetToken1 in the city name.
PS. I used boolean similarity for better relevance of results.
I think that with much more volume this will work automatically because StreetToken1 as a city_name will be much more frequent and thus less relevant for the city_name field than for the street_name.
Anyway, if the street_name is more important to you, you can do something like this:
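A minimal sketch of such a field boost, assuming a `multi_match` query over the `street_name` and `city_name` fields. Only the `street_name^3.0` boost appears in this thread; the rest of the request body and the helper name `build_address_query` are assumptions for illustration.

```python
import json


def build_address_query(text: str) -> dict:
    """Build a search body that weights street_name matches 3x.
    The field list and query type are assumed, not taken from the thread."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # "^3.0" multiplies the score contribution of street_name
                "fields": ["street_name^3.0", "city_name"],
            }
        }
    }


print(json.dumps(build_address_query("StreetToken1"), indent=2))
```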
To get more relevant results, I've decided not to use token frequency (that's why I use boolean similarity).
"street_name^3.0",
Unfortunately, this approach is not enough, because there are a lot of fields (~20) and types of search, so it leads to a loss of score precision. As I mentioned earlier, to have the score hierarchy, each level should be 10 times greater than the previous one: