Rank based on rarity of a field value

Hi :vulcan_salute:

I'd like to know how I can rank items lower when their field values appear frequently among the results.
Say we have a result set like this:

"name": "Red T-Shirt"
"store": "Zara"

"name": "Yellow T-Shirt"
"store": "Zara"

"name": "Red T-Shirt"
"store": "Bershka"

"name": "Green T-Shirt"
"store": "Benetton"

I'd like to rank the documents so that those containing frequently occurring values in a given field ("store" in this case) are deboosted and appear lower in the results.
This is to achieve a bit of variety, so that the search doesn't return all of its top results from the same store.

In the example above, if I search for "T-Shirt", I want to see one Zara T-Shirt at the top, with the rest of the Zara T-Shirts appearing lower, after all the other unique stores.

So far I've looked into using aggregation buckets for sorting and script-based sorting, but without success.
Is it possible to achieve this inside the search engine?

Many thanks in advance!

Showing top results sorted by natural score but with some diversity can be achieved by nesting a ‘top_hits’ aggregation under a ‘diversified_sampler’ aggregation.
Pagination may be tricky with this approach, though.


Sorry for the late reply, and thank you very much for the help!
Do I understand the idea correctly?

{
  "query": {}, // whatever query
  "size": 0, // since we don't use hits
  "aggs": {
    "my_unbiased_sample": {
      "diversified_sampler": {
        "shard_size": 100,
        "field": "store"
      },
      "aggs": {
        "keywords": {
          "top_hits": {
            "_source": {
              "includes": [ "name", "store" ]
            },
            "size": 100
          }
        }
      }
    }
  }
}

This works! I also wanted to ask about the performance implications of this approach. How much more costly is this compared to not doing it at all, or to doing this kind of "diversification" on the backend?

Yes. One thing to consider is the maximum number of items per store you want to see, which you can control with the diversified sampler's ‘max_docs_per_value’ setting.
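
As a rough sketch building on the request you posted (the match query is just the "T-Shirt" example from earlier), allowing at most two documents per store in the sample would look something like this:

{
  "query": { "match": { "name": "T-Shirt" } },
  "size": 0,
  "aggs": {
    "my_unbiased_sample": {
      "diversified_sampler": {
        "shard_size": 100,
        "field": "store",
        "max_docs_per_value": 2 // keep at most two docs per store in the sample (the default is 1)
      },
      "aggs": {
        "keywords": {
          "top_hits": {
            "_source": {
              "includes": [ "name", "store" ]
            },
            "size": 100
          }
        }
      }
    }
  }
}

Note that documents beyond that per-store limit are dropped from the sample rather than just ranked lower, so choose the limit accordingly.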

It shouldn't be too bad. For matching docs there's a cost in terms of an additional lookup to find the store, and there's a small memory overhead to hold the set of best-matching doc IDs for each unique store. A lot depends on your queries, data, sharding, etc., so benchmarking will give you the most reliable answer.

