Rank based on rarity of a field value

Hi :vulcan_salute:

I'd like to know how I can rank items lower when their field values appear frequently among the results.
Say we have a result set like this:

"name": "Red T-Shirt"
"store": "Zara"

"name": "Yellow T-Shirt"
"store": "Zara"

"name": "Red T-Shirt"
"store": "Bershka"

"name": "Green T-Shirt"
"store": "Benetton"

I'd like to rank the documents so that those containing frequently occurring values in a given field ("store" in this case) are deboosted and appear lower in the results.
This is to achieve a bit of variety, so that the search doesn't return all of its top results from the same store.

In the example above, if I search for "T-Shirt", I want to see one Zara T-Shirt at the top, with the rest of the Zara T-Shirts appearing lower, after all the other unique stores.

So far I've looked into using aggregation buckets for sorting and script-based sorting, but without success.
Is it possible to achieve this inside the search engine?

Many thanks in advance!

Showing top results sorted by natural score but with some diversity can be achieved by nesting a ‘top_hits’ aggregation under a ‘diversified_sampler’ aggregation.
Pagination may be tricky with this approach, though.


Sorry for the late reply, and thank you very much for the help!
Do I understand the idea correctly?

{
  "query": {}, // whatever query
  "size": 0, // since we don't use hits
  "aggs": {
    "my_unbiased_sample": {
      "diversified_sampler": {
        "shard_size": 100,
        "field": "store"
      },
      "aggs": {
        "keywords": {
          "top_hits": {
            "_source": {
              "includes": [ "name", "store" ]
            },
            "size": 100
          }
        }
      }
    }
  }
}

This works! I also wanted to ask about the performance implications of this approach. How much more costly is this compared to not doing it at all, or to doing this kind of "diversification" on the backend?

Yes. One thing to consider is the maximum number of items per store you want to see, which you can control with the diversified sampler's ‘max_docs_per_value’ setting.
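
As a rough sketch building on the request you posted (the match query is just the "T-Shirt" example from earlier), allowing at most two documents per store in the sample would look something like this:

{
  "query": { "match": { "name": "T-Shirt" } },
  "size": 0,
  "aggs": {
    "my_unbiased_sample": {
      "diversified_sampler": {
        "shard_size": 100,
        "field": "store",
        "max_docs_per_value": 2 // keep at most two docs per store in the sample (the default is 1)
      },
      "aggs": {
        "keywords": {
          "top_hits": {
            "_source": {
              "includes": [ "name", "store" ]
            },
            "size": 100
          }
        }
      }
    }
  }
}

Note that documents beyond that per-store limit are dropped from the sample rather than just ranked lower, so choose the limit accordingly.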

It shouldn't be too bad. For matching docs there's a cost in terms of an additional lookup to find the store, and there's a small memory overhead to hold the set of best-matching doc IDs for each unique store. A lot depends on your queries, data, sharding, etc., so benchmarking will give you the most reliable answer.

