Boost with factor calculated on query results

Short version: I want to calculate the factor of a field_value_factor based on the max of a certain field in the results of the same query. Is that possible or do I need to do a second query?

Really long version: Okay, I want to do something... unreasonably complicated, probably. I don't expect Elastic to support it, but I don't want to undersell it either. I'm currently using a simpler global solution, but I have an idea that might improve results.

Here's the scenario: I have product data in Elastic that vaguely looks like this:

{
    "name": "pepsi cola",
    "price": 482,
    "bought": 532
}

Now I want to boost the query based on bought. To ensure that vastly more popular products don't dwarf all the others, I use a log. I've observed the score of my query being somewhere between 5 and 15, so I want the boost to add a maximum of 4 to the score. I enforce this maximum by dividing 4 by the (base-10) log of the highest bought. With 10k bought, this results in a factor of 1, making the field_value_factor look like this:

"field_value_factor": {
    "field": "bought",
    "modifier": "log",
    "factor": 1,
    "missing": 0
},
"boost_mode": "sum"
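As a quick sanity check on the arithmetic above, here is a small Python sketch of what this function would add to the score. It assumes the formula from the Elasticsearch docs as I understand it, weight * log10(factor * bought), and that bought is at least 1 (the log of 0 is not defined). The name log_boost is made up for illustration. One thing the sketch makes visible: factor is applied inside the log, so it only gives the intended cap of 4 when it happens to be 1. For any other maximum, the per-function weight looks like the better place for the scaling.

```python
import math

def log_boost(bought: float, factor: float = 1.0, weight: float = 1.0) -> float:
    """Approximate score added by field_value_factor with modifier "log".

    Assumed formula: weight * log10(factor * bought), with bought >= 1.
    """
    return weight * math.log10(factor * bought)

# Global maximum of 10,000 purchases: factor 1 caps the boost at 4.
print(log_boost(10_000))                       # 4.0
# The example document from above (bought = 532).
print(round(log_boost(532), 2))                # 2.73

# If the maximum were 100 instead, 4 / log10(100) = 2.
# Used as `factor`, it ends up inside the log and misses the cap:
print(round(log_boost(100, factor=2.0), 2))    # 2.3
# Used as `weight`, it scales the result and hits the cap:
print(log_boost(100, weight=2.0))              # 4.0
```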

Now this is great and all. I haven't gotten around to testing it but I'm sure it's just fine. But I have an idea for an improvement I'd like to try out and I don't have a clue how.

The improvement is simple: calculate the factor based on the maximum bought of this query, rather than the global maximum. That would look something kinda like...

"factor": "4 / log(MAX_BOUGHT_OF_THIS_QUERY)"

The question is, is something like this possible at all?

Sorry for the crazy long explanation. Honestly, I wouldn't be too disappointed if it isn't possible, since I'm not even sure this would improve my search results.

Unfortunately, no, something like that isn't possible directly out of the box :confused:

When scoring documents, each document is essentially scored in isolation. The document is scored based on the query, the term frequencies in the document and the document frequencies in the shard. Once the score is generated, it moves on to the next document. The scoring basically has no idea what other scores are being generated.

To make it more difficult, the scoring on one shard is completely isolated from any other shard, as they execute in parallel. So the highest score on one shard may actually be lower than the lowest on a different shard.

So there's really no way to divide by the maximum score (or field quantity, like bought) without first traversing over all the documents to find the max, then executing a secondary phase.

Right now ES doesn't support this kind of multi-phase execution, but it'd be easy for you to fire off a simple aggregation to find the max bought before you start your query (which is, essentially, what ES would have to do anyway). :slight_smile:
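A rough sketch of that two-step approach in Python, building only the request bodies so it doesn't depend on any particular client. The match query on name, the aggregation name max_bought, and both function names are assumptions for illustration. The scaling is put in the per-function weight rather than in factor, since factor is applied before the log. It also assumes every matching document has bought >= 1 ("missing": 1 makes documents without the field contribute log10(1) = 0).

```python
import math

def max_bought_request(text: str) -> dict:
    """Step 1: run the same text query, but only ask for the max of `bought`."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "query": {"match": {"name": text}},
        "aggs": {"max_bought": {"max": {"field": "bought"}}},
    }

def boosted_request(text: str, max_bought, cap: float = 4.0) -> dict:
    """Step 2: boost by log10(bought), scaled so the top product gains `cap`."""
    base_query = {"match": {"name": text}}
    # No usable maximum (no matches, or max <= 1): skip the boost entirely.
    if max_bought is None or max_bought <= 1:
        return {"query": base_query}
    weight = cap / math.log10(max_bought)
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "functions": [{
                    "field_value_factor": {
                        "field": "bought",
                        "modifier": "log",
                        "missing": 1,
                    },
                    "weight": weight,
                }],
                "boost_mode": "sum",
            }
        }
    }
```

The max for step 2 would come from the first response, normally under aggregations.max_bought.value, which can be null when nothing matches. That is why the sketch falls back to the plain query in that case.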

I see. I figured it was a bit much.

If the aggregation is fast, it might even be worth doing. The only reason I wanted to do it in a single query is that sending the request and getting the data back takes far longer than it takes Elasticsearch to do almost anything, so a second query will double the time of a request. I'm going to try it out anyway.

Thank you very much. :slight_smile:

No problem, happy to help! And yeah, unfortunately the round-trip network time really hurts in instances like this, where the query itself is very fast.

The agg itself should be pretty speedy, so I'd at least give it a shot to see if it helps :slight_smile: