Per bucket scoring in aggregations


(Stefan Henzen) #1

Hi all,

I have documents that are, very simplified, like this:

{
  id: 20,
  base_quality: 10,
  application: [
    { id: 2, quality: 10 },
    { id: 3, quality: 20 }
  ]
}

What I want to do is:

  • Bucket them by application.id
  • Calculate a score for every document in every bucket based on base_quality + application.quality (application.quality where application.id = id of bucket)
  • Get the best scoring document for every bucket
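
To make the three steps above concrete, here is a minimal in-memory sketch in Python of the result being asked for. It is not Elasticsearch code: the function name `best_per_bucket` and the second sample document are made up for illustration, and "best" is assumed to mean the highest score.

```python
def best_per_bucket(docs):
    """Return {application id: (doc id, score)} for the best-scoring doc per bucket."""
    best = {}
    for doc in docs:
        for app in doc["application"]:
            # The score depends on the nested entry that put the doc in this bucket.
            score = doc["base_quality"] + app["quality"]
            current = best.get(app["id"])
            if current is None or score > current[1]:
                best[app["id"]] = (doc["id"], score)
    return best


docs = [
    {"id": 20, "base_quality": 10,
     "application": [{"id": 2, "quality": 10}, {"id": 3, "quality": 20}]},
    # Hypothetical second document, added only so that a bucket has two candidates.
    {"id": 21, "base_quality": 5,
     "application": [{"id": 2, "quality": 30}]},
]

result = best_per_bucket(docs)
```

Note that document 20 gets a different score in bucket 2 than in bucket 3, which is why a single per-document score is not enough.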

It's easy to bucket documents by application.id and to get the best quality for every bucket:

  {
    query: { match_all: {} },
    aggs: {
      nested1: {
        nested: { path: 'application' },
        aggs: {
          terms1: {
            terms: { field: 'application.id' },
            aggs: {
              min_price: {
                min: { script: "doc['application.quality'].value + _source.base_quality" }
              }
            }
          }
        }
      }
    }
  }

But what I want is the document that produces this quality. Is that possible? What I need is something like the top_hits aggregation, but with custom scoring. Maybe with a scripted metric aggregation?

Thanks in advance!


(Colin Goodheart-Smithe) #2

Why not use the function_score query in the query section to score the document based on your criteria and then use the top_hits aggregation to get the top doc for each bucket (the top doc will have a score based on your function_score query)?
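
For reference, a rough sketch of the shape such a request could take, written as a Python dict so it can be dumped to JSON. The aggregation names (`by_application`, `per_id`, `best`) are made up, the nested path `application` is assumed from the sample document, the scoring script is only an example, and script syntax differs between Elasticsearch versions, so treat it as untested against a real cluster.

```python
import json

body = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            # Example script only: scores each top-level document by base_quality.
            "script_score": {"script": "doc['base_quality'].value"},
        }
    },
    "aggs": {
        "by_application": {
            "nested": {"path": "application"},
            "aggs": {
                "per_id": {
                    "terms": {"field": "application.id"},
                    # top_hits orders hits by the main query's score by default.
                    "aggs": {"best": {"top_hits": {"size": 1}}},
                }
            },
        }
    },
}

request_json = json.dumps(body, indent=2)
```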


(Stefan Henzen) #3

@colings86 the problem is that the score of a document can be different in every bucket where the document appears (based on the nested document that caused it to be in that bucket).

You can get a score per nested document in the query, but the combined score for the top-level document is used by the top_hits aggregation.


(Stefan Henzen) #4

Oh, and thanks for the response!


(Colin Goodheart-Smithe) #5

Ok, I had missed the nested agg in there. However, you should be able to just use the sort in the top_hits agg to order the documents by ascending quality field since the base_quality will be the same for all the documents in the same bucket?
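
A sketch of this sort-based variant, as a small Python helper that builds the sub-aggregation to place under the terms bucket. `top_hits_sorted_by` is a hypothetical helper, the field path `application.quality` is assumed from the sample document, and the snippet has not been run against a cluster. The `sort` option of top_hits decides which hits are kept, not only how the returned ones are ordered.

```python
def top_hits_sorted_by(field, order="asc", size=1):
    """Build a top_hits aggregation body that selects hits by a field sort."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {"top_hits": {"size": size, "sort": [{field: {"order": order}}]}}


best = top_hits_sorted_by("application.quality", order="asc")
```

Switch to `order="desc"` if a higher quality is better.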


(Stefan Henzen) #6

I've looked into that. The base_quality can be different for every document, and the application.quality can be different for every nested document. They are bucketed purely on application.id. If I were able to sort them by descending quality + base_quality and then just get the first one, that would be great, but I don't know how.

Also, I assumed sort was only applied to the results actually returned by top_hits, and that those were always determined by _score. If that's not the case, it really is almost exactly what I need, but not quite :grin:.

What I think I need is something like:

query: { match_all: {} },
aggs: {
  nested1: {
    nested: { path: 'application' },
    aggs: {
      terms1: {
        terms: { field: 'application.id' },
        aggs: {
          best_quality: {
            scripted_metric: {
              init_script: "_agg['results'] = []",
              map_script: "_agg.results.add([source: _source, score: doc['application.quality'].value + _source.base_quality])",
              # This is pseudocode; I don't know if it can actually be done
              reduce_script: "result = []; for (a in _aggs) { result.add(a.results.sort().first()) }; return result.sort().first()"
            }
          }
        }
      }
    }
  }
}

But I can't figure out how to do the sorting in the reduce script.
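
One way to think about the reduce step: no full sort is needed, only the maximum by score across the per-shard results. Below is a minimal Python simulation of the init/map/reduce flow, with made-up shard contents (document 21 is hypothetical). It mirrors the logic only and is not Elasticsearch script code.

```python
def init_state():
    # Mirrors init_script: one result list per shard.
    return {"results": []}


def map_doc(state, source, nested_quality):
    # Mirrors map_script: score = nested quality + base_quality of the parent.
    state["results"].append(
        {"source": source, "score": nested_quality + source["base_quality"]}
    )


def reduce_states(states):
    # Mirrors reduce_script: pick the single highest-scoring entry overall.
    candidates = [entry for state in states for entry in state["results"]]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry["score"])


shard_a, shard_b = init_state(), init_state()
map_doc(shard_a, {"id": 20, "base_quality": 10}, nested_quality=10)
map_doc(shard_b, {"id": 21, "base_quality": 5}, nested_quality=30)

best = reduce_states([shard_a, shard_b])
```

In a Groovy script the equivalent selection would likely be a `max { it.score }` over the collected results, though that is untested here.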

