Aggregation over aggregation on another field + top_hits


(Ion Nistor) #1

Hi!

I'm trying to run an aggregation over a filtered set of documents, but the filter itself requires an aggregation (most recent 'test' for each 'applicant'). The top-level aggregation would be on one field of the document, and the filtering aggregation on a different field.

Given the following set of documents:

{ "test": 1, "applicant": 1, "score": 90, "topic": "geometry" },
{ "test": 2, "applicant": 2, "score": 65, "topic": "physics" },
{ "test": 3, "applicant": 2, "score": 88, "topic": "geometry" },
{ "test": 4, "applicant": 1, "score": 23, "topic": "english" },
{ "test": 5, "applicant": 3, "score": 50, "topic": "physics" },
{ "test": 6, "applicant": 3, "score": 77, "topic": "english" }

We want to find out, for each topic, how many applicants have their highest score in that topic.

In other words, we want to:

  1. Keep only the highest-scoring test for each applicant.
  2. Group (and count) the results by topic.

So, after step 1, we should be left with only:

{ "test": 1, "applicant": 1, "score": 90, "topic": "geometry" },
{ "test": 3, "applicant": 2, "score": 88, "topic": "geometry" },
{ "test": 5, "applicant": 3, "score": 50, "topic": "physics" },
{ "test": 6, "applicant": 3, "score": 77, "topic": "english" }

and for step 2, group and count them by topic:

{ "topic": "geometry", "count": 2 }
{ "topic": "physics", "count": 1 }
{ "topic": "english", "count": 1 }

The trouble is, if I use a terms aggregation with top_hits for the filtering:

{
  "aggs": {
    "applicants": {
      "terms": {
        "field": "applicant",
        "order": { "highest_score": "desc" }
      },
      "aggs": {
        "highest_score": { "max": { "field": "score" } },
        "highest_score_top_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{ "score": { "order": "desc" } }]
          }
        }
      }
    }
  }
}

I get the first step right (top_hits), but if I add a parent aggregation on 'topic', the top_hits aggregation no longer works properly: each applicant's tests are split across different 'topic' buckets, so the max score is computed per applicant within each topic rather than per applicant overall.

It looks like the best approach would be a query filter applied before the 'topic' aggregation, but I was unable to build a filter that preserves only the highest-scoring test for each applicant.
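As a stopgap, the counting could be done client-side from the response of the query above. Here is a minimal Python sketch of that idea, not an Elasticsearch-native answer. The function name `count_topics_of_top_hits` and the hand-built `sample_response` are made up for illustration; the sketch assumes the response has the usual `aggregations.applicants.buckets[]` layout, with each top_hits result under `hits.hits[]._source`, and that the terms aggregation's `size` is large enough to return a bucket for every applicant (the default is 10).

```python
from collections import Counter


def count_topics_of_top_hits(response):
    """Count the topic of each applicant's top-scoring test.

    `response` is assumed to be the parsed JSON body returned for the
    terms + top_hits query above (hypothetical helper, not a client API).
    """
    counts = Counter()
    for bucket in response["aggregations"]["applicants"]["buckets"]:
        hits = bucket["highest_score_top_hits"]["hits"]["hits"]
        if hits:  # top_hits with "size": 1 yields at most one hit
            counts[hits[0]["_source"]["topic"]] += 1
    return dict(counts)


def _bucket(applicant, test, score, topic):
    """Build one trimmed, hand-written bucket for the sample below."""
    source = {"test": test, "applicant": applicant,
              "score": score, "topic": topic}
    return {
        "key": applicant,
        "highest_score_top_hits": {"hits": {"hits": [{"_source": source}]}},
    }


# Illustrative response fragment, trimmed to the fields the helper reads.
sample_response = {
    "aggregations": {
        "applicants": {
            "buckets": [
                _bucket(1, 1, 90, "geometry"),
                _bucket(2, 3, 88, "geometry"),
                _bucket(3, 6, 77, "english"),
            ]
        }
    }
}

topic_counts = count_topics_of_top_hits(sample_response)
print(topic_counts)
```

This moves step 2 out of Elasticsearch entirely, so it only makes sense while the number of applicants is small enough to fetch every bucket in one response.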

I've also posted this question on Stack Overflow, but have had no answers yet :)

I'd appreciate a hint.

Thanks,
ion


(Xin Wang) #2

I stumbled on your question. I'm not sure I understand your description of the result of step 1:

{ "test": 1, "applicant": 1, "score": 90, "topic": "geometry" },
{ "test": 3, "applicant": 2, "score": 88, "topic": "geometry" },
{ "test": 5, "applicant": 3, "score": 50, "topic": "physics" },
{ "test": 6, "applicant": 3, "score": 77, "topic": "english" }

Why is applicant 3's physics score of 50 in there? Their highest score is 77, in english.
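Taking "highest-scoring test per applicant" literally on the six sample documents, a quick check in plain Python (a sketch for illustration only, not an Elasticsearch query; the variable names are my own) keeps just three tests:

```python
from collections import Counter

# The six sample documents from the question.
docs = [
    {"test": 1, "applicant": 1, "score": 90, "topic": "geometry"},
    {"test": 2, "applicant": 2, "score": 65, "topic": "physics"},
    {"test": 3, "applicant": 2, "score": 88, "topic": "geometry"},
    {"test": 4, "applicant": 1, "score": 23, "topic": "english"},
    {"test": 5, "applicant": 3, "score": 50, "topic": "physics"},
    {"test": 6, "applicant": 3, "score": 77, "topic": "english"},
]

# Step 1: keep the single highest-scoring test per applicant.
best = {}
for doc in docs:
    current = best.get(doc["applicant"])
    if current is None or doc["score"] > current["score"]:
        best[doc["applicant"]] = doc

# Step 2: count the surviving tests by topic.
topic_counts = Counter(doc["topic"] for doc in best.values())

kept_tests = sorted(doc["test"] for doc in best.values())
print(kept_tests)          # [1, 3, 6]
print(dict(topic_counts))  # {'geometry': 2, 'english': 1}
```

Under that reading, test 5 drops out and physics gets no count at all.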