Aggregation: take only the latest document per candidate_id

I have data that looks like this:

  [
    {
      "created_at": "2020-05-05",
      "candidate_id": 43,
      "state": "a"
    },
    {
      "created_at": "2020-05-06",
      "candidate_id": 43,
      "state": "b"
    },
    {
      "created_at": "2020-05-05",
      "candidate_id": 44,
      "state": "a"
    },
    {
      "created_at": "2020-05-06",
      "candidate_id": 44,
      "state": "b"
    },
    {
      "created_at": "2020-05-06",
      "candidate_id": 45,
      "state": "a"
    }
  ]

I want a count per state in which each candidate_id contributes only its latest document (by created_at).

Expected response:

{
  key: "a",
  doc_count: 1 // instead of 3: candidates 43 and 44 each have a later document ("created_at": "2020-05-06") with state "b", so their state "a" documents are ignored
},
{
  key: "b",
  doc_count: 2
}

So what I want is for the query to see only one document per candidate_id, the one with the latest created_at.
The aggregation would then run on a data set that excludes the state "a" documents of candidates 43 and 44.

If I use a cardinality aggregation on candidate_id together with an aggregation on state, the count for "a" comes out as 3.

How can I do this? Is there a way to sort by created_at and keep only one document per candidate_id?

This is the closest I could get, but it doesn't fully solve your problem: it returns the latest state of each candidate, not the number of candidates per state.

POST candidates/_doc/1
{
  "created_at": "2020-05-05T03:37:04.863Z",
  "candidate_id" : 43,
  "state": "a"
}
POST candidates/_doc/2
{
  "created_at": "2020-05-06T03:37:04.863Z",
  "candidate_id" : 43,
  "state": "b"
}
POST candidates/_doc/3
{
  "created_at": "2020-05-05T03:37:04.863Z",
  "candidate_id" : 44,
  "state": "a"
}
POST candidates/_doc/4
{
  "created_at": "2020-05-06T03:37:04.863Z",
  "candidate_id" : 44,
  "state": "b"
}
POST candidates/_doc/5
{
  "created_at": "2020-05-06T03:37:04.863Z",
  "candidate_id" : 45,
  "state": "a"
}
GET candidates/_search
{
  "size": 0,
  "aggs": {
    "candidate": {
      "terms": {
        "field": "candidate_id",
        "size": 100
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "created_at": "desc"
              }
            ]
          }
        }
      }
    }
  }
}

Or:

GET candidates/_search
{
  "size": 0,
  "aggs": {
    "candidate": {
      "terms": {
        "field": "candidate_id",
        "size": 100
      },
      "aggs": {
        "latest": {
          "scripted_metric": {
            "init_script": "state.latest_state = null; state.latest_created_at = 0;",
            "map_script": "if(state.latest_created_at < doc.created_at.value.toInstant().toEpochMilli()) { state.latest_created_at = doc.created_at.value.toInstant().toEpochMilli(); state.latest_state = doc['state.keyword'].value; }",
            "combine_script": "return state;",
            "reduce_script": "def latest_created_at = 0; def latest_state = null; for(a in states) { if (a.latest_created_at > latest_created_at) { latest_created_at = a.latest_created_at; latest_state = a.latest_state; } } return latest_state;"
          }
        }
      }
    }
  }
}
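
Either query returns one bucket per candidate, so the missing step (counting candidates per latest state) can be done on the client. Below is a minimal Python sketch of that step, assuming the response has already been parsed into a dict and that the aggregation names are "candidate" and "latest" as in the queries above. The function name count_latest_states is made up for illustration and is not part of any Elasticsearch client.

```python
from collections import Counter


def count_latest_states(response, terms_name="candidate", sub_name="latest"):
    """Count candidates per latest state from a parsed search response.

    Hypothetical helper: works on the JSON body returned by either query above.
    """
    counts = Counter()
    for bucket in response["aggregations"][terms_name]["buckets"]:
        latest = bucket[sub_name]
        if "hits" in latest:
            # top_hits variant: the newest document is the only hit returned
            state = latest["hits"]["hits"][0]["_source"]["state"]
        else:
            # scripted_metric variant: the reduce_script result is under "value"
            state = latest["value"]
        counts[state] += 1
    return dict(counts)
```

Keep in mind that the terms aggregation above is capped at "size": 100, so this tally only covers the candidates that fit in that limit.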

The best approach would be to use a Transform job to make the data set entity-centric (one document per candidate_id holding its latest state); then you'll be able to get the per-state counts with a plain terms aggregation.
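
To illustrate what "entity-centric" means here, the following Python sketch performs the same reduction in memory on the sample documents from the question: keep only the latest document per candidate_id, then count states. It is not Elasticsearch code, and the function names are hypothetical; a Transform would maintain the equivalent one-document-per-candidate index on the Elasticsearch side.

```python
from collections import Counter

# Sample documents from the question
docs = [
    {"created_at": "2020-05-05", "candidate_id": 43, "state": "a"},
    {"created_at": "2020-05-06", "candidate_id": 43, "state": "b"},
    {"created_at": "2020-05-05", "candidate_id": 44, "state": "a"},
    {"created_at": "2020-05-06", "candidate_id": 44, "state": "b"},
    {"created_at": "2020-05-06", "candidate_id": 45, "state": "a"},
]


def latest_per_candidate(docs):
    """Return {candidate_id: newest document}, i.e. the entity-centric view."""
    latest = {}
    for doc in docs:
        current = latest.get(doc["candidate_id"])
        # ISO-8601 dates in the same format compare correctly as strings
        if current is None or doc["created_at"] > current["created_at"]:
            latest[doc["candidate_id"]] = doc
    return latest


def state_counts(docs):
    """Count states over the entity-centric view (what a terms agg would do)."""
    return dict(Counter(d["state"] for d in latest_per_candidate(docs).values()))


print(state_counts(docs))  # {'b': 2, 'a': 1}
```

The result matches the expected response in the question: state "a" is counted once (candidate 45) and state "b" twice (candidates 43 and 44).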
