Run aggregation on collapsed result

Hello,
I need to collapse documents by one field after sorting by another field, and then run some complex aggregations on the collapsed output. How can I do that? Can anyone help?

Hi @Nowrin_Hossain
Maybe collapse can help you.

I did not find what I wanted there.

It'd help if you shared why that was not what you needed.

Mainly, I wanted a resource or any kind of help that shows how to run aggregations over collapsed data. I didn't find any documentation about it or any hint on how to do this. Thank you.

The doc linked by @RabBit_BR says collapsing does not affect aggregations. There seems to be no way to aggregate on collapsed results.

Is there any reason why you need to aggregate on the collapsed result rather than use a sub-aggregation under a terms aggregation? If you explain your final goal in more detail, there might be another way to achieve it.

The scenario is that I have a vast number of documents in an index. Every document has the fields 'agency_id', 'data_retrieval_status' and 'uuid'. 'agency_id' is an id, and 'data_retrieval_status' can take one of three values: A, B or C. I need to aggregate first on 'agency_id' with a terms aggregation, then on 'data_retrieval_status' with a terms aggregation, and then count unique uuids. The problem is that the same uuid can appear in buckets A, B and/or C after the terms aggregation on 'data_retrieval_status'. I want to count each uuid only once, according to priority: if a uuid is in bucket A, I do not want to count it in B or C; if it is not in A but is in both B and C, I want to count it in B, not C.

So my idea was to collapse the data first by data_retrieval_status and then run the needed aggregation over it. But unfortunately there is no way to do that.

To decide the active status for each uuid|agency_id pair, you need an aggregation on uuid|agency_id. Then you need an aggregation on that aggregation's result. However, there is no function in Elasticsearch to aggregate on the result of an aggregation. I suppose this is difficult to achieve in a single query.
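To make the two stages concrete, here is a minimal Python sketch of the logic outside Elasticsearch. It is only an illustration of the requirement: the function name, the `PRIORITY` table and the in-memory sample rows are assumptions for the example, not anything Elasticsearch provides.

```python
from collections import Counter

# Lower number = higher priority (A beats B, B beats C), as described above.
PRIORITY = {"A": 0, "B": 1, "C": 2}


def count_by_best_status(docs):
    """Count each (agency_id, uuid) pair once, under its highest-priority status."""
    # Stage 1: group by (agency_id, uuid) and keep the best status seen.
    best = {}
    for agency_id, status, uuid in docs:
        key = (agency_id, uuid)
        if key not in best or PRIORITY[status] < PRIORITY[best[key]]:
            best[key] = status
    # Stage 2: aggregate the stage-1 result by (agency_id, status).
    return Counter((agency, status) for (agency, _uuid), status in best.items())


# Hypothetical rows: (agency_id, data_retrieval_status, uuid)
docs = [
    ("agency_A", "A", "uid0"),
    ("agency_A", "B", "uid0"),
    ("agency_A", "C", "uid0"),
    ("agency_A", "B", "uid1"),
    ("agency_A", "C", "uid1"),
    ("agency_B", "B", "uid2"),
    ("agency_B", "C", "uid2"),
]

counts = count_by_best_status(docs)
for (agency, status), n in sorted(counts.items()):
    print(agency, status, n)
```

Stage 1 is the part that behaves like a collapse, and stage 2 is the aggregation that would have to run on its output, which is why a single search request is awkward here.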

One sample solution is to use the transform function. An example is below.

PUT test_agg_on_collapse
{
  "mappings": {
    "properties": {
      "agency_id": {"type": "keyword"},
      "data_retrieval_status": {"type": "keyword"},
      "uuid": {"type": "keyword"}
    }
  }
}

POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "A",
  "uuid":"uid0"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "B",
  "uuid":"uid0"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "C",
  "uuid":"uid0"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "B",
  "uuid":"uid1"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "C",
  "uuid":"uid1"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_B",
  "data_retrieval_status": "B",
  "uuid":"uid2"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_B",
  "data_retrieval_status": "C",
  "uuid":"uid2"
}

First, add a runtime field to the mapping so that documents can be sorted by data_retrieval_status priority:

PUT test_agg_on_collapse/_mapping
{
  "runtime": {
    "status": {
      "type": "double",
      "script": {
        "source": "emit(params.d[doc['data_retrieval_status'].value]);",
        "params":{
          "d":{
            "A": 0,
            "B": 1,
            "C": 2
          }
        }
      }
    }
  }
}

Then, the following transform:

GET _transform/_preview
{
  "pivot":{
    "group_by":{
      "a":{
        "terms": {
          "field": "agency_id"
        }
      },
      "u":{
        "terms":{
          "field": "uuid"
        }
      }
    },
    "aggs":{
      "s":{
        "top_metrics":{
          "sort": {"status": "asc"},
          "metrics": {"field": "data_retrieval_status"}
        }
      }
    }
  },
  "source":{
    "index": "test_agg_on_collapse"
  }
}

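This transform, once created and started, writes documents like the following to its destination index (the preview only shows them and does not create anything):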

{
  "a" : "agency_A",
  "s" : {
    "data_retrieval_status" : "A"
  },
  "u" : "uid0"
},
{
  "a" : "agency_A",
  "s" : {
    "data_retrieval_status" : "B"
  },
  "u" : "uid1"
},
{
  "a" : "agency_B",
  "s" : {
    "data_retrieval_status" : "B"
  },
  "u" : "uid2"
}

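In the destination index, each uuid appears only once per agency_id, under its highest-priority status, so you can count unique uuids for each agency_id|data_retrieval_status combination with ordinary terms aggregations.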
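As a rough sketch of that last step, the Python below builds a nested terms aggregation body for the destination index and flattens a response into counts. The field names `a` and `s.data_retrieval_status` come from the preview output above; the aggregation names, the helper function, the assumption that those fields end up mapped as keyword, and the hand-written (trimmed) response are all illustrative and not taken from a real cluster.

```python
import json

# Request body for a search against the transform's destination index.
# Assumption: "a" and "s.data_retrieval_status" are keyword fields there.
AGG_BODY = {
    "size": 0,  # only the aggregation result is needed, not the hits
    "aggs": {
        "by_agency": {
            "terms": {"field": "a"},
            "aggs": {
                "by_status": {"terms": {"field": "s.data_retrieval_status"}}
            },
        }
    },
}


def flatten_counts(response):
    """Turn a nested terms/terms response into {(agency, status): doc_count}."""
    result = {}
    for agency in response["aggregations"]["by_agency"]["buckets"]:
        for status in agency["by_status"]["buckets"]:
            result[(agency["key"], status["key"])] = status["doc_count"]
    return result


# Hand-written response, trimmed to the bucket keys and counts, that matches
# the three preview documents shown above.
sample_response = {
    "aggregations": {
        "by_agency": {
            "buckets": [
                {
                    "key": "agency_A",
                    "doc_count": 2,
                    "by_status": {
                        "buckets": [
                            {"key": "A", "doc_count": 1},
                            {"key": "B", "doc_count": 1},
                        ]
                    },
                },
                {
                    "key": "agency_B",
                    "doc_count": 1,
                    "by_status": {"buckets": [{"key": "B", "doc_count": 1}]},
                },
            ]
        }
    }
}

print(json.dumps(AGG_BODY, indent=2))
print(flatten_counts(sample_response))
```

Because each destination document is one agency_id|uuid pair, the bucket doc_count already is the unique uuid count. Note that a terms aggregation returns only the top 10 buckets by default, so with many agencies you would need to raise `size` or page through them with a composite aggregation.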

Thank you. This will help me a lot.