Run aggregation on collapsed result

Nowrin_Hossain · June 25, 2022, 11:10pm

Hello,
I need to collapse data by one field after sorting by another field, and then run a fairly complex aggregation on the collapsed output. How can I do that? Can anyone help?

RabBit_BR · June 26, 2022, 10:04pm

Hi @Nowrin_Hossain
Maybe collapse can help you.

Nowrin_Hossain · June 27, 2022, 7:04am

That is not what I was looking for.

warkolm · June 28, 2022, 1:08am

It'd help if you shared why that was not what you needed.

Nowrin_Hossain · July 3, 2022, 7:25am

Mainly, I wanted a resource or some help explaining how to run an aggregation over collapsed data. I didn't find any documentation about it or any hint on how to do this. Thank you.

Tomo_M · July 3, 2022, 10:16am

The doc linked by @RabBit_BR says collapsing does not affect aggregations. There seems to be no way to aggregate on collapsed results.
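
As a rough illustration (the index name `my-index` and the fields `group_field` and `sort_field` are placeholders, not taken from this thread), a search like the one below returns only the top hit per `group_field` value, while the aggregation is still computed over every matching document rather than over the collapsed hits:

```
GET my-index/_search
{
  "sort": [{"sort_field": "asc"}],
  "collapse": {"field": "group_field"},
  "aggs": {
    "docs_per_group": {
      "terms": {"field": "group_field"}
    }
  }
}
```

The `doc_count` of each `docs_per_group` bucket counts all matching documents in that group, even though the hits section shows one document per group.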

Is there any reason why you want to aggregate on the collapsed result rather than use a sub-aggregation under a terms aggregation? If you explain your final goal in more detail, there might be another way to achieve it.

Nowrin_Hossain · July 3, 2022, 11:00am

The scenario is that I have a vast number of documents in an index. Every document has the fields 'agency_id', 'data_retrieval_status' and 'uuid'. 'agency_id' is an id, and 'data_retrieval_status' can take one of three values: A, B or C. I need to aggregate first by 'agency_id' with a terms aggregation, then by 'data_retrieval_status' with a terms aggregation, and then count the unique uuids. The catch is that the same uuid can appear in buckets A, B and/or C after the terms aggregation on 'data_retrieval_status'. I want to count each uuid only once, according to priority: if a uuid is in bucket A, I do not want to count it in B or C; if it is not in A but is in both B and C, I want to count it in B, not C.

So, my idea was to collapse the data first by data_retrieval_status and then run the needed aggregation over it. Unfortunately, there seems to be no way to do that.

Tomo_M · July 3, 2022, 2:42pm

To decide the active status for each uuid|agency_id pair, you need an aggregation on uuid|agency_id. Then you need an aggregation on the result of that aggregation. However, there is no function in Elasticsearch to aggregate on the result of an aggregation. I suppose this is difficult to achieve in a single query.

One possible solution is to use a transform. An example follows.

PUT test_agg_on_collapse
{
  "mappings": {
    "properties": {
      "agency_id": {"type": "keyword"},
      "data_retrieval_status": {"type": "keyword"},
      "uuid": {"type": "keyword"}
    }
  }
}

POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "A",
  "uuid":"uid0"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "B",
  "uuid":"uid0"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "C",
  "uuid":"uid0"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "B",
  "uuid":"uid1"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_A",
  "data_retrieval_status": "C",
  "uuid":"uid1"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_B",
  "data_retrieval_status": "B",
  "uuid":"uid2"
}
POST test_agg_on_collapse/_doc
{
  "agency_id": "agency_B",
  "data_retrieval_status": "C",
  "uuid":"uid2"
}

First, add a runtime field to the index mapping so that documents can be sorted by the priority of data_retrieval_status:

PUT test_agg_on_collapse/_mapping
{
  "runtime": {
    "status": {
      "type": "double",
      "script": {
        "source": "emit(params.d[doc['data_retrieval_status'].value]);",
        "params":{
          "d":{
            "A": 0,
            "B": 1,
            "C": 2
          }
        }
      }
    }
  }
}
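
As a quick sanity check (a sketch, not part of the original steps), the runtime field can be requested through `fields` and used for sorting. Note that the script assumes every document has `data_retrieval_status` set to A, B or C; a missing or unexpected value would likely make the script fail.

```
GET test_agg_on_collapse/_search
{
  "_source": false,
  "fields": ["status", "data_retrieval_status", "uuid"],
  "sort": [{"status": "asc"}]
}
```

With the mapping above, documents with status A should sort first, then B, then C.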

Then, the following transform:

GET _transform/_preview
{
  "pivot":{
    "group_by":{
      "a":{
        "terms": {
          "field": "agency_id"
        }
      },
      "u":{
        "terms":{
          "field": "uuid"
        }
      }
    },
    "aggs":{
      "s":{
        "top_metrics":{
          "sort": {"status": "asc"},
          "metrics": {"field": "data_retrieval_status"}
        }
      }
    }
  },
  "source":{
    "index": "test_agg_on_collapse"
  }
}

would write documents like the following to its destination index (the `_preview` request itself only returns them and does not create the index):

{
  "a" : "agency_A",
  "s" : {
    "data_retrieval_status" : "A"
  },
  "u" : "uid0"
},
{
  "a" : "agency_A",
  "s" : {
    "data_retrieval_status" : "B"
  },
  "u" : "uid1"
},
{
  "a" : "agency_B",
  "s" : {
    "data_retrieval_status" : "B"
  },
  "u" : "uid2"
}

On these documents you can count the unique uuid values for each agency_id|data_retrieval_status combination, with each uuid counted only under its highest-priority status.
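
Since `_preview` only returns the documents, one way to materialize and count them is sketched below. The transform id `agency_uuid_status` and the destination index `test_agg_on_collapse_dest` are names made up for this example, and the search assumes the transform maps `a` and `s.data_retrieval_status` as keyword fields in the destination index (worth checking with `GET test_agg_on_collapse_dest/_mapping`).

```
PUT _transform/agency_uuid_status
{
  "source": {"index": "test_agg_on_collapse"},
  "dest": {"index": "test_agg_on_collapse_dest"},
  "pivot": {
    "group_by": {
      "a": {"terms": {"field": "agency_id"}},
      "u": {"terms": {"field": "uuid"}}
    },
    "aggs": {
      "s": {
        "top_metrics": {
          "sort": {"status": "asc"},
          "metrics": {"field": "data_retrieval_status"}
        }
      }
    }
  }
}

POST _transform/agency_uuid_status/_start

GET test_agg_on_collapse_dest/_search
{
  "size": 0,
  "aggs": {
    "by_agency": {
      "terms": {"field": "a"},
      "aggs": {
        "by_status": {
          "terms": {"field": "s.data_retrieval_status"}
        }
      }
    }
  }
}
```

Because the destination index holds exactly one document per agency_id|uuid pair, the `doc_count` of each `by_status` bucket is already the number of unique uuids. For the sample documents above that should be one uuid under A and one under B for agency_A, and one under B for agency_B.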

Nowrin_Hossain · July 4, 2022, 8:08am

Thank you. This will help me a lot.

system · August 1, 2022, 8:09am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
