Composite aggregation ordered by time over entire data in database

oridool · April 8, 2020, 10:59am

Hello,
I have a requirement to return aggregations over multiple fields, and present in UI the returned aggregations order by time descending.
I wrote the query below which seems to work.
The only problem I noticed is that sorting is performed over the returned buckets only, rather than over the entire data I have in database.
I would like to always get the newest bucket of what I have stored in database, regardless the page size limit I use.
Is there an option to do that?

Here is my query:

    # composite aggregation, multi fields
    GET events-8e037d0a-b0ef-4316-8ccc-8ec24e96c053-2020.03/_search
    {
      "size": 0,
      "aggs": {
        "aggregated_events": {
          "composite": {
            "size": 100, 
            "sources": [
              {
                "type": {
                  "terms": {
                    "field": "actionInfo.eventType"
                  }
                }
              },
              {
                "publisher": {
                  "terms": {
                    "field": "fileInfo.publisher.keyword",
                    "missing_bucket": true
                  }
                }
              },
              {
                "hash": {
                  "terms": {
                    "field": "fileInfo.sha1.keyword",
                    "missing_bucket": true
                  }
                }
              }
            ]
          },
          "aggregations": {
              "max_timestamp": {
                  "max": { "field": "@timestamp" }
            },
            "events_bucket_sort": {
              "bucket_sort": {
                "sort": [
                  {
                    "max_timestamp": {
                      "order": "desc"
                    }
                  }
                ]
              }
            },
            "last_event": {
                "top_hits": {
                  "sort": [
                    {
                      "@timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "fileInfo.fileName",
                      "actionInfo.user"
                    ]
                  },
                  "size": 1
                }
              }
          }
        }
      }
    }

Thank you,
Ori.

Mark_Harwood · April 8, 2020, 11:11am

There's a stock set of questions that normally get asked to lead you to the right answer.

These are available if you run this wizard.

oridool · April 12, 2020, 3:22pm

Hi Mark,
Thank you for your answer.
The wizard does not solve my problem.
When trying the wizard, it eventually lead me to "Transforming data" - https://www.elastic.co/guide/en/elasticsearch/reference/7.6/transforms.html

However, this will not work for me because I need to perform the aggregation with a filter query (not all events in database are participating in aggregation), Plus, this is a beta feature not officially released.

I still wonder - is there no way to sort the aggregations before they are returned by the query?
I searched all over, and read all the documentation and forums and still could not find the way to do that.

Appreciate any help on that,
Ori.

Mark_Harwood · April 12, 2020, 5:42pm

You can supply a query as part of the transform

oridool · April 12, 2020, 7:16pm

Hi Mark,

My filter is dynamic and changing each time I calculate the aggregations because it is coming from UI (a user is selecting some filters and press "search").

Do you suggest to perform a "one-time" transform for every UI request to view the aggregated data (by leaving optional 'frequency' and 'sync' parameters empty)?
Is that the only way in Elasticsearch to sort Composite aggregations (or term aggregations) by the 'max_timestamp' as appears in my original query in this thread?
If doing such a one-time transform, is the transform API going to be faster than doing the same by my application?
Regardless, I fear that iterating over the entire source index and build the aggregated index from it would take more than few seconds. And this will not be a good web user experience.

Thank you,
Ori.

oridool · April 12, 2020, 10:26pm

Hi again, Mark.
So I think I found a solution to my issue, by using term aggregations and order instruction.

GET /events-8e037d0a-b0ef-4316-8ccc-8ec24e96c053-2020.03/_search
{
  "query": {
    "term": {
      "fileInfo.targetType.keyword": {
        "value": "VFPT_EXE"
      }
    }
  },
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "fileInfo.sha1.keyword",
        "size": 13,
        "**order**": {
          "latestTimestamp": "desc"
        }
      },
      "aggs": {
        "latestTimestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "last_event": {
          "top_hits": {
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "@timestamp",
                "actionInfo.user"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}

The only problem here is the limitation to a single aggregated field.
Adding another field to the aggregation creates a hierarchy of buckets and therefore I cannot sort by timestamp.
But - if I need to aggregate by 2 fields f1+f2, I guess I can just create another field named "aggregateBy" during indexing, which is a combination of f1+f2. Then, perform aggregation by the "aggregatedBy" field.
Does this sound like a good approach?

Thanks,
Ori.

Mark_Harwood · April 13, 2020, 7:39am

Hi,
You can use scripts in a terms aggregation to assemble mulitiple fields into a single string at query time.

oridool · April 13, 2020, 10:09am

Thanks Mark, I will look into it.
What about performance in that case? Is using script going to affect performance?
In this aspect, is it better to prepare the aggregation field while indexing?

Ori.

Mark_Harwood · April 13, 2020, 4:11pm

Scripts will likely be slower. Try benchmark it - it may not be an issue.

oridool · April 13, 2020, 4:24pm

Thank you, I did.
Indeed it seems to be slower with a script.
I'd stick to create an "aggregateBy" field in my index.

Thanks for the assistance, seems I now have a solution.

Ori.

system · May 11, 2020, 4:24pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Composite aggregation query with bucket_sort does not work properly Elasticsearch	1	204	August 28, 2022
Sorting results from composite aggregation Elasticsearch	14	3300	August 3, 2020
Composite Aggregation Sorting Based on Field Which is Not Key Elasticsearch	4	477	October 8, 2020
Sort over Aggregations buckets by text field over already sorted result Elasticsearch	0	8	September 10, 2024
Bucket Sort over Composite aggregation degrading the performance of query - Any better way? Elasticsearch	11	1594	October 22, 2019

Composite aggregation ordered by time over entire data in database

Related topics