Composite aggregation ordered by time over entire data in database

Hello,
I have a requirement to return aggregations over multiple fields, and present in UI the returned aggregations order by time descending.
I wrote the query below which seems to work.
The only problem I noticed is that sorting is performed over the returned buckets only, rather than over the entire data I have in database.
I would like to always get the newest bucket of what I have stored in database, regardless the page size limit I use.
Is there an option to do that?

Here is my query:

    # composite aggregation, multi fields
    GET events-8e037d0a-b0ef-4316-8ccc-8ec24e96c053-2020.03/_search
    {
      "size": 0,
      "aggs": {
        "aggregated_events": {
          "composite": {
            "size": 100, 
            "sources": [
              {
                "type": {
                  "terms": {
                    "field": "actionInfo.eventType"
                  }
                }
              },
              {
                "publisher": {
                  "terms": {
                    "field": "fileInfo.publisher.keyword",
                    "missing_bucket": true
                  }
                }
              },
              {
                "hash": {
                  "terms": {
                    "field": "fileInfo.sha1.keyword",
                    "missing_bucket": true
                  }
                }
              }
            ]
          },
          "aggregations": {
              "max_timestamp": {
                  "max": { "field": "@timestamp" }
            },
            "events_bucket_sort": {
              "bucket_sort": {
                "sort": [
                  {
                    "max_timestamp": {
                      "order": "desc"
                    }
                  }
                ]
              }
            },
            "last_event": {
                "top_hits": {
                  "sort": [
                    {
                      "@timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "fileInfo.fileName",
                      "actionInfo.user"
                    ]
                  },
                  "size": 1
                }
              }
          }
        }
      }
    }

Thank you,
Ori.

There's a stock set of questions that normally get asked to lead you to the right answer.

These are available if you run this wizard.

1 Like

Hi Mark,
Thank you for your answer.
The wizard does not solve my problem.
When trying the wizard, it eventually lead me to "Transforming data" - https://www.elastic.co/guide/en/elasticsearch/reference/7.6/transforms.html

However, this will not work for me because I need to perform the aggregation with a filter query (not all events in database are participating in aggregation), Plus, this is a beta feature not officially released.

I still wonder - is there no way to sort the aggregations before they are returned by the query?
I searched all over, and read all the documentation and forums and still could not find the way to do that.

Appreciate any help on that,
Ori.

You can supply a query as part of the transform
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/put-transform.html#put-transform-request-body

Hi Mark,

My filter is dynamic and changing each time I calculate the aggregations because it is coming from UI (a user is selecting some filters and press "search").

  1. Do you suggest to perform a "one-time" transform for every UI request to view the aggregated data (by leaving optional 'frequency' and 'sync' parameters empty)?

  2. Is that the only way in Elasticsearch to sort Composite aggregations (or term aggregations) by the 'max_timestamp' as appears in my original query in this thread?

  3. If doing such a one-time transform, is the transform API going to be faster than doing the same by my application?
    Regardless, I fear that iterating over the entire source index and build the aggregated index from it would take more than few seconds. And this will not be a good web user experience.

Thank you,
Ori.

Hi again, Mark.
So I think I found a solution to my issue, by using term aggregations and order instruction.

GET /events-8e037d0a-b0ef-4316-8ccc-8ec24e96c053-2020.03/_search
{
  "query": {
    "term": {
      "fileInfo.targetType.keyword": {
        "value": "VFPT_EXE"
      }
    }
  },
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "fileInfo.sha1.keyword",
        "size": 13,
        "**order**": {
          "latestTimestamp": "desc"
        }
      },
      "aggs": {
        "latestTimestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "last_event": {
          "top_hits": {
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "@timestamp",
                "actionInfo.user"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}

The only problem here is the limitation to a single aggregated field.
Adding another field to the aggregation creates a hierarchy of buckets and therefore I cannot sort by timestamp.
But - if I need to aggregate by 2 fields f1+f2, I guess I can just create another field named "aggregateBy" during indexing, which is a combination of f1+f2. Then, perform aggregation by the "aggregatedBy" field.
Does this sound like a good approach?

Thanks,
Ori.

Hi,
You can use scripts in a terms aggregation to assemble mulitiple fields into a single string at query time.

Thanks Mark, I will look into it.
What about performance in that case? Is using script going to affect performance?
In this aspect, is it better to prepare the aggregation field while indexing?

Ori.

Scripts will likely be slower. Try benchmark it - it may not be an issue.

Thank you, I did.
Indeed it seems to be slower with a script.
I'd stick to create an "aggregateBy" field in my index.

Thanks for the assistance, seems I now have a solution.

Ori.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.