Subselect filter for aggregation

Hi,

I have been searching for a way to solve the following problem by performing a query or aggregation instead of looping through the data and filtering the results myself. I haven't found a solution up until now and was wondering if someone could tell me if it is at all possible in Elasticsearch. The use case is as follows:

Filter on 'last' events with latest entry date.

Sometimes events are ingested multiple times for the same timeseries. This happens when there are newer 'versions' of the same data.

We don't want to replace the existing event, so we ingest the new version with a different entry date. That also lets us do comparisons and analysis later on.

Example:

  • price for 1 april 2022 is 50 on date 1/1/2022, and 60 on date 5/1/2022.

  • price for 2 april 2022 is 10 on date 1/1/2022, and 5 on date 10/2/2022.

We will have 4 events then:

  • event 1
    • value: 50
    • timestamp: 1/4/2022
    • entry date: 1/1/2022
  • event 2
    • value: 60
    • timestamp: 1/4/2022
    • entry date: 5/1/2022
  • event 3
    • value: 10
    • timestamp: 2/4/2022
    • entry date: 1/1/2022
  • event 4
    • value: 5
    • timestamp: 2/4/2022
    • entry date: 10/2/2022

But when retrieving or aggregating the events we want to filter:

  • only the 'last' event with the most recent entry date
    • Would result in:
      • timestamp: 1/4/2022 → value: 60 (entry date: 5/1/2022)
      • timestamp: 2/4/2022 → value: 5 (entry date: 10/2/2022)
  • 'last' before a given reference date (to exclude future events)
    • with reference date 1/2/2022, would result in:
      • timestamp: 1/4/2022 → value: 60 (entry date: 5/1/2022)
      • timestamp: 2/4/2022 → value: 10 (entry date: 1/1/2022)
    • with reference date 1/3/2022, would result in:
      • timestamp: 1/4/2022 → value: 60 (entry date: 5/1/2022)
      • timestamp: 2/4/2022 → value: 5 (entry date: 10/2/2022)
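
To pin down the expected output, here is a minimal Python sketch of the selection rule applied client-side to the four example events. The function name `latest_per_timestamp` and the tuple layout are made up for illustration, and the reference date is assumed to be inclusive. This is exactly the loop-and-filter approach I would like to avoid; it only shows what the query should return.

```python
from datetime import date

# The four example events as (timestamp, entry_date, value).
# Dates in the post are written day/month/year.
EVENTS = [
    (date(2022, 4, 1), date(2022, 1, 1), 50),
    (date(2022, 4, 1), date(2022, 1, 5), 60),
    (date(2022, 4, 2), date(2022, 1, 1), 10),
    (date(2022, 4, 2), date(2022, 2, 10), 5),
]


def latest_per_timestamp(events, reference=None):
    """Return {timestamp: (entry_date, value)}, keeping per timestamp the
    event with the most recent entry date that is not after `reference`."""
    selected = {}
    for timestamp, entry_date, value in events:
        # Skip versions that were ingested after the reference date.
        if reference is not None and entry_date > reference:
            continue
        current = selected.get(timestamp)
        if current is None or entry_date > current[0]:
            selected[timestamp] = (entry_date, value)
    return selected


if __name__ == "__main__":
    for ts, (entry, value) in sorted(latest_per_timestamp(EVENTS, date(2022, 2, 1)).items()):
        print(ts, value, entry)
```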

So basically I could do a DateHistogram aggregation on timestamp (fixed interval 1 ms) which would group all the events per timestamp. But then somehow I should be able to filter on the entry date within each group to select the event I want and return the value.

Any help is greatly appreciated!

Yes, you can do this with a sub-aggregation.

Something like:

{
  "size": 0,
  "query": {
    "range": {
      "entry_date": {
        "lte": "2022-02-01"
      }
    }
  },
  "aggs": {
    "d": {
      "terms": {
        "field": "timestamp"
      },
      "aggs": {
        "latest": {
          "top_metrics": {
            "metrics": {"field": "price"},
            "sort": [
              {
                "entry_date": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Or you can use a top_hits aggregation if you need other fields from the document.
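
A rough sketch of what that top_hits variant could look like, written as a Python dict so it can be dropped into a client call. The helper name `latest_hit_agg` is hypothetical, and the field names (`entry_date`, `price`) are assumptions about your mapping; it assumes one bucket per timestamp, with the newest entry date winning.

```python
import json


def latest_hit_agg(sort_field="entry_date", source_fields=("price", "entry_date")):
    """Build a top_hits sub-aggregation that keeps one document per bucket.

    The field names are placeholders; replace them with the ones in your mapping.
    """
    return {
        "top_hits": {
            "size": 1,  # only the first document after sorting
            "sort": [{sort_field: {"order": "desc"}}],
            "_source": {"includes": list(source_fields)},
        }
    }


if __name__ == "__main__":
    # Use this in place of the "latest" sub-aggregation.
    print(json.dumps({"latest": latest_hit_agg()}, indent=2))
```

Each bucket then carries the selected document under `hits.hits`, so any field listed in `_source` is available, not just the metric.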

Thanks, that did the trick, just needed to set the size on the terms aggregation to get all buckets!
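
For anyone finding this later: the terms aggregation returns only the top 10 buckets by default, which is why the size was needed. Below is a sketch of the request body with the size set, built as a Python dict. The function name `build_latest_query`, the field names and the default of 10000 buckets are assumptions for illustration, not values taken from my actual setup.

```python
import json


def build_latest_query(reference_date, max_buckets=10000):
    """Request body: one bucket per timestamp, each holding the price of the
    newest entry ingested on or before `reference_date`.

    Field names (timestamp, entry_date, price) are assumed, not confirmed.
    """
    return {
        "size": 0,  # no search hits, only aggregation results
        "query": {"range": {"entry_date": {"lte": reference_date}}},
        "aggs": {
            "per_timestamp": {
                # Without "size", only the default 10 buckets come back.
                "terms": {"field": "timestamp", "size": max_buckets},
                "aggs": {
                    "latest": {
                        "top_metrics": {
                            "metrics": {"field": "price"},
                            "sort": [{"entry_date": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_latest_query("2022-02-01"), indent=2))
```

If the number of distinct timestamps gets very large, paging through buckets with a composite aggregation may be a better fit than a single large terms size.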
