Filter buckets after top_hits aggregation

Hi,

I have an index with documents that only have 3 fields: id, timestamp and status.

I want to retrieve the newest document for each id, but if that document's status equals "something", I would like to ignore that bucket completely.

What's the best way to do this?

I was able to retrieve the newest document for each id with the query below but I don't know how to filter buckets based on the document's status.

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "group_by_order_id": {
      "terms": {
        "field": "id.keyword"
      },
      "aggs": {
        "top_group_hits": {
          "top_hits": {
            "sort": [
              {
                "timestamp": {
                  "order": "desc"
                }
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}

Thanks,
Alex.

Bump

Use a query other than "match_all". You can use a bool query with a must_not clause and put a "match" expression in there for the status you want to ignore.
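A minimal sketch of that suggestion, using the field names from the original query (it assumes status is a text field that "match" can query directly; adjust to your mapping):

```json
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": [
        { "match": { "status": "something" } }
      ]
    }
  },
  "aggs": {
    "group_by_order_id": {
      "terms": { "field": "id.keyword" },
      "aggs": {
        "top_group_hits": {
          "top_hits": {
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "size": 1
          }
        }
      }
    }
  }
}
```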

Hi Mark,

I don't think that is going to work.

Suppose I have 2 events:

id: 1, status: A, timestamp: 2020-05-05
id: 1, status: B, timestamp: 2020-04-04

If I filter out status A in the query, I'd get a bucket with the second event (status B).
What I really want is to filter out the bucket if the newest document (in this example, the first event) contains status A. So, the result of my query would be 0 buckets or a single empty bucket (with no hits).

Ah gotcha. Sounds like a “last known status” problem, for which an entity-centric index is best. For that, see the transform API.

Thank you for the tip Mark :slightly_smiling_face:! Will look into that API.

Try with bucket selector https://www.elastic.co/guide/en/elasticsearch/reference/7.6/search-aggregations-pipeline-bucket-selector-aggregation.html
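One way to make bucket_selector fit here, since it only works on numeric bucket paths: compare the newest timestamp in the bucket against the newest timestamp among the "something" documents, and keep the bucket only when they differ. This is a sketch, not tested against this mapping; it assumes status has a keyword sub-field, and uses gap_policy "insert_zeros" so buckets containing no "something" documents evaluate to 0 instead of being skipped:

```json
{
  "size": 0,
  "aggs": {
    "group_by_order_id": {
      "terms": { "field": "id.keyword" },
      "aggs": {
        "latest_ts": { "max": { "field": "timestamp" } },
        "latest_something": {
          "filter": { "term": { "status.keyword": "something" } },
          "aggs": { "ts": { "max": { "field": "timestamp" } } }
        },
        "top_group_hits": {
          "top_hits": {
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "size": 1
          }
        },
        "keep_bucket": {
          "bucket_selector": {
            "buckets_path": {
              "latest": "latest_ts",
              "latestSomething": "latest_something>ts"
            },
            "gap_policy": "insert_zeros",
            "script": "params.latest != params.latestSomething"
          }
        }
      }
    }
  }
}
```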

One of the issues with trimming via bucket selectors is that after trimming you may find you have no buckets left at all and have to go back and ask for more data with additional searches. It can be a workable solution, but it depends on the data, and the worst-case scenario is very inefficient.

Hi Mark,

I tried to use the transform API but it doesn't seem to support the top_hits aggregation.
https://www.elastic.co/guide/en/elasticsearch/reference/current/put-transform.html

How would you do it?

Hi,

Please have a look at this Painless example (it should also work in older versions).
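The core of that approach is a scripted_metric aggregation inside the transform's pivot that keeps the newest _source per group. A rough sketch, adapted to the field names in this thread (the aggregation name and the epoch-millis comparison are illustrative, not a tested transform):

```json
"aggregations": {
  "latest_doc": {
    "scripted_metric": {
      "init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
      "map_script": "def ts = doc['timestamp'].value.toInstant().toEpochMilli(); if (ts > state.timestamp_latest) { state.timestamp_latest = ts; state.last_doc = new HashMap(params['_source']) }",
      "combine_script": "return state",
      "reduce_script": "def last_doc = ''; def timestamp_latest = 0L; for (s in states) { if (s.timestamp_latest > timestamp_latest) { timestamp_latest = s.timestamp_latest; last_doc = s.last_doc } } return last_doc"
    }
  }
}
```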

For filtering out complete buckets/documents, I suggest using a drop processor that runs after the pivot. This can be done with an ingest pipeline, which you can specify as part of the transform destination.
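A minimal sketch of such a pipeline (the pipeline name and the status value are placeholders; it assumes the pivoted document carries the latest status in a top-level status field):

```json
PUT _ingest/pipeline/drop-latest-something
{
  "processors": [
    {
      "drop": {
        "if": "ctx.status == 'something'"
      }
    }
  ]
}
```

The transform would then reference it in its destination, e.g. "dest": { "index": "latest-doc-by-id", "pipeline": "drop-latest-something" }.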

Hi Hendrik,

I followed the painless example and I was able to transform the index into a "last document" index grouped by id.

As my source indices are suffixed with the date the document was indexed (sample indices: docs-2020-05-25, docs-2020-04-24, ...), can I do the same for the index generated by the transform API?
I would like to generate indexes like: latest-doc-by-id-2020-05-25, latest-doc-by-id-2020-05-24, ...

Thanks,
Alex

Great that you found a solution! Regarding your follow-up question:

I am thinking of ingest again and using the set processor to set _index: https://www.elastic.co/guide/en/elasticsearch/reference/current/accessing-data-in-pipelines.html#accessing-metadata-fields.

The name for this index could be based on a field you create with transform, e.g. as part of the script.
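A rough sketch of that idea, assuming the transform script also emits a field holding the latest document's date (the field name "day" here is made up for illustration):

```json
PUT _ingest/pipeline/route-latest-doc
{
  "processors": [
    {
      "set": {
        "field": "_index",
        "value": "latest-doc-by-id-{{day}}"
      }
    }
  ]
}
```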