How to further filter data by applying conditions to grouped data

Hi, I am researching a way to filter data after grouping. For example, here is the data:

index: domain_user_search
doc:

[
{
"domain_key": "PElvL8li",
"delete_yn": "Y",
"@timestamp": "2024-12-03T05:49:51.000Z"
},
{
"domain_key": "3zit1iG4",
"delete_yn": "N",
"@timestamp": "2024-12-03T05:47:56.000Z"
},
{
"domain_key": "MPfyjQrP",
"delete_yn": "N",
"@timestamp": "2024-12-03T05:47:56.000Z"
},
{
"domain_key": "vBYwrqvO",
"delete_yn": "N",
"@timestamp": "2024-12-03T05:46:56.000Z"
},
{
"domain_key": "PElvL8li",
"delete_yn": "N",
"@timestamp": "2024-12-03T05:45:51.000Z"
}
]

With data like this, I run the following query:

GET /domain_user_mapping/_search
{
  "size": 0,
  "aggs": {
    "group_by_domain_key": {
      "terms": {
        "field": "domain_key.keyword",
        "size": 3
      },
      "aggs": {
        "latest_doc": {
          "filter": {
            "match_all": {}
          },
          "aggs": {
            "sorted_docs": {
              "terms": {
                "field": "@timestamp",
                "size": 1,
                "order": {
                  "_key": "desc"
                }
              },
              "aggs": {
                "filtered_doc_content": {
                  "filter": {
                    "term": {
                      "delete_yn.keyword": "N"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Searching with this query returns the following response:

"aggregations": {
    "group_by_domain_key": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 3,
      "buckets": [
        {
          "key": "PElvL8li",
          "doc_count": 2,
          "latest_doc": {
            "doc_count": 2,
            "sorted_docs": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 1,
              "buckets": [
                {
                  "key": 1733204991000,
                  "key_as_string": "2024-12-03T05:49:51.000Z",
                  "doc_count": 1,
                  "filtered_doc_content": {
                    "doc_count": 0
                  }
                }
              ]
            }
          }
        },
        {
          "key": "3zit1iG4",
          "doc_count": 1,
          "latest_doc": {
            "doc_count": 1,
            "sorted_docs": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": 1733204876000,
                  "key_as_string": "2024-12-03T05:47:56.000Z",
                  "doc_count": 1,
                  "filtered_doc_content": {
                    "doc_count": 1
                  }
                }
              ]
            }
          }
        },
        {
          "key": "MPfyjQrP",
          "doc_count": 1,
          "latest_doc": {
            "doc_count": 1,
            "sorted_docs": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": 1733204876000,
                  "key_as_string": "2024-12-03T05:47:56.000Z",
                  "doc_count": 1,
                  "filtered_doc_content": {
                    "doc_count": 1
                  }
                }
              ]
            }
          }
        }
      ]
    }

For the key PElvL8li, the most recent document has delete_yn set to Y. Its filtered_doc_content count is 0, but the bucket itself is still included, so "PElvL8li", "3zit1iG4", and "MPfyjQrP" are returned.

What I want is for PElvL8li to be excluded from the response because its latest delete_yn is Y, so that I receive three keys: "3zit1iG4", "MPfyjQrP", and "vBYwrqvO".

The requirement is not to remove the documents whose delete_yn is Y first and then group by the latest time. Instead, the documents should be grouped by the latest time first, and a group should then be excluded if its latest document has delete_yn set to Y.

Is this filtering possible with a query?

Hey @leeheeseok:

Sure, filtering is possible with aggregations. You can just issue a query alongside them.

In your example, something similar to:

{
  "size": 0,
  "query": {
    "term": {
      "delete_yn": {
        "value": "N"
      }
    }
  },
  "aggs": {
    ... your aggregations go here
  }
}
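
For instance, a filled-in version might look like the sketch below, sent to the index from the question (GET /domain_user_mapping/_search). It reuses the terms aggregation from the question. The latest_timestamp sub-aggregation is a hypothetical addition for illustration, and the .keyword sub-field is an assumption about the mapping: with the default analyzer, a term query against an analyzed text field would not match the uppercase value "N".

```json
{
  "size": 0,
  "query": {
    "term": {
      "delete_yn.keyword": {
        "value": "N"
      }
    }
  },
  "aggs": {
    "group_by_domain_key": {
      "terms": {
        "field": "domain_key.keyword",
        "size": 3
      },
      "aggs": {
        "latest_timestamp": {
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}
```

Here each bucket only sees documents with delete_yn set to N, and latest_timestamp reports the newest of those per domain_key.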

Aggregations always run on the documents matched by the query. So you can first run the query on its own, check that it returns the documents you want to aggregate, and then add the aggregations to the same request.
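
One caveat: a top-level query removes the Y documents before grouping, so a key such as PElvL8li would still get a bucket from its older N document. If the goal is instead to drop a key whose most recent document is Y, one possible approach is a bucket_selector pipeline aggregation that compares two max values per bucket. The following is an untested sketch, not a definitive solution. The aggregation names (latest_any, not_deleted, latest_not_deleted, keep_if_latest_not_deleted) and the size of 10 are hypothetical, and the .keyword sub-fields are assumed from the query in the question.

```json
{
  "size": 0,
  "aggs": {
    "group_by_domain_key": {
      "terms": {
        "field": "domain_key.keyword",
        "size": 10
      },
      "aggs": {
        "latest_any": {
          "max": { "field": "@timestamp" }
        },
        "not_deleted": {
          "filter": {
            "term": { "delete_yn.keyword": "N" }
          },
          "aggs": {
            "latest_not_deleted": {
              "max": { "field": "@timestamp" }
            }
          }
        },
        "keep_if_latest_not_deleted": {
          "bucket_selector": {
            "buckets_path": {
              "latestAny": "latest_any",
              "latestNotDeleted": "not_deleted>latest_not_deleted"
            },
            "script": "params.latestAny == params.latestNotDeleted"
          }
        }
      }
    }
  }
}
```

The idea is that a bucket is kept only when its newest document overall has the same timestamp as its newest N document. Two things to keep in mind: a Y document and an N document sharing the exact same latest timestamp would still pass, and because pipeline aggregations run after the terms aggregation has picked its top buckets, the response may contain fewer buckets than the requested size.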

Hope that helps!
