Group and keep only first result of that group

Is there any way to do a search, group the results, and discard all but one result per group?

We have a simple parent->children relationship (which also forms a group). Our problem is that we have a complex filter, and it's often the case that the parent does not match the filter criteria but one or more of the children do. So we want to ensure we display one result per group even if the parent is not within the filter (the top-sorted child should be used then).

My heavily simplified dataset looks like this:

{ id: "1", groupId: "a", groupSort: 1, date: "2016/01/02" },
{ id: "2", groupId: "a", groupSort: 2, date: "2016/01/02" },
{ id: "3", groupId: "b", groupSort: 1, date: "2016/01/01" },
{ id: "4", groupId: "c", groupSort: 1, date: "2016/01/03" },
{ id: "5", groupId: "c", groupSort: 2, date: "2016/01/02" }

What I want to get is this:

{ id: "3", groupId: "b", groupSort: 1, date: "2016/01/01" },
{ id: "1", groupId: "a", groupSort: 1, date: "2016/01/02" },
{ id: "4", groupId: "c", groupSort: 1, date: "2016/01/03" }

Which means:
Group results by "groupId" with sort "groupSort" ASC and keep only the first result of each group. Then sort the remaining items by "date".

The result set should be pageable.

Or is there any other way to solve our problem (one result per group)?

PS: I thought about creating a document per group and then nesting each member of that group within it. But our documents are pretty huge and pretty complex, and it's possible that a group has 5000+ members. So I fear for the performance if I do so.

For grouping you need aggregations, e.g. a terms aggregation on groupId; your tiebreaker could be bucket_sort. However:

For paging aggregation results you need a composite aggregation. Unfortunately bucket_sort is not supported in a composite aggregation, because it is a pipeline aggregation.
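For illustration, a non-pageable sketch of the terms plus bucket_sort approach could look like this (the aggregation names groups, min_date, and paging are placeholders, and sorting the buckets by the earliest date per group is an assumption about your intent):

    GET test/_search
    {
      "size": 0,
      "aggs": {
        "groups": {
          "terms": { "field": "groupId" },
          "aggs": {
            "min_date": { "min": { "field": "date" } },
            "paging": {
              "bucket_sort": {
                "sort": [{ "min_date": { "order": "asc" } }],
                "from": 0,
                "size": 10
              }
            }
          }
        }
      }
    }

Note that the from/size here only page within the buckets the terms aggregation returned, which is why this does not give you real pagination over all groups.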

But you don't need sort anyway, because you only want the top 1. This is doable with a Painless script executed in a scripted_metric aggregation. We have a similar example in the docs.

The example brings me to the next suggestion, if you want to execute this query regularly, you should consider using transforms. With Transforms you can create a self-updating index that always contains the latest state of your grouping.
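A minimal transform sketch, assuming an index my-index with your date field, a destination index my-index-grouped, and placeholder names throughout:

    PUT _transform/latest-per-group
    {
      "source": { "index": "my-index" },
      "dest": { "index": "my-index-grouped" },
      "sync": { "time": { "field": "date", "delay": "60s" } },
      "pivot": {
        "group_by": {
          "groupId": { "terms": { "field": "groupId" } }
        },
        "aggregations": {
          "latest_date": { "max": { "field": "date" } }
        }
      }
    }

In your case the aggregation side would hold the scripted_metric that picks the one document per group; the destination index then contains exactly one document per groupId, which you can search and page like any other index.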

Hope that helps!

Hi Hendrik, thanks for your answer.

I am not sure yet if they help. I want the full documents, not just the groupId or the top date.

Here is my test dataset:

    PUT test
    {
      "settings": {
        "number_of_shards": 1
      },
      "mappings": {
        "properties": {
          "id": { "type": "keyword" },
          "groupId": { "type": "keyword" },
          "groupSort": { "type": "integer" },
          "date": { "type": "date" }
        }
      }
    }

    POST test/_doc/
    { "id": "1", "groupId": "a", "groupSort": 1, "date": "2020-01-01T09:05:00.7318105" }
    POST test/_doc/
    { "id": "2", "groupId": "a", "groupSort": 2, "date": "2020-01-02T09:05:00.7318105" }
    POST test/_doc/
    { "id": "3", "groupId": "b", "groupSort": 1, "date": "2020-01-01T09:05:00.7318105" }
    POST test/_doc/
    { "id": "4", "groupId": "c", "groupSort": 1, "date": "2020-01-03T09:05:00.7318105" }
    POST test/_doc/
    { "id": "5", "groupId": "c", "groupSort": 2, "date": "2020-01-02T09:05:00.7318105" }

Btw: our dataset has 30 million documents, and a filter could span at most 2 million documents, of which we show the first 30 and allow paging through the first few pages.

The provided example from the docs returns full documents. The only missing piece is your additional sort criteria. E.g. this is what I get:

        {
          "key" : {
            "group_id" : "b"
          },
          "doc_count" : 1,
          "latest_doc" : {
            "value" : {
              "date" : "2020-01-01T09:05:00.7318105",
              "id" : "3",
              "groupSort" : 1,
              "groupId" : "b"
            }
          }
        },

The document is nested under latest_doc.value. Here is the full example query:

GET test/_search
{
  "size": 0,
  "aggs": {
    "c": {
      "composite": {
        "sources": [
          {
            "group_id": {
              "terms": {
                "field": "groupId"
              }
            }
          }
        ]
      },
      "aggs": {
        "latest_doc": {
          "scripted_metric": {
            "init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
            "map_script": """ 
        def current_date = doc['date'].getValue().toInstant().toEpochMilli();
        if (current_date > state.timestamp_latest)
        {state.timestamp_latest = current_date;
        state.last_doc = new HashMap(params['_source']);}
      """,
            "combine_script": "return state",
            "reduce_script": """
              def last_doc = '';
              def timestamp_latest = 0L;
              for (s in states) {
                if (s.timestamp_latest > timestamp_latest) {
                  timestamp_latest = s.timestamp_latest;
                  last_doc = s.last_doc;
                }
              }
              return last_doc
            """
          }
        }
      }
    }
  }
}

With an additional state member, e.g. best_sort_criteria, and an extra if, you can implement your additional sort criterion.
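A sketch of such a variant, keeping the document with the lowest groupSort per group instead of the latest date (the member names best_sort and best_doc are placeholders, not from the docs):

    "latest_doc": {
      "scripted_metric": {
        "init_script": "state.best_sort = Long.MAX_VALUE; state.best_doc = ''",
        "map_script": """
          long current_sort = doc['groupSort'].getValue();
          if (current_sort < state.best_sort) {
            state.best_sort = current_sort;
            state.best_doc = new HashMap(params['_source']);
          }
        """,
        "combine_script": "return state",
        "reduce_script": """
          def best_doc = '';
          long best_sort = Long.MAX_VALUE;
          for (s in states) {
            if (s.best_sort < best_sort) {
              best_sort = s.best_sort;
              best_doc = s.best_doc;
            }
          }
          return best_doc
        """
      }
    }

Combining both criteria (groupSort first, date as tiebreaker) would mean tracking both values in state and comparing them in order inside the if.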

That does not sound like a problem to me; I would start with the query. If you run into problems, e.g. because you run this query a lot (it is expensive despite caching), you can consider using a transform.