Get the top document from each group with sorting and paging

Hi!
I have a collection of entities and I need to group them by RelationId and select entity with biggest Revision in each group. Result should be ordered by "title" and support paging. Is it possible to do this with ElasticSearch?

Data

RelationId | Title | Revision
-------------------------------
1          | Abx   | 1
1          | Zbc   | 2
2          | Text  | 1 

or json

[{
	"relationId": 1,
	"title": "Abx",
	"revision": 1
}, {
	"relationId": 1,
	"title": "Zbc",
	"revision": 2
}, {
	"relationId": 2,
	"title": "Text",
	"revision": 1
}]

Expected result ordered by Title

RelationId | Title | Revision
-------------------------------
2          | Text  | 1 
1          | Zbc   | 2

I tried "collapse", but result is not sorted correctly

{
  "collapse": {
	"field": "relationId",
	"inner_hits": {
	  "name": "most_recent",                  
	  "size": 1,                              
	  "sort": [
		{
		  "revision": "desc"
		}
	  ]     
	}
  },
  "sort": [
	{
	  "title": {
		"order": "asc"
	  }
	}
  ]
}

and got this (from inner_hits). Inner_hits contains correct values but overall result set is not sorted correctly

RelationId | Title | Revision
-------------------------------
1          | Zbc   | 2
2          | Text  | 1 

result

{
  "hits" : {
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "e2de1644-0f56-4d38-a6fc-f6552901ad51",
        "fields" : {
          "relationId" : [
            1
          ]
        },
        "sort" : [
          "Abx"
        ],
        "inner_hits" : {
          "most_recent" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "index",
                  "_type" : "_doc",
                  "_id" : "653b6ef4-bfa3-4137-a3c5-1a43f63195c2",
                  "_score" : null,
                  "_source" : {
                    "id" : "653b6ef4-bfa3-4137-a3c5-1a43f63195c2",
                    "revision" : 2,
                    "title" : "Zbc",
                    "relationId" : 1
                  },
                  "sort" : [
                    2.0
                  ]
                }
              ]
            }
          }
        }
      },
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "0d150de3-95b0-4911-9bf6-623d7ce9997d",
        "_score" : null,
        "fields" : {
          "relationId" : [
            2
          ]
        },
        "sort" : [
          "Text"
        ],
        "inner_hits" : {
          "most_recent" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "index",
                  "_type" : "_doc",
                  "_id" : "0d150de3-95b0-4911-9bf6-623d7ce9997d",
                  "_score" : null,
                  "_source" : {
                    "id" : "0d150de3-95b0-4911-9bf6-623d7ce9997d",
                    "revision" : 1,
                    "title" : "Text",
                    "relationId" : 2
                  },
                  "sort" : [
                    1.0
                  ]
                }
              ]
            }
          }
        }
      }
    ]
  }
}

If you group by RelationId, then you can't retrieve one document anymore. You can only ask for things like "how many documents have this RelationId" or some aggregated function - it's the same idea as a GROUP BY in SQL. I think you mean "for each unique RelationId, give me the top hit". For that, use a Terms Aggregation with a Top Hits sub-aggregation. I'd suggest reading:

  1. Structure of aggregations to see how to put the Top Hits inside the Terms
  2. Terms Aggregation to get a list of all unique RelationId-s
  3. Top Hits Aggregation to get the doc with max Revision
  4. Title sort: once you've put Top Hits inside Terms, look at "Ordering the buckets by single value metrics sub-aggregation" on the Terms page for sorting the top-level results based on the Top Hits sub-agg results.
  5. Paging: have a look at the paging principles with Elasticsearch.

Build it up step by step and confirm the result is what you expect, don't try it all at once. The next one you write will be a lot easier. Let us know if you run into a problem!

Thanks for the help!
I've tried aggregations first, but I faced with the following comment in the documentation.

Unfortunately the top_hits aggregator can’t be used in the order option of the terms aggregator yet.

Here's what I did. The result is correct, but it looks like according to this remark I cannot sort it by "Title". And it looks like paging (size\from) is not supported either for the whole result set

"aggs": {
  "relation_aggregation": {
    "terms": {
      "field": "relationId",
      "size": 10
    },
    "aggs": {
      "hits_aggregation": {
        "top_hits": {
          "sort": [
            {
              "revision": {
                "order": "desc"
              }
            }
          ],
          "size": 1
        }
      }
    }
  }
}

In SQL my query look like this

;with cte
AS
(
select [relationId], [title], [revision], ROW_NUMBER() OVER (PARTITION BY relationId ORDER BY revision DESC) as RN From [MyTable]
)
SELECT * from cte WHERE RN = 1 order by title
OFFSET     0 ROWS
FETCH NEXT 10 ROWS ONLY

It seems that issue does not have a solution in ElasticSearch.

What was checked:

Composite aggregation

This allows to paginate results using "search after" approach.
Sorting by "Title" property is not supported. Results can be sorted by term "relationId" only.

"aggs": {
    "my_buckets": {
        "composite": {
            "size": 10,
            "sources": [{
                    "product": {
                        "terms": {
                            "field": "relationId"
                        }
                    }
                }
            ]
        },
        "aggs": {
            "hits": {
                "top_hits": {
                    "size": 1,
                    "sort": [{
                            "revision": {
                                "order": "desc"
                            }
                        }
                    ]
                }
            }
        }
    }
}

Combination of Terms and Top Hits aggregations
No pagination and no sorting by field other than "relationId"

"aggs": {
    "groupByRelationId": {
        "terms": {
            "field": "relationId",
            "size": 10
        },
        "aggs": {
            "topSequenceNumber": {
                "top_hits": {
                    "sort": [{
                            "revision": {
                                "order": "desc"
                            }
                        }
                    ],
                    "size": 1
                }
            }
        }
    }
}

Collapse search results
Support pagination (from \ size) up to 10.000 entities.
Sorting does not work as expected if there are updates to the "title" property in different revisions.

Data

RelationId | Title | Revision
-------------------------------
1          | Abx   | 1
1          | Zbc   | 2
2          | Text  | 1 

Result

RelationId | Title | Revision
-------------------------------
1          | Zbc   | 2
2          | Text  | 1 

But "Text" must be before "Zbc"

"collapse": {
    "field": "relationId",
    "inner_hits": {
        "name": "topRevision",
        "size": 1,
        "sort": [{
                "revision": "desc"
            }
        ]
    }
},
"from": 0,
"size": 10,
"sort": [{
        "title": {
            "order": "asc"
        }
    }
]

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.