How to retrieve one document from many grouped by the same field

(Slava G ) #1

I have documents that each contain my own internal id field and the date when the document was indexed. They are all different versions of the same document. In a query, I want to return only one document out of all those sharing the same id: the version that was relevant for a given date. How can I do that? I couldn't find a way to do it in Elasticsearch.


(Zachary Tong) #2

You could use a terms aggregation to group by the "id", then a top_hits agg to show a single "top" document from each ID. Something like this:

    "aggs": {
        "ids": {
            "terms": {
                "field": "internal_id"
            },
            "aggs": {
                "most_recent_doc": {
                    "top_hits": {
                        "sort": [
                            {
                                "last_activity_date": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "size": 1
                    }
                }
            }
        }
    }

You can find more details about top_hits here:

(Slava G ) #3

Thanks, I was thinking about top_hits, but there are 2 concerns:

  1. Paging - can I use paging here? For example, I want to fetch the results page by page, 50 items at a time (after the terms agg).
  2. I have _source disabled, so in the aggs result I get only the item ID and can't retrieve the fields that have store = yes, and I need them.


(Zachary Tong) #4

Not really, no. You can specify the number of top-hits you want, but you can't page through them. You'll just get back the size you request.

Ah, well, if you have _source disabled, top-hits won't be very useful. It simply shows you the source of the top documents in each bucket. You could try the _source include/exclude feature; it may be able to load stored fields. If not, you could try the fielddata fields feature or script fields.

Really, the better option is to reindex and save _source this time. It's always handy to have the original source =)

(Slava G ) #5

Thanks for your reply.
Without paging it's a problem. The idea is to show the customer pages of 50 items, but only the latest version of each document, and without paging I can't let them page through the results.
As for _source, the issue is storage size. It's expensive, and the source document is stored in another storage anyway, so the redundancy is costly. That is why _source is not enabled.

So, it looks like I'm stuck here.

(Zachary Tong) #6

Hmm, I'm a bit confused?

Do you want to show the top 50 IDs, and the most recent document for each ID? If this is the case, you can just use the Terms agg, collect all the results, and apply pagination in your application.
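
A rough sketch of that application-side pagination, in Python. It assumes the response has the shape produced by the `ids` / `most_recent_doc` aggregation above, and that the terms agg was asked for enough buckets (via its `size`) to cover the pages you need. The function name `latest_docs_page` and the sample data are made up for illustration.

```python
def latest_docs_page(response, page, page_size=50):
    """Return one page of (internal_id, latest_hit) pairs.

    `response` is the parsed JSON of a search that used a terms agg named
    "ids" with a top_hits sub-agg named "most_recent_doc" (size 1).
    Pages are 1-based; a page past the end yields an empty list.
    """
    buckets = response["aggregations"]["ids"]["buckets"]
    start = (page - 1) * page_size
    rows = []
    for bucket in buckets[start:start + page_size]:
        hits = bucket["most_recent_doc"]["hits"]["hits"]
        # Each bucket carries at most one hit because top_hits size is 1.
        rows.append((bucket["key"], hits[0] if hits else None))
    return rows


# Hand-made response with 120 IDs, only to exercise the function.
sample = {
    "aggregations": {
        "ids": {
            "buckets": [
                {
                    "key": "doc-%d" % i,
                    "doc_count": 3,
                    "most_recent_doc": {"hits": {"hits": [{"_id": "v%d" % i}]}},
                }
                for i in range(120)
            ]
        }
    }
}

page3 = latest_docs_page(sample, page=3)  # buckets 100..119
```

Note that every request still pulls all the buckets back, so this only stays practical while the number of distinct IDs is moderate.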

Or do you want to show the top 50 IDs, and show a list of documents for each ID, sorted by timestamp? In that case, I would just execute a separate search to get those documents. E.g. use the Terms aggregation to determine your top IDs to show, then execute a separate search for each of your top IDs and sort that search by time descending. This will allow you to paginate.
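
The per-ID follow-up search could be built along these lines. This is only a sketch: it assumes `internal_id` is indexed as a single exact value (not analyzed) so a `term` query matches it, and it reuses the `last_activity_date` field from the earlier snippet. The helper name `versions_query` is hypothetical.

```python
def versions_query(internal_id, page, page_size=50):
    """Build a search body listing one ID's versions, newest first.

    Pages are 1-based and map onto the search API's from/size parameters.
    """
    return {
        "query": {"term": {"internal_id": internal_id}},
        "sort": [{"last_activity_date": {"order": "desc"}}],
        "from": (page - 1) * page_size,
        "size": page_size,
    }


# Second page of versions for one (made-up) ID.
body = versions_query("abc-123", page=2)
```

The resulting dict is sent as the body of an ordinary search request, once per ID returned by the terms aggregation.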

There's not much we can do about _source if it's gone. You'll have to fetch the document from your other document store. If ES doesn't have it stored, it can't return it.

(Slava G ) #7

Actually, I want to display documents, 50 (as an example) per page (there could be many documents), with the ability to jump to an arbitrary page. The customer should see only the latest version of each document. Each document can be indexed a number of times: every change to the document causes it to be indexed again, because I want to index all versions of the document, but the internal id is always the same. So I want to display the latest document, meaning group by this id and take the document with the latest date. But if I can't do paging, I can't jump directly to page 5 (as an example) :(
I hope it's clearer now :)

