Get the latest document version and aggregate the results


(Taras Kohut) #1

My index contains a lot of documents, and each of them has several versions. For example:

{"doc_id": 13,
"version": 1,
"text": "bar"}

{"doc_id": 13,
"version": 2,
"text": "bar"}

{"doc_id": 13,
"version": 3,
"text": "bar"}

{"doc_id": 14,
"version": 1,
"text": "foo"}

{"doc_id": 14,
"version": 2,
"text": "bar"}

I want to get the latest version of each document and then aggregate those latest versions with a terms aggregation.
I tried using top_hits to retrieve the latest versions:

{
    "size" : 0,
    "aggs" : {
        "doc_id_groups" : {
            "terms" : {
                "field" : "doc_id",
                "size" : "0"
            },
            "aggs" : {
                "docs" : {
                    "top_hits" : {
                        "size" : 1,
                        "sort" : {
                            "version" : {
                                "order" : "desc"
                            }
                        }
                    },
                    "aggs" : {
                        "text_agg" : {
                            "terms" : { "field" : "text" }
                        }
                    }
                }
            }
        }
    }
}

But I can't use the text_agg aggregation, because top_hits doesn't support sub-aggregations.
I'm expecting this response, since the latest versions of both doc 13 and doc 14 have the text "bar": "buckets": [ { "key": "bar", "doc_count": 2 }]
I guess retrieving the ids and then aggregating them on the client would be a very heavy operation.
Maybe scripting could help?
Update: I found a rather inflexible workaround. See here: http://stackoverflow.com/a/39788948/4769188
But I'm still looking for a better solution.
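
To make the expected result concrete, this is roughly the computation I would otherwise have to do on the client. It is only a minimal Python sketch: the helper names latest_versions and aggregate_text are made up for illustration, it assumes all documents have already been fetched into memory, and it treats each "text" value as a single term (which holds for the sample data above).

```python
from collections import Counter

# Sample documents from the post.
docs = [
    {"doc_id": 13, "version": 1, "text": "bar"},
    {"doc_id": 13, "version": 2, "text": "bar"},
    {"doc_id": 13, "version": 3, "text": "bar"},
    {"doc_id": 14, "version": 1, "text": "foo"},
    {"doc_id": 14, "version": 2, "text": "bar"},
]


def latest_versions(docs):
    """Return a dict mapping doc_id to the document with the highest version."""
    latest = {}
    for doc in docs:
        current = latest.get(doc["doc_id"])
        if current is None or doc["version"] > current["version"]:
            latest[doc["doc_id"]] = doc
    return latest


def aggregate_text(docs):
    """Count "text" values over the latest versions only, terms-aggregation style."""
    counts = Counter(doc["text"] for doc in latest_versions(docs).values())
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


buckets = aggregate_text(docs)
print(buckets)
```

For an index with many documents this means pulling every latest version over the network first, which is exactly the cost I would like to avoid by doing it inside Elasticsearch.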

