Hey.
I have a bunch of documents that each set an attribute, "unique_hash". There are duplicates based on that hash, and I want to return exactly one document per distinct value of "unique_hash" (the most recent one for each).
I've started out down the road of a "terms" aggregation with a "top_hits" aggregation, something like this:
"aggs": {
  "my_docs": {
    "terms": {
      "field": "unique_hash.keyword"
    },
    "aggs": {
      "my_docs": {
        "top_hits": {
          "sort": [
            {
              "created_at": "desc"
            }
          ],
          "size": 1
        }
      }
    }
  }
}
I believe this is correct, and it behaves as expected on tiny data sets in development. On my production environment, however, in order to return anything approaching the full set of buckets for my distinct "unique_hash" values, I need to set a high "size" on the outer terms aggregation. When I do, my HTTP client times out before the query completes, and I've also seen the Elasticsearch process exit with a 127 exit code.
The "unique_hash" field is definitely high cardinality, so I'm guessing that's my problem?
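To at least put a number on it, a cardinality aggregation should report an approximate count of distinct "unique_hash" values, which is roughly how many buckets the terms aggregation would have to produce (the "distinct_hashes" name here is just one I picked):

```json
{
  "size": 0,
  "aggs": {
    "distinct_hashes": {
      "cardinality": {
        "field": "unique_hash.keyword"
      }
    }
  }
}
```

The count is approximate (it's HyperLogLog-based), but it should be close enough to size the problem.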
Is there any way to get around this? I tried using partitions on the terms aggregation, but that didn't appear to change the outcome.
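For reference, my understanding is that partitioning a terms aggregation looks something like the following, run once per partition. The num_partitions of 20, the size of 10000, and the "latest" sub-aggregation name are just example values; each partition still needs a "size" large enough to cover its share of the distinct hashes:

```json
{
  "size": 0,
  "aggs": {
    "my_docs": {
      "terms": {
        "field": "unique_hash.keyword",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "sort": [
              {
                "created_at": "desc"
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}
```

Repeating this request for partition values 0 through 19 should cover all the buckets without one enormous response, so I'm not sure why it made no difference in my case.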