Hi,
I'm using aggregations over individual shards - these are large (40M docs per shard) but made of small documents. The aggregations are pretty simple
GET status/_search?preference=_shards:5
{
"from": 0,
"size": 0,
"query": {
"range": {
"nextFetchDate": {
"from": null,
"to": "2018-10-10T12:32:33+01:00",
"include_lower": true,
"include_upper": true,
"boost": 1.0
}
}
},
"explain": false,
"track_total_hits": false,
"aggregations": {
"sample": {
"diversified_sampler": {
"field": "hostname",
"shard_size": 5000,
"max_docs_per_value": 2
},
"aggregations": {
"partition": {
"terms": {
"field": "hostname",
"size": 2500,
"min_doc_count": 1,
"shard_min_doc_count": 0,
"show_term_doc_count_error": false,
"order": [{
"top_hit": "asc"
}, {
"_key": "asc"
}]
},
"aggregations": {
"docs": {
"top_hits": {
"from": 0,
"size": 2,
"version": false,
"explain": false,
"sort": [{
"nextFetchDate": {
"order": "asc"
}
}]
}
},
"top_hit": {
"min": {
"field": "nextFetchDate"
}
}
}
}
}
}
}
}
i.e. 2500 buckets containing up to 2 docs. The performance is constant regardless of the number of buckets or docs within buckets, which makes sense given that most of the time is probably spent iterating over all the matching documents anyway.
When I compare the performance of such an aggregation query with a similar query using FieldCollapsing, the latter is a lot slower and needs more memory. It is even slower than using the aggregation without sampling.
Reading the PR and documentation, I was under the impression that Field Collapsing would be faster and cheaper than aggregations but empirical evidence shows otherwise. Am I missing something? Or is it that this is not the right use case for Field Collapsing?
Thanks!