Hello,
I'm currently working on a use case with extremely high cardinality (a couple of million combinations), and I'm having trouble understanding how Elasticsearch (ES) behaves with aggregations.
Here are sample documents:
{
  "docid": 1,
  "departurecity": "New York",
  "arrivalcity": "London",
  "passengers": 5
},
{
  "docid": 2,
  "departurecity": "New York",
  "arrivalcity": "London",
  "passengers": 8
},
{
  "docid": 3,
  "departurecity": "Buenos Aires",
  "arrivalcity": "Mexico City",
  "passengers": 20
}
The goal is to get, out of all possible departure-arrival combinations, the TOP 100 most popular routes, ranked by the sum of "passengers" for each combination.
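To make the expected result concrete, here is a minimal Python sketch of the computation I have in mind, run on the three sample documents above. The `top_routes` function is a hypothetical name used only for illustration; this is plain in-memory Python, not anything Elasticsearch does internally.

```python
from collections import Counter

# The three sample documents from above.
docs = [
    {"docid": 1, "departurecity": "New York", "arrivalcity": "London", "passengers": 5},
    {"docid": 2, "departurecity": "New York", "arrivalcity": "London", "passengers": 8},
    {"docid": 3, "departurecity": "Buenos Aires", "arrivalcity": "Mexico City", "passengers": 20},
]


def top_routes(docs, n=100):
    """Sum passengers per (departure, arrival) pair and return the n largest totals."""
    totals = Counter()
    for doc in docs:
        key = (doc["departurecity"], doc["arrivalcity"])
        totals[key] += doc["passengers"]
    # most_common sorts by total, descending, and keeps the first n entries.
    return totals.most_common(n)


print(top_routes(docs))
# → [(('Buenos Aires', 'Mexico City'), 20), (('New York', 'London'), 13)]
```

In other words, the ranking has to be computed over the totals of all combinations, not over a subset of them.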
I'm currently using a composite aggregation as follows:
GET passengers-2020*/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 65000,
        "sources": [
          {
            "departurecity": {
              "terms": {
                "field": "departurecity",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          },
          {
            "arrivalcity": {
              "terms": {
                "field": "arrivalcity",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggregations": {
        "passengers": {
          "sum": {
            "field": "passengers"
          }
        },
        "sortCriteria": {
          "bucket_sort": {
            "sort": [
              {
                "passengers": {
                  "order": "desc"
                }
              }
            ],
            "from": 0,
            "size": 100,
            "gap_policy": "SKIP"
          }
        }
      }
    }
  }
}
I get some results from this aggregation.
Assume we now have the following documents instead:
{
  "docid": 1,
  "departure_arrival": "New York~London",
  "passengers": 5
},
{
  "docid": 2,
  "departure_arrival": "New York~London",
  "passengers": 8
},
{
  "docid": 3,
  "departure_arrival": "Buenos Aires~Mexico City",
  "passengers": 20
}
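A minimal sketch of how such a combined document could be derived from an original one, assuming the two city names are simply joined with "~" as in the samples above. The `combine` helper is hypothetical and only meant to show the shape of the transformation.

```python
def combine(doc, sep="~"):
    """Return a copy of doc with departure and arrival merged into one field.

    Assumption: the separator never occurs inside a city name.
    """
    return {
        "docid": doc["docid"],
        "departure_arrival": f'{doc["departurecity"]}{sep}{doc["arrivalcity"]}',
        "passengers": doc["passengers"],
    }


original = {
    "docid": 1,
    "departurecity": "New York",
    "arrivalcity": "London",
    "passengers": 5,
}
print(combine(original))
# → {'docid': 1, 'departure_arrival': 'New York~London', 'passengers': 5}
```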
What I did was combine departure and arrival into one single field and test the same composite aggregation but on a single field ("departure_arrival" instead of both "departurecity" and "arrivalcity") as follows:
GET passengers-2020*/_search
{
  "track_total_hits": false,
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 65000,
        "sources": [
          {
            "departure_arrival": {
              "terms": {
                "field": "departure_arrival",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "passengers": {
          "sum": {
            "field": "passengers"
          }
        },
        "sortCriteria": {
          "bucket_sort": {
            "sort": [
              {
                "passengers": {
                  "order": "desc"
                }
              }
            ],
            "from": 0,
            "size": 100,
            "gap_policy": "SKIP"
          }
        }
      }
    }
  }
}
This gave me a completely different result.
Then I also tested a terms aggregation, which also gave me completely different results:
GET passengers-2020*/_search
{
  "track_total_hits": true,
  "size": 0,
  "aggs": {
    "my_buckets": {
      "terms": {
        "field": "departure_arrival"
      },
      "aggs": {
        "passengers": {
          "sum": {
            "field": "passengers"
          }
        },
        "sortCriteria": {
          "bucket_sort": {
            "sort": [
              {
                "passengers": {
                  "order": "desc"
                }
              }
            ],
            "from": 0,
            "size": 100,
            "gap_policy": "SKIP"
          }
        }
      }
    }
  }
}
By combining the "departurecity" and "arrivalcity" fields into "departure_arrival", I was hoping to reduce query complexity and improve response time. Instead, the query not only took longer to run but also returned completely different results.
My questions:
- What is the best way to find the TRUE total number of passengers per departure-arrival combination?
- Which mapping is better suited for query performance: separate departure/arrival fields or a combined departure~arrival field?
- Which aggregation can I use to get the top N results?
- Is it possible to get some intuition for how the composite aggregation works under the hood in this use case?
- How does the "size" parameter influence the composite and terms aggregations?