Woefully slow query; how best to optimise this?


(Ryan Berry) #1

Hi there,

Hoping for some guidance on how to optimise the following query. It's the slowest-running query in our app. It runs against some very large indices, one in the region of 1.5B docs. More often than not it exceeds 30 seconds, so the Ruby app running it times out.

The intended result is what we call 'Headline Metrics': a current-period vs previous-period comparison, split by vendor, for any number of isrcs (audio track identifiers). See image below:

The query itself is as follows. I'm pretty certain the repetition in the terms filters can be avoided, but I'd really appreciate some general help here.

{
	"query": {
		"bool": {
			"must": [{
				"terms": {
					"vendor": ["amazon", "itunes", "spotify", "google"]
				}
			}, {
				"terms": {
					"isrc": ["GBCEL1200424", "GBCEL1200425", "GBCEL1200310", "GBCEL1200426", "GBCEL1200427"]
				}
			}]
		}
	},
	"size": 0,
	"aggs": {
		"headline_metrics": {
			"filters": {
				"filters": {
					"streams": {
						"bool": {
							"must": [{
								"range": {
									"date": {
										"gte": "2016-06-05",
										"lte": "2016-12-06"
									}
								}
							}, {
								"terms": {
									"vendor": ["amazon", "itunes", "spotify", "google"]
								}
							}, {
								"terms": {
									"isrc": ["GBCEL1200424", "GBCEL1200425", "GBCEL1200310", "GBCEL1200426", "GBCEL1200427"]
								}
							}]
						}
					}
				}
			},
			"aggs": {
				"data_split": {
					"terms": {
						"field": "isrc"
					},
					"aggs": {
						"total_streams": {
							"sum": {
								"field": "units"
							}
						},
						"date_periods": {
							"date_range": {
								"field": "date",
								"format": "yyyy-MM-dd",
								"ranges": [{
									"from": "2016-09-05",
									"to": "2016-12-06"
								}, {
									"from": "2016-06-05",
									"to": "2016-09-05"
								}]
							},
							"aggs": {
								"unit_sum": {
									"sum": {
										"field": "units"
									}
								},
								"vendor_split": {
									"terms": {
										"field": "vendor",
										"size": 5,
										"order": {
											"one": "desc"
										}
									},
									"aggs": {
										"one": {
											"sum": {
												"field": "units"
											}
										}
									}
								}
							}
						}
					}
				}
			}
		}
	}
}

Thanks in advance!


(Alexander Reelsen) #2

Hey,

have you left out the aggs part and compared execution speeds? Is the aggregation the culprit? Can you remove the nested aggregations step by step and pinpoint this to one specific aggregation?

How many documents and shards are hit?

Are you reusing the same values for aggregations over and over again, so that shard request caching makes sense?

--Alex
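
To make that step-by-step comparison concrete, here is a minimal probe body (my sketch, not something posted in the thread): the original filters with size: 0 and the aggs removed, so the took value in the response isolates the cost of query matching alone.

```json
{
	"size": 0,
	"query": {
		"bool": {
			"must": [{
				"terms": {
					"vendor": ["amazon", "itunes", "spotify", "google"]
				}
			}, {
				"terms": {
					"isrc": ["GBCEL1200424", "GBCEL1200425", "GBCEL1200310", "GBCEL1200426", "GBCEL1200427"]
				}
			}]
		}
	}
}
```

Compare its took against the full request, then re-add one aggregation level at a time until the timing jumps.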


(Ryan Berry) #3

Hey,

That's a good call about taking out the aggs and step by step comparing speeds. I will try that now.

Documents hit will of course vary depending on date range queried, but one six month example I just ran hits approx 140k docs over 5 shards.

Yes, values will be re-used a lot, but the user has flexibility over date range, vendors and isrcs so it's not like it's 100% always going to be the same. I can see caching is working though as subsequent re-runs of the same query are almost instant. It's that initial query that's slow.

Since posting the query above, I've modified it to the following: I've removed the repeated terms filters and moved the overarching date range up into the initial query. This has sped things up a little, but I'd like to get it going even faster :slight_smile:

{
	"query": {
		"bool": {
			"must": [{
				"range": {
					"date": {
						"gte": "2016-06-05",
						"lte": "2016-12-06"
					}
				}
			}, {
				"terms": {
					"vendor": ["amazon", "itunes", "spotify", "google"]
				}
			}, {
				"terms": {
					"isrc": ["GBCEL1200424", "GBCEL1200425", "GBCEL1200310", "GBCEL1200426", "GBCEL1200427"]
				}
			}]
		}
	},
	"size": 0,
	"aggs": {
		"headline_metrics": {
			"filters": {
				"filters": {
					"streams": {
						"bool": {
							"must": []
						}
					}
				}
			},
			"aggs": {
				"data_split": {
					"terms": {
						"field": "isrc"
					},
					"aggs": {
						"total_streams": {
							"sum": {
								"field": "units"
							}
						},
						"date_periods": {
							"date_range": {
								"field": "date",
								"format": "yyyy-MM-dd",
								"ranges": [{
									"from": "2016-09-05",
									"to": "2016-12-06"
								}, {
									"from": "2016-06-05",
									"to": "2016-09-05"
								}]
							},
							"aggs": {
								"unit_sum": {
									"sum": {
										"field": "units"
									}
								},
								"vendor_split": {
									"terms": {
										"field": "vendor",
										"size": 5,
										"order": {
											"one": "desc"
										}
									},
									"aggs": {
										"one": {
											"sum": {
												"field": "units"
											}
										}
									}
								}
							}
						}
					}
				}
			}
		}
	}
}
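
One way to avoid the date_range sub-aggregation entirely (a sketch of mine, not something suggested in the thread, so whether it's actually faster needs measuring) is to make the current/previous split the job of the now-empty top-level filters aggregation, giving each period its own bucket with its own range filter:

```json
"aggs": {
	"headline_metrics": {
		"filters": {
			"filters": {
				"current": { "range": { "date": { "gte": "2016-09-05", "lt": "2016-12-06" } } },
				"previous": { "range": { "date": { "gte": "2016-06-05", "lt": "2016-09-05" } } }
			}
		},
		"aggs": {
			"data_split": {
				"terms": { "field": "isrc" },
				"aggs": {
					"unit_sum": { "sum": { "field": "units" } },
					"vendor_split": {
						"terms": { "field": "vendor", "size": 5 },
						"aggs": { "one": { "sum": { "field": "units" } } }
					}
				}
			}
		}
	}
}
```

This computes the same per-period buckets (range "to"/"lt" being exclusive in both forms), just with the period split done once at the top rather than inside every isrc bucket.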

(Ryan Berry) #4

@spinscale to update you: I just tried removing the aggs as you suggested and it's clear they are slowing things down. Avg. 2,000 ms without aggs, 20,000 ms with!


(Alexander Reelsen) #5

Hey,

I completely forgot, but Elasticsearch 5.0 has support for aggregations in the profile API.

--Alex


(Ryan Berry) #6

Oh cool, that will be a big help! I've only just upgraded to Elastic 5.

Adding profile: true to the query actually produces an error relating to the aggs. The weird thing is that it runs fine without it!?

  "error": {
    "root_cause": [
      {
        "type": "aggregation_execution_exception",
        "reason": "Invalid terms aggregation order path [one]. Terms buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end."
      }
    ],
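
The error points at the order clause on vendor_split. Ordering a terms aggregation by a single metric sub-aggregation like the one sum is normally legal, so one workaround while profiling (my assumption, not confirmed in the thread) is to drop the custom order temporarily and keep everything else intact:

```json
"vendor_split": {
	"terms": {
		"field": "vendor",
		"size": 5
	},
	"aggs": {
		"one": {
			"sum": {
				"field": "units"
			}
		}
	}
}
```

That should let "profile": true run so the per-aggregation timings can be inspected, at the cost of the vendor buckets coming back in default (doc count) order.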

(Ryan Berry) #7

@spinscale just a thought: this query only really uses three fields/attributes from the documents, so I wonder if something analogous to a SQL select date, vendor, isrc rather than the select * it's effectively doing would lighten the load here?

FYI, I picked a much bigger example, 100M hits, and it's still very slow :frowning:

{
  "took": 71211,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 100064664,
    "max_score": 0,
    "hits": []
  },
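
One mapping-level option worth noting here (my addition, not raised in the thread): terms aggregations on keyword fields build global ordinals lazily on the first request after a refresh, which fits the observation earlier that the initial query is slow while cached re-runs are almost instant. Elasticsearch 5.x can shift that cost to refresh time instead, assuming vendor and isrc are keyword fields:

```json
{
	"properties": {
		"vendor": { "type": "keyword", "eager_global_ordinals": true },
		"isrc":   { "type": "keyword", "eager_global_ordinals": true }
	}
}
```

The trade-off is slower refreshes for faster first searches, so it suits indices that are searched far more often than they are written to.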

(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.