Woefully slow query; how best to optimise this?


(Ryan Berry) #1

Hi there,

Hoping for some guidance on how to optimise the following query. It's the slowest-running query in our app. It runs against some very large indices, one in the region of 1.5B docs. More often than not it exceeds 30 seconds, so the Ruby app running it times out.

The intended result is what we call 'Headline Metrics': a current-period vs previous-period comparison, split by vendor, for any number of isrcs (audio track identifiers). See image below:

The query itself is as follows. I'm pretty certain the repetition in the terms filters can be avoided, but I'd really appreciate some general help here.

{
	"query": {
		"bool": {
			"must": [{
				"terms": {
					"vendor": ["amazon", "itunes", "spotify", "google"]
				}
			}, {
				"terms": {
					"isrc": ["GBCEL1200424", "GBCEL1200425", "GBCEL1200310", "GBCEL1200426", "GBCEL1200427"]
				}
			}]
		}
	},
	"size": 0,
	"aggs": {
		"headline_metrics": {
			"filters": {
				"filters": {
					"streams": {
						"bool": {
							"must": [{
								"range": {
									"date": {
										"gte": "2016-06-05",
										"lte": "2016-12-06"
									}
								}
							}, {
								"terms": {
									"vendor": ["amazon", "itunes", "spotify", "google"]
								}
							}, {
								"terms": {
									"isrc": ["GBCEL1200424", "GBCEL1200425", "GBCEL1200310", "GBCEL1200426", "GBCEL1200427"]
								}
							}]
						}
					}
				}
			},
			"aggs": {
				"data_split": {
					"terms": {
						"field": "isrc"
					},
					"aggs": {
						"total_streams": {
							"sum": {
								"field": "units"
							}
						},
						"date_periods": {
							"date_range": {
								"field": "date",
								"format": "yyyy-MM-dd",
								"ranges": [{
									"from": "2016-09-05",
									"to": "2016-12-06"
								}, {
									"from": "2016-06-05",
									"to": "2016-09-05"
								}]
							},
							"aggs": {
								"unit_sum": {
									"sum": {
										"field": "units"
									}
								},
								"vendor_split": {
									"terms": {
										"field": "vendor",
										"size": 5,
										"order": {
											"one": "desc"
										}
									},
									"aggs": {
										"one": {
											"sum": {
												"field": "units"
											}
										}
									}
								}
							}
						}
					}
				}
			}
		}
	}
}

Thanks in advance!


(Alexander Reelsen) #2

Hey,

have you left out the aggs part and compared execution speeds? Is the aggregation the culprit? Can you remove the nested aggregations step by step and pinpoint this to one specific aggregation?

How many documents and shards are hit?

Are you reusing the same values for aggregations over and over again, so that shard request caching makes sense?

--Alex
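
To make that step-by-step comparison concrete, here is a minimal probe body (my sketch, not something posted in the thread): the original filters with size: 0 and the aggs removed, so the took value in the response isolates the cost of query matching alone.

```json
{
	"size": 0,
	"query": {
		"bool": {
			"must": [{
				"terms": {
					"vendor": ["amazon", "itunes", "spotify", "google"]
				}
			}, {
				"terms": {
					"isrc": ["GBCEL1200424", "GBCEL1200425", "GBCEL1200310", "GBCEL1200426", "GBCEL1200427"]
				}
			}]
		}
	}
}
```

Compare its took against the full request, then re-add one aggregation level at a time until the timing jumps.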


(Ryan Berry) #3

Hey,

That's a good call about taking out the aggs and step by step comparing speeds. I will try that now.

Documents hit will of course vary depending on date range queried, but one six month example I just ran hits approx 140k docs over 5 shards.

Yes, values will be re-used a lot, but the user has flexibility over date range, vendors and isrcs so it's not like it's 100% always going to be the same. I can see caching is working though as subsequent re-runs of the same query are almost instant. It's that initial query that's slow.

Since posting the query above, I've modified it to the following: I've removed the repeated terms filters and moved the overarching date range up into the initial query. This has sped things up a little, but I'd like to get it going even faster :slight_smile:

{
	"query": {
		"bool": {
			"must": [{
				"range": {
					"date": {
						"gte": "2016-06-05",
						"lte": "2016-12-06"
					}
				}
			}, {
				"terms": {
					"vendor": ["amazon", "itunes", "spotify", "google"]
				}
			}, {
				"terms": {
					"isrc": ["GBCEL1200424", "GBCEL1200425", "GBCEL1200310", "GBCEL1200426", "GBCEL1200427"]
				}
			}]
		}
	},
	"size": 0,
	"aggs": {
		"headline_metrics": {
			"filters": {
				"filters": {
					"streams": {
						"bool": {
							"must": []
						}
					}
				}
			},
			"aggs": {
				"data_split": {
					"terms": {
						"field": "isrc"
					},
					"aggs": {
						"total_streams": {
							"sum": {
								"field": "units"
							}
						},
						"date_periods": {
							"date_range": {
								"field": "date",
								"format": "yyyy-MM-dd",
								"ranges": [{
									"from": "2016-09-05",
									"to": "2016-12-06"
								}, {
									"from": "2016-06-05",
									"to": "2016-09-05"
								}]
							},
							"aggs": {
								"unit_sum": {
									"sum": {
										"field": "units"
									}
								},
								"vendor_split": {
									"terms": {
										"field": "vendor",
										"size": 5,
										"order": {
											"one": "desc"
										}
									},
									"aggs": {
										"one": {
											"sum": {
												"field": "units"
											}
										}
									}
								}
							}
						}
					}
				}
			}
		}
	}
}
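
One way to avoid the date_range sub-aggregation entirely (a sketch of mine, not something suggested in the thread, so whether it's actually faster needs measuring) is to make the current/previous split the job of the now-empty top-level filters aggregation, giving each period its own bucket with its own range filter:

```json
"aggs": {
	"headline_metrics": {
		"filters": {
			"filters": {
				"current": { "range": { "date": { "gte": "2016-09-05", "lt": "2016-12-06" } } },
				"previous": { "range": { "date": { "gte": "2016-06-05", "lt": "2016-09-05" } } }
			}
		},
		"aggs": {
			"data_split": {
				"terms": { "field": "isrc" },
				"aggs": {
					"unit_sum": { "sum": { "field": "units" } },
					"vendor_split": {
						"terms": { "field": "vendor", "size": 5 },
						"aggs": { "one": { "sum": { "field": "units" } } }
					}
				}
			}
		}
	}
}
```

This computes the same per-period buckets (range "to"/"lt" being exclusive in both forms), just with the period split done once at the top rather than inside every isrc bucket.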

(Ryan Berry) #4

@spinscale to update you: I just tried removing the aggs as you suggested and it's clear they are slowing things down. Avg. 2,000 ms without aggs, 20,000 ms with!


(Alexander Reelsen) #5

Hey,

I completely forgot, but Elasticsearch 5.0 has support for aggregations in the profile API.

--Alex


(Ryan Berry) #6

Oh cool, that will be a big help! I've only just upgraded to Elastic 5.

Adding profile: true to the query actually produces an error relating to the aggs. The weird thing is that it runs fine without it!?

  "error": {
    "root_cause": [
      {
        "type": "aggregation_execution_exception",
        "reason": "Invalid terms aggregation order path [one]. Terms buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end."
      }
    ],
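
The error points at the order clause on vendor_split. Ordering a terms aggregation by a single metric sub-aggregation like the one sum is normally legal, so one workaround while profiling (my assumption, not confirmed in the thread) is to drop the custom order temporarily and keep everything else intact:

```json
"vendor_split": {
	"terms": {
		"field": "vendor",
		"size": 5
	},
	"aggs": {
		"one": {
			"sum": {
				"field": "units"
			}
		}
	}
}
```

That should let "profile": true run so the per-aggregation timings can be inspected, at the cost of the vendor buckets coming back in default (doc count) order.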

(Ryan Berry) #7

@spinscale just a thought: this query only really uses three fields/attributes from the documents, so I wonder if something analogous to a SQL select date, vendor, isrc rather than the select * it's effectively doing would lighten the load here?

FYI, I picked a much bigger example, 100M hits, and it's still very slow :frowning:

{
  "took": 71211,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 100064664,
    "max_score": 0,
    "hits": []
  },
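
One mapping-level option worth noting here (my addition, not raised in the thread): terms aggregations on keyword fields build global ordinals lazily on the first request after a refresh, which fits the observation earlier that the initial query is slow while cached re-runs are almost instant. Elasticsearch 5.x can shift that cost to refresh time instead, assuming vendor and isrc are keyword fields:

```json
{
	"properties": {
		"vendor": { "type": "keyword", "eager_global_ordinals": true },
		"isrc":   { "type": "keyword", "eager_global_ordinals": true }
	}
}
```

The trade-off is slower refreshes for faster first searches, so it suits indices that are searched far more often than they are written to.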

(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.