Elasticsearch aggregation encounters circuit_breaking_exception

I am using Elasticsearch 6.1.1 and creating a single shard per index. The cluster settings are as follows:

    {
        "persistent": {
            "cluster": {
                "routing": {
                    "allocation": {
                        "enable": "all"
                    }
                }
            },
            "indices": {
                "breaker": {
                    "request": {
                        "limit": "80%"
                    }
                }
            }
        },
        "transient": {}
    }

System RAM : 16 GB
JVM heap size : 4 GB
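
For reference, these are persistent settings, so they were applied through the cluster settings API along these lines:

    PUT _cluster/settings
    {
      "persistent": {
        "indices.breaker.request.limit": "80%"
      }
    }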

The reason for using one shard per index is to avoid the result approximation Elasticsearch performs in certain cases (such as terms aggregation counts merged across multiple shards).
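
Each index is accordingly created with a single primary shard, along these lines (a sketch; mappings omitted):

    PUT atcc_summary_201707_5
    {
      "settings": {
        "index": {
          "number_of_shards": 1
        }
      }
    }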

I have an index 'atcc_summary_201707_5' containing nearly 0.14 million documents (with some nested fields), about 250 MB in size. I am trying to run an aggregation query over a subset of those documents. The query involves nested bucketing (up to 3 levels) and some metric aggregations. Every time I run the query, it throws a circuit_breaking_exception with the following reason:

"[parent] Data too large, data for [<agg [count_2wh]>] would be [2982073327/2.7gb], which is larger than the limit of [2982071500/2.7gb]"

I'm stuck here, as the whole point of using Elasticsearch was to be able to query a huge data set quickly. Please shed some light on why it is consuming so much memory. Is it because of having one shard per index? Please advise how to get around this.

Can you share the JSON for the query and some indication of the number of unique values matched in the fields targeted by the aggs?

Please check the query below. Please bear with me, as it is quite large.

    {
      "query": {
        "constant_score": {
          "filter": {
            "bool": {
              "must": [
                {
                  "range": {
                    "interval_start_time": {
                      "lt": "2017-07-09T18:30:00+00:00",
                      "gte": "2017-07-10T18:30:00+00:00"
                    }
                  }
                }
              ]
            }
          }
        }
      },
      "aggs": {
        "group_by_time": {
          "date_histogram": {
            "field": "interval_start_time",
            "interval": "10m"
          },
          "aggs": {
            "Junctions": {
              "terms": {
                "field": "junction_no",
                "size": 1000
              },
              "aggs": {
                "arms": {
                  "terms": {
                    "field": "arm_no",
                    "size": 1000
                  },
                  "aggs": {
                    "speed_oth": {
                      "avg": {
                        "field": "speed.oth"
                      }
                    },
                    "vehicle_length_2wh": {
                      "avg": {
                        "field": "vehicle_length.2wh"
                      }
                    },
                    "speed_total": {
                      "avg": {
                        "field": "speed.total"
                      }
                    },
                    "count_car": {
                      "sum": {
                        "field": "count.car"
                      }
                    },
                    "occupancy_total": {
                      "avg": {
                        "field": "occupancy.total"
                      }
                    },
                    "occupancy_2wh": {
                      "avg": {
                        "field": "occupancy.2wh"
                      }
                    },
                    "density_total": {
                      "avg": {
                        "field": "density.total"
                      }
                    },
                    "occupancy_car": {
                      "avg": {
                        "field": "occupancy.car"
                      }
                    },
                    "vehicle_length_car": {
                      "avg": {
                        "field": "vehicle_length.car"
                      }
                    },
                    "headway_3wh": {
                      "avg": {
                        "field": "headway.3wh"
                      }
                    },
                    "density_bus": {
                      "avg": {
                        "field": "density.bus"
                      }
                    },
                    "vehicle_length_oth": {
                      "avg": {
                        "field": "vehicle_length.oth"
                      }
                    },
                    "speed_bus": {
                      "avg": {
                        "field": "speed.bus"
                      }
                    },
                    "headway_car": {
                      "avg": {
                        "field": "headway.car"
                      }
                    },
                    "count_bus": {
                      "sum": {
                        "field": "count.bus"
                      }
                    },
                    "occupancy_3wh": {
                      "avg": {
                        "field": "occupancy.3wh"
                      }
                    },
                    "vehicle_length_bus": {
                      "avg": {
                        "field": "vehicle_length.bus"
                      }
                    },
                    "avg_queue": {
                      "bucket_script": {
                        "buckets_path": {
                          "totQueue": "total_queue_length",
                          "totObs": "queue_length_obs_count"
                        },
                        "script": "params.totQueue / params.totObs"
                      }
                    },
                    "density_2wh": {
                      "avg": {
                        "field": "density.2wh"
                      }
                    },
                    "speed_3wh": {
                      "avg": {
                        "field": "speed.3wh"
                      }
                    },
                    "total_queue_length": {
                      "sum": {
                        "script": "doc['queue_length.obs_count'].value * doc['queue_length.avg_length'].value"
                      }
                    },
                    "density_3wh": {
                      "avg": {
                        "field": "density.3wh"
                      }
                    },
                    "queue_length_obs_count": {
                      "sum": {
                        "field": "queue_length.obs_count"
                      }
                    },
                    "speed_2wh": {
                      "avg": {
                        "field": "speed.2wh"
                      }
                    },
                    "count_3wh": {
                      "sum": {
                        "field": "count.3wh"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      },
      "size": 0
    }

Nested buckets are created on the basis of the three fields below:

  • interval_start_time
    • junction_no (junction number)
      • arm_no (arm number)

'interval_start_time' is a datetime field and buckets are created for every 10-minute interval.
'junction_no' is an integer field and there can be at most 45 unique junction numbers.
'arm_no' is an integer field and there can be at most 4 unique arm numbers for every junction number.
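
For a one-day query window at 10-minute intervals, that works out to roughly 144 × 45 × 4 ≈ 26,000 leaf buckets.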

Thanks. Oddly, the query is missing the agg named in the original error message ("count_2wh").

This can't be the query that is causing the error?

Yeah... you are correct. I was not able to copy the entire query as it has more than 8,500 characters.

So I deleted some of the aggregations. Extremely sorry for this. Can you please have a look at this link: https://www.dropbox.com/s/iakio1o7z36rp1q/es_query.json?dl=0

No problem - thanks for sharing more details.
This is a little puzzling given the numbers you've quoted.
Can you supply the stack trace that accompanied the error message you posted?

Let's strip back to the simplest agg tree to help confirm these numbers are accurate.

You can do this using the simplified query below:

    {
	"query": {
		"constant_score": {
			"filter": {
				"bool": {
					"must": [{
						"range": {
							"interval_start_time": {
								"lt": "2017-07-09T18:30:00+00:00",
								"gte": "2017-07-10T18:30:00+00:00"
							}
						}
					}]
				}
			}
		}
	},
	"aggs": {
		"group_by_time": {
			"date_histogram": {
				"field": "interval_start_time",
				"interval": "10m"
			},
			"aggs": {
				"Junctions": {
					"terms": {
						"field": "junction_no",
						"size": 1000
					},
					"aggs": {
						"arms": {
							"terms": {
								"field": "arm_no",
								"size": 1000
							}
						}
					}
				}
			}
		}
	},
	"size": 0
    }

Can you verify the results for this query are as expected here (max 45 junctions per time slot and max 4 arm numbers per junction)?

The figures I gave you relate to the production environment. The figures I am using on my local system for testing are:

unique junction numbers: 90
unique arm numbers per junction: 2

But the total number of leaf buckets per time slot should be the same as the expected production figures, since (45 × 4) == (90 × 2) == 180.

Please find the output here https://www.dropbox.com/s/yu3r4md77uqkvxl/es_query_output.json?dl=0

Please find the full exception stack trace here https://www.dropbox.com/s/fzkxo6cvkmtgyy6/es_query_exception.txt?dl=0

Thanks for this.
From the stack trace it looks like Elasticsearch has chosen a mode of execution which may not be the best one. The assumption it has made is that there are many unique values in the data, not all of which will be required in the search results, and so it has opted for a breadth_first approach to growing elements in the aggregation tree. In your case there are only a modest number of unique values, all of which are required in the results.

This mistaken assumption means that the collection logic is incurring extra time and memory running this query. You can override this behaviour by setting the collect_mode parameter on the junction and arm terms aggs to depth_first (see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_collect_mode )
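
For example, here is a minimal sketch of the two terms aggs from the simplified query above with the collect mode overridden (the rest of the request stays unchanged):

    "Junctions": {
      "terms": {
        "field": "junction_no",
        "size": 1000,
        "collect_mode": "depth_first"
      },
      "aggs": {
        "arms": {
          "terms": {
            "field": "arm_no",
            "size": 1000,
            "collect_mode": "depth_first"
          }
        }
      }
    }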

Can you give this a try and see if this improves the memory usage?

Thanks for the suggestion, but no luck.

Stack trace - https://www.dropbox.com/s/nijc27j83pf3qvz/es_query_exception1.txt?dl=0

Looks like the same stack trace as the previous post? Same timestamps, etc.

Yes... sorry for that. Please check this.

OK - this error looks like bad estimation on our part when it comes to the cost of calculating averages.
We assume each avg result costs 5 KB of memory, whereas the real cost is much less. The over-estimation is normally non-fatal, but your query has amplified the issue because of the following (a rough estimate follows the list):

  1. The large number of "parent" aggregation buckets in the aggregation tree
  2. The large number of avg aggregations on different fields in each "leaf" bucket.
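
To put rough numbers on that (a back-of-the-envelope estimate, assuming the one-day query window of roughly 144 ten-minute buckets, the 180 junction/arm combinations you described, and a couple of dozen metric aggs per leaf bucket): 144 × 180 × 24 × 5 KB ≈ 3 GB, which is roughly in line with the ~2.7 GB figure the circuit breaker reported.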

I've opened an issue to deal with this, and for the moment the only route open to you is to limit the number of avg aggregations or parent-level buckets in any one request.
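
For example, one way to cap the number of parent buckets (a sketch; the one-hour window and the two metrics shown are just illustrative, and the partial results would need to be stitched back together on the client side) is to run the same aggregation tree over a series of shorter time windows:

    {
      "query": {
        "constant_score": {
          "filter": {
            "range": {
              "interval_start_time": {
                "gte": "2017-07-09T18:30:00+00:00",
                "lt": "2017-07-09T19:30:00+00:00"
              }
            }
          }
        }
      },
      "aggs": {
        "group_by_time": {
          "date_histogram": {
            "field": "interval_start_time",
            "interval": "10m"
          },
          "aggs": {
            "Junctions": {
              "terms": {
                "field": "junction_no",
                "size": 1000
              },
              "aggs": {
                "arms": {
                  "terms": {
                    "field": "arm_no",
                    "size": 1000
                  },
                  "aggs": {
                    "speed_total": { "avg": { "field": "speed.total" } },
                    "count_car": { "sum": { "field": "count.car" } }
                  }
                }
              }
            }
          }
        }
      },
      "size": 0
    }

Alternatively, the full time range can be kept and the metric aggs split across several requests instead.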

@Mark_Harwood Thanks a lot for your help. I hope it will be fixed soon. I will try to figure something out for now.

Having examined your request a little more, it is perhaps advantageous that the over-zealous size estimator caught the query at this early stage - the final response is likely to be in the realm of half a gigabyte of data, which is not ideal.

This would be a good example for using the new composite aggregation and the after parameter to page through results.
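
Something along these lines (a rough sketch; the agg name, page size, and the two metrics shown are illustrative, and the real request would carry the full set of metric aggs):

    {
      "size": 0,
      "aggs": {
        "junction_arm_time": {
          "composite": {
            "size": 500,
            "sources": [
              { "time": { "date_histogram": { "field": "interval_start_time", "interval": "10m" } } },
              { "junction": { "terms": { "field": "junction_no" } } },
              { "arm": { "terms": { "field": "arm_no" } } }
            ]
          },
          "aggs": {
            "speed_total": { "avg": { "field": "speed.total" } },
            "count_car": { "sum": { "field": "count.car" } }
          }
        }
      }
    }

To fetch the next page, copy the key of the last bucket in the response into an "after" object inside the composite section and re-issue the request.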

