Elasticsearch aggregation encounters circuit_breaking_exception

sovanghosh7 · January 14, 2018, 8:46pm

I am using Elasticsearch 6.1.1 and creating a single shard per index. The cluster settings are as follows:

    {
        "persistent": {
            "cluster": {
                "routing": {
                    "allocation": {
                        "enable": "all"
                    }
                }
            },
            "indices": {
                "breaker": {
                    "request": {
                        "limit": "80%"
                    }
                }
            }
        },
        "transient": {}
    }

System RAM : 16 GB
JVM heap size : 4 GB

The reason for using one shard per index is to avoid the approximation that Elasticsearch applies in some cases when combining results from multiple shards.

I have an index 'atcc_summary_201707_5' containing nearly 0.14 million documents (with some nested fields), about 250 MB in size. I am trying to run an aggregation query over a subset of those documents. The query involves nested bucketing (up to 3 levels) and some metric aggregations. Every time I run the query, it throws a circuit_breaking_exception with the following reason:

"[parent] Data too large, data for [<agg [count_2wh]>] would be [2982073327/2.7gb], which is larger than the limit of [2982071500/2.7gb]"

I'm stuck here, as the whole point of using Elasticsearch was to be able to query a huge data set quickly. Please shed some light on why it is consuming so much memory. Is it because of having only one shard per index? Please advise how to get around this.

Mark_Harwood · January 14, 2018, 10:49pm

Can you share the JSON for the query and some indication of number of unique values matched in the fields targeted by aggs?

sovanghosh7 · January 15, 2018, 9:12am

Please check the query below. Please bear with it as it is quite large.

    {
      "query": {
        "constant_score": {
          "filter": {
            "bool": {
              "must": [
                {
                  "range": {
                    "interval_start_time": {
                      "gte": "2017-07-09T18:30:00+00:00",
                      "lt": "2017-07-10T18:30:00+00:00"
                    }
                  }
                }
              ]
            }
          }
        }
      },
      "aggs": {
        "group_by_time": {
          "date_histogram": {
            "field": "interval_start_time",
            "interval": "10m"
          },
          "aggs": {
            "Junctions": {
              "terms": {
                "field": "junction_no",
                "size": 1000
              },
              "aggs": {
                "arms": {
                  "terms": {
                    "field": "arm_no",
                    "size": 1000
                  },
                  "aggs": {
                    "speed_oth": {
                      "avg": {
                        "field": "speed.oth"
                      }
                    },
                    "vehicle_length_2wh": {
                      "avg": {
                        "field": "vehicle_length.2wh"
                      }
                    },
                    "speed_total": {
                      "avg": {
                        "field": "speed.total"
                      }
                    },
                    "count_car": {
                      "sum": {
                        "field": "count.car"
                      }
                    },
                    "occupancy_total": {
                      "avg": {
                        "field": "occupancy.total"
                      }
                    },
                    "occupancy_2wh": {
                      "avg": {
                        "field": "occupancy.2wh"
                      }
                    },
                    "density_total": {
                      "avg": {
                        "field": "density.total"
                      }
                    },
                    "occupancy_car": {
                      "avg": {
                        "field": "occupancy.car"
                      }
                    },
                    "vehicle_length_car": {
                      "avg": {
                        "field": "vehicle_length.car"
                      }
                    },
                    "headway_3wh": {
                      "avg": {
                        "field": "headway.3wh"
                      }
                    },
                    "density_bus": {
                      "avg": {
                        "field": "density.bus"
                      }
                    },
                    "vehicle_length_oth": {
                      "avg": {
                        "field": "vehicle_length.oth"
                      }
                    },
                    "speed_bus": {
                      "avg": {
                        "field": "speed.bus"
                      }
                    },
                    "headway_car": {
                      "avg": {
                        "field": "headway.car"
                      }
                    },
                    "count_bus": {
                      "sum": {
                        "field": "count.bus"
                      }
                    },
                    "occupancy_3wh": {
                      "avg": {
                        "field": "occupancy.3wh"
                      }
                    },
                    "vehicle_length_bus": {
                      "avg": {
                        "field": "vehicle_length.bus"
                      }
                    },
                    "avg_queue": {
                      "bucket_script": {
                        "buckets_path": {
                          "totQueue": "total_queue_length",
                          "totObs": "queue_length_obs_count"
                        },
                        "script": "params.totQueue / params.totObs"
                      }
                    },
                    "density_2wh": {
                      "avg": {
                        "field": "density.2wh"
                      }
                    },
                    "speed_3wh": {
                      "avg": {
                        "field": "speed.3wh"
                      }
                    },
                    "total_queue_length": {
                      "sum": {
                        "script": "doc['queue_length.obs_count'].value * doc['queue_length.avg_length'].value"
                      }
                    },
                    "density_3wh": {
                      "avg": {
                        "field": "density.3wh"
                      }
                    },
                    "queue_length_obs_count": {
                      "sum": {
                        "field": "queue_length.obs_count"
                      }
                    },
                    "speed_2wh": {
                      "avg": {
                        "field": "speed.2wh"
                      }
                    },
                    "count_3wh": {
                      "sum": {
                        "field": "count.3wh"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      },
      "size": 0
    }

The nested buckets are built from the following three fields:

interval_start_time
- junction_no (junction number)
  - arm_no (arm number)

'interval_start_time' is a datetime field and a bucket is created for every 10-minute interval.
'junction_no' is an integer field and there can be at most 45 unique junction numbers.
'arm_no' is an integer field and there can be at most 4 unique arm numbers for every junction number.

Mark_Harwood · January 15, 2018, 9:49am

Thanks. Oddly the query is missing the agg named in the original error message ("count_2wh").

This can't be the query that is causing the error?

sovanghosh7 · January 15, 2018, 10:05am

Yes, you are correct. I was not able to paste the entire query as it has more than 8500 characters.

So I deleted some of the aggregations. Extremely sorry for this. Can you please have a look at this link: https://www.dropbox.com/s/iakio1o7z36rp1q/es_query.json?dl=0

Mark_Harwood · January 15, 2018, 10:19am

No problem - thanks for sharing more details.
This is a little puzzling given the numbers you've quoted.
Can you supply the stack trace that accompanied the error message you posted?

Let's strip back to the simplest agg tree to help us confirm that the numbers you quoted hold. You can do this using this simplified query:

{
	"query": {
		"constant_score": {
			"filter": {
				"bool": {
					"must": [{
						"range": {
							"interval_start_time": {
								"gte": "2017-07-09T18:30:00+00:00",
								"lt": "2017-07-10T18:30:00+00:00"
							}
						}
					}]
				}
			}
		}
	},
	"aggs": {
		"group_by_time": {
			"date_histogram": {
				"field": "interval_start_time",
				"interval": "10m"
			},
			"aggs": {
				"Junctions": {
					"terms": {
						"field": "junction_no",
						"size": 1000
					},
					"aggs": {
						"arms": {
							"terms": {
								"field": "arm_no",
								"size": 1000
							}
						}
					}
				}
			}
		}
	},
	"size": 0
}

Can you verify the results for this query are as expected here (max 45 junctions per time slot and max 4 arm numbers per junction)?
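
One way to check those maxima without reading the whole output by eye is a small script over the response JSON. This is only a sketch: it assumes the response has been loaded into a Python dict and that the aggregation names match the simplified query above (group_by_time, Junctions, arms). The function name and the sample response are made up for illustration.

```python
def max_bucket_counts(response):
    """Return (max junction buckets per time slot, max arm buckets per junction)."""
    max_junctions = 0
    max_arms = 0
    for slot in response["aggregations"]["group_by_time"]["buckets"]:
        junctions = slot["Junctions"]["buckets"]
        max_junctions = max(max_junctions, len(junctions))
        for junction in junctions:
            max_arms = max(max_arms, len(junction["arms"]["buckets"]))
    return max_junctions, max_arms


# Tiny hand-made response, used only to exercise the function (not real data).
sample_response = {
    "aggregations": {
        "group_by_time": {
            "buckets": [
                {
                    "key_as_string": "2017-07-09T18:30:00.000Z",
                    "doc_count": 3,
                    "Junctions": {
                        "buckets": [
                            {"key": 1, "doc_count": 2,
                             "arms": {"buckets": [{"key": 1, "doc_count": 1},
                                                  {"key": 2, "doc_count": 1}]}},
                            {"key": 2, "doc_count": 1,
                             "arms": {"buckets": [{"key": 1, "doc_count": 1}]}},
                        ]
                    },
                }
            ]
        }
    }
}

print(max_bucket_counts(sample_response))
```

For the production figures quoted earlier, the pair returned should not exceed (45, 4).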

sovanghosh7 · January 15, 2018, 11:07am

The figures I gave you are for the production environment. The figures in my local system, which I am using for testing, are:

unique Junction_numbers : 90
unique arm_numbers per junction : 2

But the total number of buckets should be the same as the expected prod figures, as (45 * 4) == (90 * 2).

Please find the output here https://www.dropbox.com/s/yu3r4md77uqkvxl/es_query_output.json?dl=0

Please find the full exception stack trace here https://www.dropbox.com/s/fzkxo6cvkmtgyy6/es_query_exception.txt?dl=0

Mark_Harwood · January 15, 2018, 11:54am

Thanks for this.
From the stack trace it looks like Elasticsearch has chosen a mode of execution which may not be the best one. The assumption it has made is that there are many unique values in the data, not all of which will be required in the search results, and so it has opted for a breadth_first approach to growing elements in the aggregation tree. In your case there are only a modest number of unique values, all of which are required in the results.

This mistaken assumption means that the collection logic is incurring extra time and memory running this query. You can override this behaviour by setting the collect_mode parameter on the junction and arm terms aggs to depth_first (see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_collect_mode).
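
As a minimal sketch of what that override could look like, the Python below builds the bucketing part of the request with collect_mode set on both terms aggs and prints it as JSON. The helper name terms_agg is made up for this example, the metric sub-aggregations and the query clause are left out for brevity, and the field names and sizes are copied from the query posted above.

```python
import json


def terms_agg(field, size, sub_aggs=None, collect_mode="depth_first"):
    """Build a terms aggregation with an explicit collect_mode."""
    agg = {"terms": {"field": field, "size": size, "collect_mode": collect_mode}}
    if sub_aggs:
        agg["aggs"] = sub_aggs
    return agg


aggs = {
    "group_by_time": {
        "date_histogram": {"field": "interval_start_time", "interval": "10m"},
        "aggs": {
            # Metric aggs (avg, sum, ...) would go under the "arms" level.
            "Junctions": terms_agg(
                "junction_no", 1000,
                sub_aggs={"arms": terms_agg("arm_no", 1000)},
            ),
        },
    }
}

print(json.dumps({"size": 0, "aggs": aggs}, indent=2))
```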

Can you give this a try and see if this improves the memory usage?

sovanghosh7 · January 15, 2018, 12:13pm

Thanks for the suggestion, but no luck.

Stack trace - https://www.dropbox.com/s/nijc27j83pf3qvz/es_query_exception1.txt?dl=0

Mark_Harwood · January 15, 2018, 12:18pm

Looks like the same stack trace as in the previous post? Same timestamps etc.

sovanghosh7 · January 15, 2018, 12:19pm

Yes..sorry for that. Please check this

Mark_Harwood · January 15, 2018, 1:28pm

OK - this error looks like bad estimation on our part when it comes to the cost of calculating averages.
We assume each avg result costs 5kb of memory and the real cost is much less. The over-estimation is normally non-fatal but your query has amplified the issue because of:

- The large number of "parent" aggregation buckets in the aggregation tree
- The large number of avg aggregations on different fields in each "leaf" bucket

I've opened an issue to deal with this and for the moment the only route open to you is to limit the number of avg aggregations or parent-level buckets in any one request.
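
As a rough illustration of how the over-estimate adds up, here is a back-of-the-envelope sketch in Python. It makes several assumptions: the query covers one day of 10-minute slots, the production figures quoted earlier apply (45 junctions, 4 arms each), "5kb" means 5 * 1024 bytes charged once per avg in each leaf bucket, and only the 19 avg aggregations from the abbreviated query posted above are counted. The real accounting inside Elasticsearch is more involved, and the function name is made up for this example.

```python
def estimated_avg_bytes(time_slots, junctions, arms_per_junction, avg_aggs,
                        bytes_per_avg=5 * 1024):
    """Estimated memory charged for avg aggs: one charge per avg per leaf bucket."""
    leaf_buckets = time_slots * junctions * arms_per_junction
    return leaf_buckets * avg_aggs * bytes_per_avg


# Assumption: a one-day range split into 10-minute buckets.
time_slots = 24 * 60 // 10

estimate = estimated_avg_bytes(time_slots, junctions=45, arms_per_junction=4,
                               avg_aggs=19)
print(f"{estimate} bytes, about {estimate / 1024 ** 3:.2f} GiB")
```

With these assumed inputs the estimate comes to about 2.35 GiB, which is in the same range as the limit in the error message even though the averages themselves need far less memory.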

sovanghosh7 · January 15, 2018, 1:57pm

@Mark_Harwood Thanks a lot for your help. I hope it will be fixed soon. I will try to figure something out for now.

Mark_Harwood · January 15, 2018, 4:32pm

Having examined your request a little more, it is perhaps advantageous that the over-zealous size estimator caught the query at this early stage: the final response is likely to be in the realms of half a gigabyte of data, which is not ideal.

This would be a good example for using the new composite aggregation and its after parameter to page through results.
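
A minimal sketch of that paging pattern in Python follows. It assumes a caller-supplied search function (hypothetical here) that sends a request body to the index and returns the parsed response; the aggregation name by_time_junction_arm and the source names are made up, and the range query and metric sub-aggregations are omitted. Each page passes the key of the last bucket from the previous page as after. The "interval" key follows the 6.x syntax used in this thread.

```python
def composite_body(after=None, page_size=1000):
    """Build a request body with one composite agg over time, junction and arm."""
    composite = {
        "size": page_size,
        "sources": [
            {"time": {"date_histogram": {"field": "interval_start_time",
                                         "interval": "10m"}}},
            {"junction": {"terms": {"field": "junction_no"}}},
            {"arm": {"terms": {"field": "arm_no"}}},
        ],
    }
    if after is not None:
        composite["after"] = after
    return {"size": 0, "aggs": {"by_time_junction_arm": {"composite": composite}}}


def iter_buckets(search, page_size=1000):
    """Yield every composite bucket, one page at a time."""
    after = None
    while True:
        response = search(composite_body(after, page_size))
        buckets = response["aggregations"]["by_time_junction_arm"]["buckets"]
        if not buckets:
            return
        for bucket in buckets:
            yield bucket
        after = buckets[-1]["key"]  # resume after the last key seen


def fake_search(body):
    """Stand-in for a real cluster call: serves six made-up keys in pages."""
    keys = [{"time": 0, "junction": j, "arm": a} for j in (1, 2) for a in (1, 2, 3)]
    composite = body["aggs"]["by_time_junction_arm"]["composite"]
    start = keys.index(composite["after"]) + 1 if "after" in composite else 0
    page = keys[start:start + composite["size"]]
    return {"aggregations": {"by_time_junction_arm": {
        "buckets": [{"key": k, "doc_count": 1} for k in page]}}}


print(len(list(iter_buckets(fake_search, page_size=4))))
```

Paging this way keeps each response small, at the cost of several round trips instead of one very large reply.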

system · February 12, 2018, 4:33pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
