I have a 2-node Elasticsearch cluster (v2.3.1; I also tested against v2.3.4 and against a single-node instance, with the same behaviour in each case).
Each node is on its own AWS instance (m3.2xlarge). ES_HEAP_SIZE is set to 15g (on a machine with 30 GB of RAM).
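For reference, the heap is set via the ES_HEAP_SIZE environment variable; on our nodes it looks roughly like this (the exact file path is an assumption based on a .deb-style install):

# /etc/default/elasticsearch (illustrative; on RPM installs this would be /etc/sysconfig/elasticsearch)
ES_HEAP_SIZE=15g   # roughly half of the 30 GB of RAM on the m3.2xlarge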
We tried creating a Kibana visualization, which results in this query:
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "MyDateField": {
                  "gte": 1314114051459,
                  "lte": 1471966851459,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "MyStringField",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "MyOtherStringField",
            "size": 50,
            "order": {
              "_count": "desc"
            }
          }
        }
      }
    }
  }
}
The visualization times out after 30 seconds, and the ES cluster becomes unresponsive.
The Elasticsearch logs contain nothing except jvm.monitor warning messages, which
suggest that the system is under heavy load.
Even after a good while, the system remains unresponsive. Trying to shut it down with:
sudo service elasticsearch stop
doesn't work. In the end we have to kill the Elasticsearch process on each node in turn and start it up again manually.
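Roughly, on each node we end up doing something like this (illustrative; the pgrep pattern matches the 2.x bootstrap class, and we use -9 because a normal stop does not work):

sudo kill -9 $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
sudo service elasticsearch start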
We can reproduce the problem with direct queries.
This query works:
{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "MyStringField",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}
This query kills the cluster (note the nested aggregation):
{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "MyStringField",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "MyOtherStringField",
            "size": 50,
            "order": {
              "_count": "desc"
            }
          }
        }
      }
    }
  }
}
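One thing we have wondered about, but not yet tested, is whether switching the outer terms aggregation to breadth-first collection would keep the bucket tree small enough to survive. An untested sketch (collect_mode is a standard terms-aggregation option):

{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "MyStringField",
        "size": 10,
        "order": { "_count": "desc" },
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "MyOtherStringField",
            "size": 50,
            "order": { "_count": "desc" }
          }
        }
      }
    }
  }
}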
My questions are:
- Why is this happening?
- If the query is simply too heavy, are there no circuit breakers for this sort of thing? (See the breaker sketch after these questions for the knobs we believe are relevant.)
- How can I stop this from happening and improve the reliability of our cluster?
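For context, we have not tuned any breaker settings. As far as we understand it, the relevant knobs on 2.x are the fielddata and request breakers, which can be inspected and adjusted roughly like this (values are illustrative, not what we actually run):

# inspect current breaker usage and limits on each node
curl -s 'localhost:9200/_nodes/stats/breaker?pretty'

# example of lowering the request breaker so the query trips it instead of exhausting the heap
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "indices.breaker.request.limit": "20%"
  }
}'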
Please note: We cannot upgrade to v2.3.5 because we need to use the knapsack plugin, which has yet to deliver a version for v2.3.5.
I would appreciate any help that anyone can offer. If you need any more information, please let me know and I'll happily provide it.
Thanks!