We have an Elasticsearch 7.4.2 cluster with Kibana, and we keep getting the following messages in our Elasticsearch log files.
{
"type": "server",
"timestamp": "2020-04-14T00:00:10,274Z",
"level": "DEBUG",
"component": "o.e.a.s.TransportSearchAction",
"cluster.name": "OUR_CLUSTER_NAME",
"node.name": "es-coordinating-node1",
"message": "[.kibana_task_manager_2][0], node[Cg2AmralQPCN9Szrg07bIw], [R], s[STARTED], a[id=zxo_7v7yRVmKYrdF69zgpw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[.kibana_task_manager], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":{\"value\":\"task\",\"boost\":1.0}}},{\"bool\":{\"filter\":[{\"term\":{\"_id\":{\"value\":\"task:Maps-maps_telemetry\",\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"sort\":[{\"task.runAt\":{\"order\":\"asc\"}},{\"_id\":{\"order\":\"desc\"}}]}}]",
"cluster.uuid": "BYsSqcOITleJZYD1yOhbGw",
"node.id": "YCCsp7goTYeeQuFerwmaJA",
"stacktrace": [
"org.elasticsearch.transport.RemoteTransportException: [es-data-node1][10.0.1.170:9300][indices:data/read/search[phase/query]]",
"Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: Query Failed [Failed to execute main query]",
"at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:305) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:335) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:355) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$7(SearchService.java:1043) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.2.jar:7.4.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: org.elasticsearch.ElasticsearchException: java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12819847882/11.9gb], which is larger than the limit of [11139022848/10.3gb]]",
...
Trace omitted for post character limit
]
}
{
"type": "server",
"timestamp": "2020-04-14T00:00:10,359Z",
"level": "WARN",
"component": "r.suppressed",
"cluster.name": "OUR_CLUSTER_NAME",
"node.name": "es-coordinating-node1",
"message": "path: /.kibana_task_manager/_search, params: {ignore_unavailable=true, index=.kibana_task_manager}",
"cluster.uuid": "BYsSqcOITleJZYD1yOhbGw",
"node.id": "YCCsp7goTYeeQuFerwmaJA",
"stacktrace": [
"org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:314) [elasticsearch-7.4.2.jar:7.4.2]",
...
Trace omitted for post character limit
]
}
{
"type": "server",
"timestamp": "2020-04-14T00:00:10,357Z",
"level": "DEBUG",
"component": "o.e.a.s.TransportSearchAction",
"cluster.name": "OUR_CLUSTER_NAME",
"node.name": "es-coordinating-node1",
"message": "All shards failed for phase: [query]",
"cluster.uuid": "BYsSqcOITleJZYD1yOhbGw",
"node.id": "YCCsp7goTYeeQuFerwmaJA",
"stacktrace": [
"org.elasticsearch.common.breaker.CircuitBreakingException: [fielddata] Data too large, data for [_id] would be [12819847882/11.9gb], which is larger than the limit of [11139022848/10.3gb]",
"at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.circuitBreak(ChildMemoryCircuitBreaker.java:98) ~[elasticsearch-7.4.2.jar:7.4.2]",
...
Trace omitted for post character limit
]
}
And these messages are from our Kibana logs:
{"type":"log","@timestamp":"2020-04-28T13:13:36Z","tags":["warning","stats-collection"],"pid":25382,"message":"Unable to fetch data from maps collector"}
{
"type": "error",
"@timestamp": "2020-04-28T13:13:36Z",
"tags": [
"warning",
"stats-collection"
],
"pid": 25382,
"level": "error",
"error": {
"message": "[circuit_breaking_exception] [fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb], with { bytes_wanted=11140489384 & bytes_limit=11139022848 & durability=\"PERMANENT\" }",
"name": "Error",
"stack": "[circuit_breaking_exception] [fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb], with { bytes_wanted=11140489384 & bytes_limit=11139022848 & durability=\"PERMANENT\" } :: {\"path\":\"/.kibana_task_manager/_search\",\"query\":{\"ignore_unavailable\":true},\"body\":\"{\\\"sort\\\":[{\\\"task.runAt\\\":\\\"asc\\\"},{\\\"_id\\\":\\\"desc\\\"}],\\\"query\\\":{\\\"bool\\\":{\\\"must\\\":[{\\\"term\\\":{\\\"type\\\":\\\"task\\\"}},{\\\"bool\\\":{\\\"filter\\\":{\\\"term\\\":{\\\"_id\\\":\\\"task:oss_telemetry-vis_telemetry\\\"}}}}]}}}\",\"statusCode\":500,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb]\\\",\\\"bytes_wanted\\\":11140489384,\\\"bytes_limit\\\":11139022848,\\\"durability\\\":\\\"PERMANENT\\\"}],\\\"type\\\":\\\"search_phase_execution_exception\\\",\\\"reason\\\":\\\"all shards failed\\\",\\\"phase\\\":\\\"query\\\",\\\"grouped\\\":true,\\\"failed_shards\\\":[{\\\"shard\\\":0,\\\"index\\\":\\\".kibana_task_manager_2\\\",\\\"node\\\":\\\"ujE6xO4IRfWNNkR4fUF6Wg\\\",\\\"reason\\\":{\\\"type\\\":\\\"exception\\\",\\\"reason\\\":\\\"java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb]]\\\",\\\"caused_by\\\":{\\\"type\\\":\\\"execution_exception\\\",\\\"reason\\\":\\\"execution_exception: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb]]\\\",\\\"caused_by\\\":{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb]\\\",\\\"bytes_wanted\\\":11140489384,\\\"bytes_limit\\\":11139022848,\\\"durability\\\":\\\"PERMANENT\\\"}}}}],\\\"caused_by\\\":{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb]\\\",\\\"bytes_wanted\\\":11140489384,\\\"bytes_limit\\\":11139022848,\\\"durability\\\":\\\"PERMANENT\\\"}},\\\"status\\\":500}\"}\n at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:349:15)\n at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:306:7)\n at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:173:7)\n at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4929:19)\n at IncomingMessage.emit (events.js:194:15)\n at endReadableNT (_stream_readable.js:1103:12)\n at process._tickCallback (internal/process/next_tick.js:63:19)"
},
"message": "[circuit_breaking_exception] [fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb], with { bytes_wanted=11140489384 & bytes_limit=11139022848 & durability=\"PERMANENT\" }"
}
The messages started appearing on April 14th (we only noticed them recently because the node was running low on disk space), but we didn't take any special action on the 14th that could have caused the error. The cluster ingests data continuously, and all of those processes have been running smoothly. The cluster is also healthy: everything reports green.
Stopping Kibana stopped the errors from being logged in our Elasticsearch logs (and in the Kibana logs, naturally). Restarting Kibana freed up a lot of disk space, but didn't fix the errors.
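In case it helps with the diagnosis, these are the checks I'm assuming would show where the [_id] fielddata is accumulating (the standard nodes stats and cat APIs; I haven't pasted our output here):
GET _nodes/stats/breaker
GET _cat/fielddata?v&fields=_id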
From what I can read in the logs, "something" (an internal ES process, I think? The log component is o.e.a.s.TransportSearchAction) queries .kibana_task_manager with a query like
GET /.kibana_task_manager/_search
{
"sort": [
{
"task.runAt": "asc"
},
{
"_id": "desc"
}
],
"query": {
"bool": {
"must": [
{
"term": {
"type": "task"
}
},
{
"bool": {
"filter": {
"term": {
"_id": "task:oss_telemetry-vis_telemetry"
}
}
}
}
]
}
}
}
to find "task:oss_telemetry-vis_telemetry", and then this query causes a CircuitBreakerException. (with the following reasons, when I run the query manually:
either
java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12674571573/11.8gb], which is larger than the limit of [11139022848/10.3gb]]
or
java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [11140489384/10.3gb], which is larger than the limit of [11139022848/10.3gb]]
). The exception goes away when I run the query without the sort on "_id", but I have no idea why that would cause an issue - .kibana_task_manager only has 2 small documents. And besides that, we aren't the ones who set off the query, so we don't know of a way to edit it.
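For reference, this is the variant I run manually that completes without tripping the breaker (the same request, just without the sort on _id):
GET /.kibana_task_manager/_search
{
  "sort": [
    { "task.runAt": "asc" }
  ],
  "query": {
    "bool": {
      "must": [
        { "term": { "type": "task" } },
        {
          "bool": {
            "filter": {
              "term": { "_id": "task:oss_telemetry-vis_telemetry" }
            }
          }
        }
      ]
    }
  }
}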
My general question is: how do I stop these errors from occurring? I figure we could raise the circuit breaker limit, but that seems like the wrong way to go about it: this query should not come anywhere near the limit, and those limits exist for good reasons. The more specific questions that I think would help answer it are: why does the sort on _id trigger a CircuitBreakingException, and what is issuing the query?
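If we did end up raising the limit as a stopgap (which, again, feels like the wrong fix), I assume it would be a dynamic cluster settings change along these lines; we have not applied this:
PUT _cluster/settings
{
  "transient": {
    "indices.breaker.fielddata.limit": "60%"
  }
}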