Hi All,
This morning I got an error in Kibana Discover saying it was unable to fetch the results. I then checked Stack Monitoring and found the error below --
On further investigation, I found that out of the 3 Elasticsearch nodes, one node was down and another was heavily loaded. I restarted the node that was down, but the issue persisted --
After that, I restarted the other two nodes as well, and the issue went away.
Checking the Elasticsearch cluster log further, I found several entries for the `CircuitBreakingException: [parent] Data too large` error, which had been occurring since yesterday.
Please see some snippets from the log below --
[2022-11-29T12:11:34,379][ERROR][o.e.x.s.TransportSubmitAsyncSearchAction] [node-1] failed to store async-search [FkhrOHRHdTFIUldDOHNrNVpPM1J2RHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE2NTE3Nw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/update[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/update[s]] would be [8183624210/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8180977664/7.6gb], new bytes reserved: [2646546/2.5mb], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=2667158/2.5mb, model_inference=0/0b, accounting=10484420/9.9mb]
[2022-11-29T12:11:34,584][ERROR][o.e.x.c.a.AsyncResultsService] [node-1] failed to update expiration time for async-search [FkhrOHRHdTFIUldDOHNrNVpPM1J2RHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE2NTE3Nw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/update[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/update[s]] would be [8239698386/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8239697920/7.6gb], new bytes reserved: [466/466b], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=21544/21kb, model_inference=0/0b, accounting=10494204/10mb]
[2022-11-29T12:11:35,134][ERROR][o.e.x.c.a.DeleteAsyncResultsService] [node-1] failed to clean async result [FkhBVFVEMXZoVHR1MUFUMzdzMzJZVHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE0OTYxMw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/bulk[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/bulk[s]] would be [8357138758/7.7gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8357138432/7.7gb], new bytes reserved: [326/326b], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=20938/20.4kb, model_inference=0/0b, accounting=10494204/10mb]
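Since these are parent breaker trips, I have started keeping an eye on the breaker stats between incidents. Below is only a rough sketch of what I'm running, assuming the HTTP endpoint is reachable at localhost:9200 without TLS or auth (the URL and credentials are placeholders, not our real setup):

```python
# Rough sketch: poll the parent circuit breaker on every node and print how
# close each one is running to its limit. Endpoint/auth are placeholders --
# adjust for your own TLS/security settings.
import time

import requests

ES_URL = "http://localhost:9200"  # placeholder; use https + credentials if security is enabled


def check_parent_breaker():
    stats = requests.get(f"{ES_URL}/_nodes/stats/breaker", timeout=10).json()
    for node in stats["nodes"].values():
        parent = node["breakers"]["parent"]
        used = parent["estimated_size_in_bytes"]
        limit = parent["limit_size_in_bytes"]
        print(
            f"{node['name']}: parent {used / limit:.1%} of limit "
            f"({parent['estimated_size']} / {parent['limit_size']}), "
            f"tripped {parent['tripped']} time(s)"
        )


if __name__ == "__main__":
    while True:
        check_parent_breaker()
        time.sleep(60)
```

The same numbers are available via `GET _nodes/stats/breaker` in Kibana Dev Tools; the script just makes it easier to poll.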
Further down in the same log, I then found this --
[2022-11-29T19:51:53,924][INFO ][o.e.c.c.Coordinator ] [node-1] master node [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}{ml.machine_memory=33675792384, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=8589934592, transform.node=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}{ml.machine_memory=33675792384, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=8589934592, transform.node=true}] failed [3] consecutive checks
at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1283) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1184) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [elasticsearch-7.13.0.jar:7.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-2][172.31.8.228:9300][internal:coordination/fault_detection/leader_check] request_id [60334983] timed out after [10007ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1185) ~[elasticsearch-7.13.0.jar:7.13.0]
... 4 more
[2022-11-29T19:51:53,926][INFO ][o.e.c.s.ClusterApplierService] [node-1] master node changed {previous [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}], current []}, term: 25, version: 21685, reason: becoming candidate: onLeaderFailure
[2022-11-29T19:51:53,929][INFO ][o.e.x.w.WatcherService ] [node-1] paused watch execution, reason [no master node], cancelled [0] queued tasks
[2022-11-29T19:51:54,070][INFO ][o.e.c.s.MasterService ] [node-1] elected-as-master ([2] nodes joined)[{node-3}{CGRUQiYQRp6wANZ3-nQflA}{7G80d_GOTtCkLQOTLeZN6Q}{172.31.1.110}{172.31.1.110:9300}{cdfhilmrstw} elect leader, {node-1}{0Q0_LdRzQy-JdJ6r1XtqZA}{1VSUYPTwTzCsmy_c0_XpSQ}{172.31.6.214}{172.31.6.214:9300}{cdfhilmrstw} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 26, version: 21686, delta: master node changed {previous [], current [{node-1}{0Q0_LdRzQy-JdJ6r1XtqZA}{1VSUYPTwTzCsmy_c0_XpSQ}{172.31.6.214}{172.31.6.214:9300}{cdfhilmrstw}]}
[2022-11-29T19:52:03,943][WARN ][o.e.t.OutboundHandler ] [node-1] send message failed [channel: Netty4TcpChannel{localAddress=/172.31.6.214:47586, remoteAddress=172.31.8.228/172.31.8.228:9300, profile=default}]
io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
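The leader-check and TLS handshake timeouts above suggest that node-2 simply stopped responding for a while, and given the breaker messages I suspect long GC pauses on a nearly full heap. This is a rough sketch of how I plan to pull the relevant lines from node-2's own log around that window (the log path and cluster name are assumptions based on a default package install, not our actual layout):

```python
# Rough sketch: extract GC-overhead, circuit-breaker and master-change lines
# from node-2's log for the hour of the incident. Path and cluster name are
# placeholders -- adjust to the actual log location.
import re

LOG_PATH = "/var/log/elasticsearch/my-cluster.log"  # placeholder
WINDOW_PREFIX = "[2022-11-29T19:"  # hour of the leader-check failure

interesting = re.compile(
    r"JvmGcMonitorService|CircuitBreakingException|master node changed"
)

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if line.startswith(WINDOW_PREFIX) and interesting.search(line):
            print(line.rstrip())
```

If node-2 logged `[gc][...] overhead, spent [...] collecting` warnings just before 19:51, that would line up with both the 10-second leader-check timeout and the handshake timeout.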
ELK stack info --
ELK Version - 7.11.1
Subscription - Platinum
ES Nodes - 3
Node config - Disk space - 1 TB, Memory - 32 GB, Cores - 8
Disk Available - 94.11% (Current) | Total Size - 2.8 TB
JVM Heap - 42.97% (Current) | Total heap - 24 GB (8 GB per node; see the quick check below)
Indices - 189
Documents - 290,418,633
Disk Usage - 127.1 GB
Primary Shards - 189
Replica Shards - 189
Machine Learning Job - 19
All 3 Nodes are AWS servers
Kibana and logstash reside in separate AWS servers
Watcher enabled
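One thing I did verify from the numbers above: the `7.5gb` limit in the breaker messages is exactly 95% of the 8 GB heap, which matches the default `indices.breaker.total.limit` of 95% when the real-memory parent breaker is enabled, so at the time of the errors node-2's heap was effectively full:

```python
# Sanity check on the figures quoted in the error messages above.
heap_bytes = 8_589_934_592     # 8 GB heap (ml.max_jvm_size in the node attributes)
breaker_limit = 8_160_437_862  # "limit of [8160437862/7.5gb]" from the log
real_usage = 8_357_138_432     # worst "real usage" in the snippets above

print(f"breaker limit = {breaker_limit / heap_bytes:.2%} of heap")  # ~95.00%
print(f"real usage    = {real_usage / heap_bytes:.2%} of heap")     # ~97.3%
```

If I'm reading that correctly, the breaker trips are a symptom of heap pressure on node-2 rather than the root cause.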
A few weeks back I noticed a similar incident in our ELK environment.
Can you please help me find the root cause of this issue? Since it's a production stack, these events have a serious impact on client monitoring.
Also, I was not able to figure out why and when node-2 went down. How can I find that?
Regards,
Souvik