We monitor Elasticsearch by polling _cluster/stats. However, it seems to give false alerts, or we have an underlying issue that needs to be solved.
The cluster consists of 3 nodes. We use Curator to periodically delete old indices and roll them over. Elasticsearch seems to work fine and the Curator logs show no errors.
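For context, the monitoring check is roughly equivalent to the following sketch in Python (the URL is a placeholder for one of our nodes, and the x-pack credentials are omitted):

```python
import requests

# Hypothetical endpoint; any of the three nodes can answer the request.
STATS_URL = "http://gdph-es-01:9200/_cluster/stats"

def check_cluster():
    resp = requests.get(STATS_URL, timeout=10)
    resp.raise_for_status()          # an HTTP error here trips the alert
    stats = resp.json()
    if stats["status"] != "green":   # cluster stats echoes the health status
        raise RuntimeError("cluster status is " + stats["status"])
    return stats
```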
On rare occasions the cluster stats call fails with an exception:
[2018-12-15T15:30:03,719][INFO ][o.e.c.r.a.AllocationService] [gdph-es-01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[gh-partition-000592][1]] ...]).
[2018-12-15T15:40:03,403][INFO ][o.e.c.m.MetaDataDeleteIndexService] [gdph-es-01] [gh-partition-000538/mY4H42tjQsS72_WEomcpzw] deleting index
[2018-12-15T16:30:04,647][INFO ][o.e.c.m.MetaDataDeleteIndexService] [gdph-es-01] [gh-partition-000539/Q5jRkFC5RDWt76DMzdFK_g] deleting index
[2018-12-15T16:30:04,790][DEBUG][o.e.a.a.c.s.TransportClusterStatsAction] [gdph-es-01] failed to execute on node [mGitnIAFQuilix-OjpWU8g]
org.elasticsearch.transport.RemoteTransportException: [gdph-es-03][10.201.236.37:9201][cluster:monitor/stats[n]]
Caused by: org.apache.lucene.store.AlreadyClosedException: engine is closed
at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:1988) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:961) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.action.admin.indices.stats.CommonStats.&lt;init&gt;(CommonStats.java:211) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:106) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:50) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:259) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:317) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Since this happens during a Curator action, could it be a race condition, and if so, is there any way to avoid it? The client service itself seems unaffected; error rates and response times did not change.
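One idea we had is to only alert when the stats call fails repeatedly, so a single transient failure during an index deletion is swallowed. A minimal sketch, assuming the same Python monitoring script as above (the attempt count and backoff are arbitrary):

```python
import time
import requests

STATS_URL = "http://gdph-es-01:9200/_cluster/stats"  # hypothetical endpoint

def stats_with_retry(url=STATS_URL, attempts=3, backoff_s=5.0):
    """Alert only if every attempt fails, so one transient
    AlreadyClosedException does not page anyone."""
    last_err = None
    for _ in range(attempts):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            body = resp.json()
            # The response carries a _nodes summary; treat a partial
            # response (failed nodes) as retryable as well.
            if body.get("_nodes", {}).get("failed", 0) == 0:
                return body
            last_err = RuntimeError("partial stats response: %s" % body["_nodes"])
        except requests.RequestException as err:
            last_err = err
        time.sleep(backoff_s)
    raise last_err
```

Is that a reasonable workaround, or is there a cleaner way to make the stats call tolerate an index that is being deleted?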