We are using Elasticsearch 7.8.0 on CentOS 7. The cluster has three data pods, and we restarted one of them. Suddenly, curl requests to the Elasticsearch service stopped working, and the following errors are observed in the logs of the elected master pod:
{"type":"log","host":"abc-elasticsearch-master-0","level":"WARN","system":"abc","time": "2021-01-15T08:00:00.654Z","logger":"o.e.c.InternalClusterInfoService","timezone":"UTC","marker":"[abc-elasticsearch-master-0] ","log":{"message":"Failed to update shard information for ClusterInfoUpdateJob within 15s timeout"}}
{"type":"log","host":"abc-elasticsearch-master-0","level":"WARN","system":"abc","time": "2021-01-15T08:01:00.656Z","logger":"o.e.c.InternalClusterInfoService","timezone":"UTC","marker":"[abc-elasticsearch-master-0] ","log":{"message":"Failed to update shard information for ClusterInfoUpdateJob within 15s timeout"}}
{"type":"log","host":"abc-elasticsearch-master-0","level":"WARN","system":"abc","time": "2021-01-15T08:02:00.658Z","logger":"o.e.c.InternalClusterInfoService","timezone":"UTC","marker":"[abc-elasticsearch-master-0] ","log":{"message":"Failed to update shard information for ClusterInfoUpdateJob within 15s timeout"}}
{"type":"log","host":"abc-elasticsearch-master-0","level":"WARN","system":"abc","time": "2021-01-15T08:02:45.659Z","logger":"o.e.c.InternalClusterInfoService","timezone":"UTC","marker":"[abc-elasticsearch-master-0] ","log":{"message":"Failed to update node information for ClusterInfoUpdateJob within 15s timeout"}}
{"type":"log","host":"abc-elasticsearch-master-0","level":"WARN","system":"abc","time": "2021-01-15T08:03:00.660Z","logger":"o.e.c.InternalClusterInfoService","timezone":"UTC","marker":"[abc-elasticsearch-master-0] ","log":{"message":"Failed to update shard information for ClusterInfoUpdateJob within 15s timeout"}}
{"type":"log","host":"abc-elasticsearch-master-0","level":"WARN","system":"abc","time": "2021-01-15T08:04:00.662Z","logger":"o.e.c.InternalClusterInfoService","timezone":"UTC","marker":"[abc-elasticsearch-master-0] ","log":{"message":"Failed to update shard information for ClusterInfoUpdateJob within 15s timeout"}}
{"type":"log","host":"abc-elasticsearch-master-0","level":"WARN","system":"abc","time": "2021-01-15T08:05:00.664Z","logger":"o.e.c.InternalClusterInfoService","timezone":"UTC","marker":"[abc-elasticsearch-master-0] ","log":{"message":"Failed to update shard information for ClusterInfoUpdateJob within 15s timeout"}}
{"type":"log","host":"abc-elasticsearch-master-0","level":"WARN","system":"abc","time": "2021-01-15T08:06:00.666Z","logger":"o.e.c.InternalClusterInfoService","timezone":"UTC","marker":"[abc-elasticsearch-master-0] ","log":{"message":"Failed to update shard information for ClusterInfoUpdateJob within 15s timeout"}}
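For context, the data pod was restarted with a plain pod delete and then recreated by its controller; the pod name and namespace below are placeholders, not our exact values:
kubectl delete pod abc-elasticsearch-data-1 -n <namespace>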
Some of the REST commands that are working (full curl forms for both lists are shown below):
curl _cat/health
curl _nodes/hot_threads?threads=9999
Some of the REST commands that are not working:
curl _cat/indices
curl _cat/allocation
curl _cat/nodes
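To be concrete, the full curl forms we run look like the following; the service hostname is a placeholder for our Kubernetes HTTP/client service, and the default port 9200 is assumed:
curl -s "http://<es-http-service>:9200/_cat/health?v"                      (works)
curl -s "http://<es-http-service>:9200/_nodes/hot_threads?threads=9999"    (works)
curl -s "http://<es-http-service>:9200/_cat/indices?v"                     (not working)
curl -s "http://<es-http-service>:9200/_cat/allocation?v"                  (not working)
curl -s "http://<es-http-service>:9200/_cat/nodes?v"                       (not working)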
Following is some of the output of _nodes/hot_threads?threads=9999:
::: {abc-elasticsearch-client-768fd6bcc6-68zgd}{cfIWxQEjS_qdtX9e-Z4mxw}{GoUBp8MNS8CLACv7oYItPg}{aa.aa.aa.aa}{aa.aa.aa.aa:9300}{i}
   Hot threads at 2021-01-15T11:01:09.333Z, interval=500ms, busiestThreads=9999, ignoreIdleThreads=true:

    0.0% (24.3micros out of 500ms) cpu usage by thread 'signals/_main_Worker-1'
     10/10 snapshots sharing following 2 elements
       java.base@11.0.9/java.lang.Object.wait(Native Method)
       org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

    0.0% (24.1micros out of 500ms) cpu usage by thread 'signals/_main_Worker-2'
     10/10 snapshots sharing following 2 elements
       java.base@11.0.9/java.lang.Object.wait(Native Method)
       org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

    0.0% (13.6micros out of 500ms) cpu usage by thread 'signals/_main_Worker-3'
     10/10 snapshots sharing following 2 elements
       java.base@11.0.9/java.lang.Object.wait(Native Method)
       org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[keepAlive/7.8.0]'
     10/10 snapshots sharing following 8 elements
       java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@11.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
       java.base@11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
       java.base@11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
       java.base@11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
       java.base@11.0.9/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
       app//org.elasticsearch.bootstrap.Bootstrap$1.run(Bootstrap.java:89)
       java.base@11.0.9/java.lang.Thread.run(Thread.java:834)

    0.0% (0s out of 500ms) cpu usage by thread 'Common-Cleaner'
     10/10 snapshots sharing following 5 elements
       java.base@11.0.9/java.lang.Object.wait(Native Method)
       java.base@11.0.9/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:155)
       java.base@11.0.9/jdk.internal.ref.CleanerImpl.run(CleanerImpl.java:148)
       java.base@11.0.9/java.lang.Thread.run(Thread.java:834)
       java.base@11.0.9/jdk.internal.misc.InnocuousThread.run(InnocuousThread.java:134)
Could someone please suggest how to resolve this issue?
Thanks,