Background thread had an uncaught exception: org.elasticsearch.ElasticsearchException: failed to refresh store stats

Hi,

I have a two-node ELK cluster running Elasticsearch 1.7.2.
A couple of days ago, Elasticsearch started throwing the exception below continuously. At first the indexer could still send data to Elasticsearch, but after a few days it started failing with error 503.

Elasticsearch exception:
[ERROR][marvel.agent ] [clustername] Background thread had an uncaught exception:
org.elasticsearch.ElasticsearchException: failed to refresh store stats
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1573)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1558)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:55)
at org.elasticsearch.index.store.Store.stats(Store.java:290)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:639)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:139)
at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:55)
at org.elasticsearch.indices.IndicesService.stats(IndicesService.java:231)

Logstash indexer error:
:message=>"retrying failed action with response code: 503

The servers' hardware, CPU, and memory all look OK.
The problem goes away after I restart the Elasticsearch service.

What caused this problem? How can I prevent it from happening again? Is there a way to monitor for this error, or can ES send out notifications?

503 suggests that your thread pools may be overloaded; what does Marvel show for them?
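
If you want to cross-check the thread pools outside Marvel, the nodes stats API reports the same counters. Below is a minimal sketch in Python, assuming the requests library and a node reachable on localhost:9200 (adjust the URL for your cluster); it prints any pool that currently has queued or rejected operations, which is what typically precedes 503s.

import requests

ES_URL = "http://localhost:9200"  # assumed address of one of your nodes

# /_nodes/stats/thread_pool returns per-node stats for every thread pool
resp = requests.get(ES_URL + "/_nodes/stats/thread_pool", timeout=10)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    name = node.get("name", node_id)
    for pool, stats in node["thread_pool"].items():
        # a growing queue or a non-zero rejected count points at an overloaded pool
        if stats.get("queue", 0) > 0 or stats.get("rejected", 0) > 0:
            print("%s  %s  queue=%s  rejected=%s"
                  % (name, pool, stats["queue"], stats["rejected"]))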

Hi, thank you for replying.
There are various thread pool statistics in Marvel; I assume you're referring to the one related to indexing. Let me know if I'm wrong.
The index thread pool thread count is constantly at 32.
The index thread pool rejected count is at 0.
The index thread pool ops per sec is usually 0, but sometimes goes up a bit; the highest is 0.003.
The index thread pool queue size is always 0.
The index thread pool largest thread count is at 32.

We are also seeing the same issue in our ES cluster (1.7.5). The following is the stack trace on some of our data nodes.

[2016-06-13 08:36:53,111][ERROR][marvel.agent ] [es-data-NODE-XX] Background thread had an uncaught exception:
org.elasticsearch.ElasticsearchException: failed to refresh store stats
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1573)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1558)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:55)
at org.elasticsearch.index.store.Store.stats(Store.java:290)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:638)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:139)
at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:55)
at org.elasticsearch.indices.IndicesService.stats(IndicesService.java:231)
at org.elasticsearch.indices.IndicesService.stats(IndicesService.java:188)
at org.elasticsearch.node.service.NodeService.stats(NodeService.java:138)
at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportNodeStats(AgentService.java:342)
at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:254)
at java.lang.Thread.run(Unknown Source)

This is quite confusing: the ES cluster health API shows that the cluster has all of its nodes, whereas Marvel shows the nodes throwing the above exception as missing.
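
For reference, a minimal sketch (Python with requests, assuming a node reachable on localhost:9200; adjust the URL and add auth if needed) to compare the two views directly: it prints the node count from the cluster health API next to the nodes that actually answer the nodes info API, so you can tell whether a node has really left the cluster or is only missing from Marvel's stats.

import requests

ES_URL = "http://localhost:9200"  # assumed address of a reachable node

health = requests.get(ES_URL + "/_cluster/health", timeout=10).json()
nodes = requests.get(ES_URL + "/_nodes", timeout=10).json()["nodes"]

print("cluster status: %s" % health["status"])
print("number_of_nodes (health API): %d" % health["number_of_nodes"])
print("nodes answering _nodes: %d" % len(nodes))
for node_id, node in nodes.items():
    print(" - %s" % node.get("name", node_id))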

At the same time, I see all the other nodes throwing the stack trace below. I checked, and there is no communication issue between the nodes.
[2016-06-12 15:00:48,721][DEBUG][action.admin.cluster.node.stats] [es-client-XX] failed to execute on node [v1-ua9fZSym6v-wjVtzucQ]
org.elasticsearch.transport.RemoteTransportException: [es-data-NODE-XX][inet[/192.168.XX.XX:9300]][cluster:monitor/nodes/stats[n]]
Caused by: org.elasticsearch.ElasticsearchException: failed to refresh store stats
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1573)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1558)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:55)
at org.elasticsearch.index.store.Store.stats(Store.java:290)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:638)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:139)
at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:55)
at org.elasticsearch.indices.IndicesService.stats(IndicesService.java:231)
at org.elasticsearch.node.service.NodeService.stats(NodeService.java:156)
at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:96)
at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:44)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:292)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:283)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)

Any help here?

Biju