Elasticsearch: one shard of an index keeps going into initialization

Hi,

Every now and again one shard of an index goes into initialization, with no message in the log that indicates there is a problem:

/var/log/elasticsearch/elasticsearch.log:[2016-03-21 10:59:56,301][INFO ][cluster.metadata ] [Bushwacker] [apache-varnish-2016.03.21] creating index, cause [auto(bulk api)], templates [wcms], shards [6]/[0], mappings [default, varnish.varnishncsa.log, varnish.varnishncsa.log]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 10:59:56,401][INFO ][cluster.metadata ] [Bushwacker] [apache-varnish-2016.03.21] update_mapping [varnish.varnishncsa.log]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 10:59:56,408][INFO ][cluster.metadata ] [Bushwacker] [apache-varnish-2016.03.21] update_mapping [varnish.varnishncsa.log]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 10:59:56,419][INFO ][cluster.metadata ] [Bushwacker] [apache-varnish-2016.03.21] update_mapping [varnish.varnishncsa.log]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 10:59:59,335][INFO ][cluster.metadata ] [Bushwacker] [apache-varnish-2016.03.21] update_mapping [varnish.varnishncsa.log]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 12:05:53,863][INFO ][cluster.metadata ] [Bushwacker] [apache-varnish-2016.03.21] update_mapping [varnish.varnishncsa.log]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 15:45:41,493][WARN ][index.engine ] [Bushwacker] [apache-varnish-2016.03.21][3] failed engine [refresh failed]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 15:45:41,493][WARN ][index.shard ] [Bushwacker] [apache-varnish-2016.03.21][3] Failed to perform scheduled engine refresh
/var/log/elasticsearch/elasticsearch.log:[apache-varnish-2016.03.21][[apache-varnish-2016.03.21][3]] RefreshFailedEngineException[Refresh failed]; nested: NullPointerException;
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 15:45:41,496][WARN ][indices.cluster ] [Bushwacker] [[apache-varnish-2016.03.21][3]] marking and sending shard failed due to [engine failure, reason [refresh failed]]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 15:45:41,497][WARN ][cluster.action.shard ] [Bushwacker] [apache-varnish-2016.03.21][3] received shard failed for [apache-varnish-2016.03.21][3], node[NPeu22XsRzK1cuFkjV7Ktw], [P], v[2], s[STARTED], a[id=kNj3YPV4Q6OE7qOIET1t3w], indexUUID [r7HAjs1mQq2_kO4zyYFb7w], message [engine failure, reason [refresh failed]], failure [NullPointerException[null]]
/var/log/elasticsearch/elasticsearch.log:[2016-03-21 15:45:41,532][WARN ][index.translog ] [Bushwacker] [apache-varnish-2016.03.21][3] failed to delete temp file /opt/elasticsearch/es-data/elasticsearch/nodes/0/indices/apache-varnish-2016.03.21/3/translog/translog-1522912244841259470.tlog

The shard that fails changes every time, but it's only one at a time.

After stopping Logstash the index recovers. This started happening after we split our logs indexes up from 1 index into 4.

I have slowed the throughput rate of the input (RabbitMQ) and the output (Elasticsearch) with no luck, and there doesn't seem to be any memory problem.
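
Heap usage can be checked with the JVM section of the nodes stats API if that's useful, e.g.:

curl -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty'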

Any ideas?

Thanks
Wal

Hey,

  • Which Elasticsearch version is this? (If you're unsure, see the curl example below.)
  • Do you have a full stack trace in the logs of the node where this shard should have been allocated?

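The root endpoint reports the version, e.g.:

curl -XGET 'http://localhost:9200/?pretty'
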
--Alex

Hey Alex,

The problem has not come back since we restarted all the apps on the system. If it happens again I will submit a stack trace.
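
(The surrounding stack trace can usually be pulled from the log with something like the following; the amount of context is a guess.)

grep -B 2 -A 40 "failed engine" /var/log/elasticsearch/elasticsearch.log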

Thanks for your help

Cheers
Wal

Hi Alex,

Below is a stack trace of the shard initializing (Elasticsearch version 2.1.0).

curl -XGET http://localhost:9200/_cat/shards | grep INIT
apache-cae-2016.04.04 2 p INITIALIZING 127.0.0.1 Victorius
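
In case it helps, the recovery stage of that shard can also be seen with the cat recovery API (index name taken from the output above), e.g.:

curl -XGET 'http://localhost:9200/_cat/recovery/apache-cae-2016.04.04?v'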

I have the full stack trace, but it exceeds the word limit and I can't upload a text file.

Any help would be much appreciated

Thanks
Wal

curl -XGET http://localhost:9200/_nodes/hot_threads
::: {Victorius}{P9m2p0NjR2GWyR-l_dOCLw}{127.0.0.1}{127.0.0.1:9300}
   Hot threads at 2016-04-04T07:18:41.113Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   101.1% (505.6ms out of 500ms) cpu usage by thread 'elasticsearch[Victorius][generic][T#135]'
     2/10 snapshots sharing following 29 elements
       org.apache.lucene.codecs.lucene50.Lucene50DocValuesFormat.fieldsConsumer(Lucene50DocValuesFormat.java:197)
       org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.getInstance(PerFieldDocValuesFormat.java:187)
       org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addSortedSetField(PerFieldDocValuesFormat.java:131)
       org.apache.lucene.index.SortedSetDocValuesWriter.flush(SortedSetDocValuesWriter.java:164)
       org.apache.lucene.index.DefaultIndexingChain.writeDocValues(DefaultIndexingChain.java:142)
       org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:93)
       org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:422)
       org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:503)
       org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:377)
       org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:467)
       org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
       org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1254)
       org.elasticsearch.index.engine.InternalEngine.innerCreateNoLock(InternalEngine.java:448)
       org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:390)
       org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:359)
       org.elasticsearch.index.shard.TranslogRecoveryPerformer.performRecoveryOperation(TranslogRecoveryPerformer.java:158)
       org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:242)
       org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:175)
       org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
       org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
       org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
       org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
       org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
       org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
       org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
       org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)
     3/10 snapshots sharing following 23 elements
       org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:459)
       org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:503)
       org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:377)
       org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:467)
       org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
       org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1254)
       org.elasticsearch.index.engine.InternalEngine.innerCreateNoLock(InternalEngine.java:448)
       org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:390)
       org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:359)