Hi all,
we are using the ELK stack for real-time logging on our platform.
During the night we had an issue where Logstash was unable to write to Elasticsearch because one of the shards could not recover.
We are running the complete stack at version 5.6.
These are the logs we have in Elasticsearch:
[2018-04-05T23:48:17,185][WARN ][o.e.c.a.s.ShardStateAction] [870632-SPWLOG01] [iislog-2018.04.05][3] received shard failed for shard id [[iislog-2018.04.05][3]], allocation id [4Q-GWqd-RRKvht8nML4FrA], primary term [0], message [shard failure, reason [lucene commit failed]], failure [FileSystemException[D:\Elasticsearch\data\nodes\0\indices\CKrJStXdRqeMjlmeTmrouA\3\index_5qc_Lucene50_0.doc: The process cannot access the file because it is being used by another process.
]]
java.nio.file.FileSystemException: D:\Elasticsearch\data\nodes\0\indices\CKrJStXdRqeMjlmeTmrouA\3\index_5qc_Lucene50_0.doc: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86) ~[?:?]
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?]
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) ~[?:?]
at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:115) ~[?:?]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_144]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_144]
at org.apache.lucene.util.IOUtils.fsync(IOUtils.java:471) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:327) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:285) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:83) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:83) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4724) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3085) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3244) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3207) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:1576) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1062) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:253) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:221) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:92) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:1033) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:987) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:360) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1236) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1484) ~[elasticsearch-5.6.0.jar:5.6.0]
at
[2018-04-05T23:48:31,177][WARN ][o.e.i.e.Engine ] [870632-SPWLOG01] [iislog-2018.04.05][3] tried to fail engine but engine is already failed. ignoring. [failed to recover from translog]
org.elasticsearch.index.engine.FlushFailedEngineException: Flush failed
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1069) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:253) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:221) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:92) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:1033) ~[elasticsearch-5.6.0.jar:5.6.0]
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:987) ~[elasticsearch-5.6.0.jar:5.6.0]
Caused by: java.nio.file.FileSystemException: D:\Elasticsearch\data\nodes\0\indices\CKrJStXdRqeMjlmeTmrouA\3\index_8tv_Lucene50_0.pos: The process cannot access the file because it is being used by another process.
And these are the logs from Logstash:
[2018-04-06T00:00:48,722][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[iislog-2018.04.05][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[iislog-2018.04.05][3]] containing [18] requests]"})
[2018-04-06T00:00:48,722][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[iislog-2018.04.05][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[iislog-2018.04.05][3]] containing [18] requests]"})
Has anyone run into this issue before?
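For reference, this is a minimal sketch (Python 3 with the requests library, assuming Elasticsearch is reachable on localhost:9200) of what we plan to run to see why the primary is unassigned and to retry the allocation once the file lock is released; the index name and shard number are taken from the logs above.

import requests

ES = "http://localhost:9200"  # assumption: adjust to your node's address

# Ask the cluster why the primary of this shard is unassigned
# (index name and shard number come from the error messages above).
body = {"index": "iislog-2018.04.05", "shard": 3, "primary": True}
resp = requests.get(ES + "/_cluster/allocation/explain", json=body)
print(resp.json())

# Once whatever process was holding the Lucene file has released it,
# ask the cluster to retry allocations that exhausted their retry limit.
resp = requests.post(ES + "/_cluster/reroute?retry_failed=true")
print(resp.json())

Both endpoints exist in 5.6 (_cluster/allocation/explain and _cluster/reroute with retry_failed), so the same calls can also be made with curl if that is easier.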