Hello everyone,
I get this exception when trying to update some documents in one of my indices.
I'll post some of my Elasticsearch cluster configuration (ES version, index state) below.
[2016-04-28 08:51:57,455][WARN ][cluster.action.shard ] [Lightspeed] [caparmor_2015][3] received shard failed for target shard [[caparmor_2015][3], node[Y1kc958iT46nd-EtVbUKgQ], [P], v[7], s[STARTED], a[id=8C35eXqIR8WZ0sikSeRDqQ]], indexUUID [fVYBjB8-SSenL4KxCk_3TA], message [engine failure, reason [already closed by tragic event on the index writer]], failure [AlreadyClosedException[Underlying file changed by an external force at 2016-04-15T10:59:09.550936Z, (lock=NativeFSLock(path=/usr/share/elasticsearch/data/caparmor/nodes/0/indices/caparmor_2015/3/index/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],ctime=2016-04-15T10:59:09.550936Z))]]
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2016-04-15T10:59:09.550936Z, (lock=NativeFSLock(path=/usr/share/elasticsearch/data/caparmor/nodes/0/indices/caparmor_2015/3/index/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],ctime=2016-04-15T10:59:09.550936Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:179)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.&lt;init&gt;(CompressingStoredFieldsWriter.java:108)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:128)
at org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsWriter(Lucene50StoredFieldsFormat.java:183)
at org.apache.lucene.index.DefaultIndexingChain.initStoredFieldsWriter(DefaultIndexingChain.java:81)
at org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:279)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:316)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:273)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:413)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1318)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:539)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-04-28 08:51:57,458][INFO ][cluster.routing.allocation] [Lightspeed] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[caparmor_2015][3]] ...]).
[2016-04-28 08:52:32,788][INFO ][cluster.routing.allocation] [Lightspeed] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[caparmor_2015][3]] ...]).
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open caparmor_2015 5 1 7459462 183 6.6gb 3.3gb
docker --version
Docker version 1.10.3, build 20f81dd
Elasticsearch data is not stored inside the containers; I bind-mount a volume on each host so the data survives if something goes wrong with the cluster.
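For reference, the containers are started roughly like this (the image tag and host path here are just placeholders, not my exact command):

# bind-mount the host data directory into the container so the index files live on the host
docker run -d --name es_node1 \
  -p 9200:9200 -p 9300:9300 \
  -v /data/elasticsearch:/usr/share/elasticsearch/data \
  elasticsearch:2.3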
So I don't really know what to do about this... Thanks in advance to anyone who can help.
Perhaps it's a broken filesystem (e.g. NFS) or something like that. Other explanations include someone deleting the write.lock file manually while writers are still open against it.
But we know that the lock is no longer valid, so we fail hard rather than corrupt data.
Is there a better solution than just retrying when indexing?
I'm currently trying to work out which filesystem problems could cause this....
I think even touching the write.lock file could be enough. I realize this may seem harsh, but we have to detect that our lock may no longer be valid because it's e.g. a different file. The inode number is imperfect, because inode numbers get recycled, so we use ctime for now. Just make sure nothing messes with write.lock; seriously, don't even touch it.
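If you want to see what is happening to the file, you can compare the timestamps stat reports for the lock file against the ctime printed in the error message (path taken from the log above; the stat flags are GNU/Linux-specific):

# print the change time (%z), modification time (%y) and name of the lock file
stat -c 'ctime: %z  mtime: %y  %n' \
  /usr/share/elasticsearch/data/caparmor/nodes/0/indices/caparmor_2015/3/index/write.lock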
After investigating a little, I observed that when this problem occurs, the write.lock file is not deleted.
But something else happens:
I wrote a little script that lists the write.lock file for every shard in the cluster, and watched what happens each time I index a document.
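It is roughly this (simplified; the real script runs on every data node, and the data path is the one from the logs):

#!/bin/bash
# list every write.lock under the data directory with its mtime and ctime,
# so I can compare timestamps before and after each indexing attempt
find /usr/share/elasticsearch/data -name write.lock \
  -exec stat -c '%y  %z  %n' {} \;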
Here is an example of the way it happens:
1> Try to index a simple document. ES chooses shard 1, for example.
2> Shard 1 (primary) fails because of write.lock.
3> Shard 1 is relocated to another node: a new write.lock file is created with the current date.
4> Retry indexing. It seems like ES now tries with the replica of shard 1.
5> Shard 1 (replica) is also relocated. A new write.lock file is created as in 3>.
6> The last attempt works fine. No shard is relocated this time.
Is this the normal behaviour?
The problem is that the next time I index and ES chooses this shard, the same thing will happen: I will have to retry twice before succeeding.
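To be clear, by "retrying" I mean something like this around the indexing call (the index URL, type and document are just placeholders):

# retry the same index request a few times until it succeeds
for i in 1 2 3; do
  curl -s --fail -XPUT 'http://localhost:9200/caparmor_2015/doc/1' \
    -d '{"field":"value"}' && break
  echo "attempt $i failed, retrying..."
  sleep 1
done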
The question is: why does shard 1 fail in 2> in the first place? What are you doing to that write.lock? Are you using a shared filesystem by any chance?
Hi, thanks for your answer.
I don't do anything to the write.lock file. In fact, I only discovered its existence through these error messages.
But now I do keep an eye on it, because it is what causes my indexing attempts to fail (sorry, my English may not be proper).
I am not using a shared filesystem. ES data is on XFS, and each filesystem is dedicated to one node.
I found the probable cause of these failures:
--> The data directories are bind-mounted into the docker container. When I list the file from the host:
-rw-r--r-- 1 ntp ssh 0 May 9 10:05 write.lock
When I list the file from the docker container:
-rw-r--r-- 1 elasticsearch elasticsearch 0 May 9 08:05 write.lock
This is because my docker containers are using this timezone:
root@e89522442201:/usr/share/elasticsearch# cat /etc/timezone
Etc/UTC
whereas my host is using this timezone:
root@vsyslog6:~ # cat /etc/timezone
Europe/Paris
I guess I will have to change the timezone used by one or the other ...
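For example, something like one of these when starting the container (the image tag and flags are just illustrative, not my exact command):

# option 1: set the container's timezone to match the host via the TZ environment variable
docker run -d -e TZ=Europe/Paris elasticsearch:2.3

# option 2: mount the host's timezone files read-only into the container
docker run -d \
  -v /etc/localtime:/etc/localtime:ro \
  -v /etc/timezone:/etc/timezone:ro \
  elasticsearch:2.3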