Hi,
We are using Elasticsearch 1.3.4 on Windows Azure IaaS VMs (with storage accounts/disks holding the actual data) and have seen this happen a couple of times. The translog file fails to recover with the error shown below. When we dig into the translog file, it seems that one of the documents we wanted to add was really large (around 4 MB), that only part of it was written, and that the next document entry starts right where the partial one ends. This causes the version type mismatch error when recovery tries to interpret the translog.
Questions:
- What guarantees exist that translog operation entries are written atomically? Is it possible that we hit this more often with large documents, and is it something we should expect?
- Is there an easier workaround than deleting the translog-*.recovering files? Deleting them requires re-indexing all the affected docs.
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index_1][6] failed to recover shard
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:269)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.ElasticsearchException: failed to read [MyFileContract][XXX@FileName]
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:511)
at org.elasticsearch.index.translog.TranslogStreams.readTranslogOperation(TranslogStreams.java:52)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:241)
... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:508)
... 6 more
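To illustrate what we suspect is happening, here is a minimal Python sketch. It uses a toy length-prefixed record format of our own, not the actual Elasticsearch translog layout, and the names write_record, read_records and VALID_VERSION_TYPES are hypothetical. One detail that points in this direction: 48 is the ASCII code for the character '0', so the reader may be interpreting document content as the version type byte.

```python
import io
import struct

# Toy record layout (hypothetical, NOT the real Elasticsearch translog format):
#   4-byte big-endian payload length | 1-byte version type | payload bytes
VALID_VERSION_TYPES = {0, 1, 2, 3}


def write_record(buf, version_type, payload, truncate_to=None):
    """Append one record; truncate_to simulates a write that stopped early."""
    body = payload if truncate_to is None else payload[:truncate_to]
    # The header still advertises the full payload length, as in a torn write.
    buf.write(struct.pack(">IB", len(payload), version_type))
    buf.write(body)


def read_records(data):
    """Read records sequentially, trusting each length prefix."""
    stream = io.BytesIO(data)
    records = []
    while True:
        header = stream.read(5)
        if not header:
            return records
        if len(header) < 5:
            raise ValueError("truncated header")
        length, version_type = struct.unpack(">IB", header)
        if version_type not in VALID_VERSION_TYPES:
            raise ValueError("No version type match [%d]" % version_type)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated payload")
        records.append((version_type, payload))


if __name__ == "__main__":
    torn = io.BytesIO()
    # First record claims 10 payload bytes but only 4 reach the file.
    write_record(torn, 0, b"A" * 10, truncate_to=4)
    # Second record is written in full, directly after the partial one.
    write_record(torn, 0, b"0" * 20)
    try:
        read_records(torn.getvalue())
    except ValueError as err:
        print(err)  # prints: No version type match [48]
```

In this toy case the reader consumes the second record's header as part of the first payload, then lands in the middle of the second payload and reads the character '0' as a version type. We do not know whether the real translog reader fails in the same way; this is only meant to show the kind of misalignment we think we are seeing.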
Any leads or pointers that would help us debug this further would be great.
Thanks!