Translog has incomplete operation written for large documents

Hi,

We are using Elasticsearch 1.3.4 on Windows Azure IaaS VMs (using storage accounts/disks to house the actual data) and have seen this happen a couple of times. The translog file fails to recover with the error below. When we dig deeper into the translog file, it looks like one of the documents we wanted to add was really large (around 4 MB), only part of it was written to the translog, and the next document entry starts right where the large one was cut off. This causes the version mismatch error when recovery tries to interpret the translog.

Questions:

  1. What guarantees exist that translog operation entries are written atomically? Are we likely to hit this more often with large documents, and is it something we should expect?
  2. Is there an easier workaround than deleting the translog-*.recovering files? Deleting them means re-indexing all the affected docs.

org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index_1][6] failed to recover shard
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:269)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.ElasticsearchException: failed to read [MyFileContract][XXX@FileName]
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:511)
at org.elasticsearch.index.translog.TranslogStreams.readTranslogOperation(TranslogStreams.java:52)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:241)
... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:508)
... 6 more
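
For illustration: 48 is the ASCII code for the character `0`, so the byte being read as a version type looks more like document text than a real version type, which fits a reader that has lost its place in the file. The sketch below reproduces that effect with a made-up record layout (`[int length][payload][byte versionType]`). The layout, the class name `TornWriteDemo`, and the valid range 0..2 are assumptions for the example, not the actual translog format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TornWriteDemo {
    // Hypothetical layout, NOT the real translog format:
    // [int payloadLength][payload bytes][byte versionType]
    static byte[] encode(String payload, byte versionType) {
        byte[] body = payload.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.allocate(4 + body.length + 1)
                .putInt(body.length)
                .put(body)
                .put(versionType)
                .array();
    }

    // Simulate a torn write: only the first 10 bytes of the "large" entry
    // reach the file, and the next entry is appended right after them.
    static byte[] tornStream() {
        byte[] large = encode("AAAAAAAAAA", (byte) 0); // 4 + 10 + 1 bytes
        byte[] next = encode("0042", (byte) 0);         // 4 + 4 + 1 bytes
        byte[] torn = Arrays.copyOf(large, 10);         // header + 6 payload bytes
        return ByteBuffer.allocate(torn.length + next.length)
                .put(torn)
                .put(next)
                .array();
    }

    // A sequential reader that trusts the length prefix of the first entry.
    static int readFirstVersionType(byte[] stream) {
        ByteBuffer in = ByteBuffer.wrap(stream);
        int length = in.getInt();
        in.position(in.position() + length); // runs past the tear into the next entry
        return in.get();                     // lands inside the next entry's payload
    }

    // Assumed valid range for this sketch only.
    static void checkVersionType(int value) {
        if (value < 0 || value > 2) {
            throw new IllegalArgumentException("No version type match [" + value + "]");
        }
    }

    public static void main(String[] args) {
        int value = readFirstVersionType(tornStream());
        try {
            checkVersionType(value);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this toy stream the reader skips the 10 bytes the header promised, which carries it through the tear and the next entry's length prefix, so the byte it takes for the version type is the first character of the next payload.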

Any leads or pointers for debugging this further would be great.

Thanks!

Any input would be helpful here. We have hit this a couple of times so far and are not sure how to proceed.

Thanks again!

  1. It's per shard and per operation, so it's as atomic as possible. What do you mean by large files?
  2. The real fix is finding the underlying issue and fixing that. Perhaps turning on debug logging will help? Can you upgrade?
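
On the atomicity question, a generic way for an append-only log to cope with partial writes is to frame each entry with its length and a checksum, so a reader can stop at the first torn entry and keep everything before it. The sketch below shows that general technique only. The frame layout and the names `ChecksummedLog`, `frame`, and `readValid` are hypothetical and are not taken from Elasticsearch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

final class ChecksummedLog {
    static long crcOf(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // Hypothetical frame: [int length][payload][long crc32 of payload]
    static byte[] frame(byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length + 8)
                .putInt(payload.length)
                .put(payload)
                .putLong(crcOf(payload))
                .array();
    }

    static byte[] concat(byte[]... parts) {
        int total = 0;
        for (byte[] p : parts) total += p.length;
        ByteBuffer out = ByteBuffer.allocate(total);
        for (byte[] p : parts) out.put(p);
        return out.array();
    }

    // Returns the payloads of all intact frames, stopping at the first
    // frame that is truncated or whose checksum does not match.
    static List<byte[]> readValid(byte[] log) {
        List<byte[]> entries = new ArrayList<>();
        ByteBuffer in = ByteBuffer.wrap(log);
        while (in.remaining() >= 4) {
            int length = in.getInt();
            if (length < 0 || in.remaining() < (long) length + 8) {
                break; // not enough bytes left for payload + checksum
            }
            byte[] payload = new byte[length];
            in.get(payload);
            long stored = in.getLong();
            if (stored != crcOf(payload)) {
                break; // torn or corrupted entry
            }
            entries.add(payload);
        }
        return entries;
    }

    public static void main(String[] args) {
        byte[] ok = frame("small doc".getBytes(StandardCharsets.US_ASCII));
        byte[] torn = Arrays.copyOf(frame(new byte[64]), 20); // cut off mid-payload
        byte[] next = frame("next doc".getBytes(StandardCharsets.US_ASCII));
        List<byte[]> recovered = readValid(concat(ok, torn, next));
        System.out.println("intact entries before the torn one: " + recovered.size());
    }
}
```

With this kind of framing a torn entry costs only the operations from the tear onward, not the whole file. Whether a given Elasticsearch version does anything similar should be checked against its release notes.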

@warkolm - By upgrade, do you mean to 2.1?

At least off 1.3.X; there have been far too many improvements between then and now.

thanks!