Translog has incomplete operation written for large documents

Mahathi · October 12, 2015, 9:51am

Hi,

We are using Elasticsearch 1.3.4 on Windows Azure IaaS VMs (using storage accounts/disks to house the actual data) and have seen this happen a couple of times. The translog file fails to recover with the error below. When we dig deeper into the translog file, it seems that one of the documents we wanted to add was really large (around 4 MB), that only part of it was written to the translog, and that the next document entry starts from there. This causes the version mismatch error as it's trying to interpret the translog.

Questions:

1. What guarantees exist in terms of translog operation entries being written atomically? Is it possible that for large files we hit this more often, and is it something we should expect?
2. Is there an easier workaround than deleting the translog-*.recovering files? That would require re-indexing all the affected docs.

org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index_1][6] failed to recover shard
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:269)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.ElasticsearchException: failed to read [MyFileContract][XXX@FileName]
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:511)
at org.elasticsearch.index.translog.TranslogStreams.readTranslogOperation(TranslogStreams.java:52)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:241)
... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:508)
... 6 more
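
To illustrate the failure mode described above, here is a minimal sketch of how one partially written entry in a length-prefixed log can make the reader lose its place and later treat a document byte as the version type. The record layout (4-byte length, 1-byte version type, body) and the names `encode`, `read_all`, `build_log` and `VALID_VERSION_TYPES` are made up for this example; this is not Elasticsearch's actual on-disk translog format. In the sketch the bad byte is 48 because the bodies are filled with the ASCII character '0' (code 48); whether that is what happened in the real translog is only a guess.

```python
import struct

# Hypothetical set of valid version type codes (an assumption for this sketch).
VALID_VERSION_TYPES = {0, 1, 2}


def encode(version_type: int, body: bytes) -> bytes:
    """Encode one record: 4-byte big-endian length, then version type + body."""
    payload = bytes([version_type]) + body
    return struct.pack(">I", len(payload)) + payload


def read_all(log: bytes):
    """Read records back to back, trusting each length prefix."""
    ops = []
    pos = 0
    while pos < len(log):
        (size,) = struct.unpack_from(">I", log, pos)
        pos += 4
        payload = log[pos:pos + size]
        pos += size
        version_type = payload[0]
        if version_type not in VALID_VERSION_TYPES:
            raise ValueError(f"No version type match [{version_type}]")
        ops.append((version_type, payload[1:]))
    return ops


def build_log(truncate: bool) -> bytes:
    """Three records; optionally cut the large middle one short."""
    a = encode(0, b"small doc")
    b = encode(0, b"0" * 4000)   # the "large" document
    c = encode(0, b"0" * 5000)   # the entry appended afterwards
    if truncate:
        # Header claims 4001 payload bytes, but only 1001 reach the disk.
        b = b[:4 + 1 + 1000]
    return a + b + c


if __name__ == "__main__":
    print(len(read_all(build_log(truncate=False))), "operations read")
    try:
        read_all(build_log(truncate=True))
    except ValueError as err:
        print("recovery failed:", err)
```

With the truncated entry, the reader consumes the missing bytes from the following entry, ends up in the middle of that entry's body, and reads a body byte where it expects a version type.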

Any leads or pointers for debugging this further would be great.

Thanks!

Mahathi · October 15, 2015, 11:33am

Any input would be helpful here. We have hit this a couple of times so far and are not sure how to proceed.

Thanks again!

warkolm · October 15, 2015, 8:23pm

It's for each shard and for each operation, so it's as atomic as possible. What do you mean by large files?
Finding the underlying issue and fixing that. Perhaps turning on debug logging will help? Can you upgrade?

mosiddi · December 27, 2015, 10:42am

@warkolm - By upgrade, do you mean to 2.1?

warkolm · December 27, 2015, 8:06pm

At least off 1.3.X; there are way too many improvements between there and now.

mosiddi · December 28, 2015, 3:30am

thanks!
