Extremely large translog files per shard in Elasticsearch 6.2.4

After upgrading all our production clusters from ES 5.6.10 to ES 6.2.4, we have twice seen, on two separate clusters, the translog files for a single shard grow far beyond the configured 512MB. Interestingly, this has only happened while we were performing rolling upgrades to update the host machines, not while the cluster was operating in a normal state. On both occasions, one shard ended up with more than 40GB of translog data! This resulted in OOM errors and eventually a RED cluster state. One such error message is shown below. We had to manually delete the translog folder to get back to a sane state.

[es-d56-rm] fatal error in thread [Thread-205], exiting
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:145) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:120) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Index.<init>(Translog.java:954) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Index.<init>(Translog.java:931) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Operation.readOperation(Translog.java:883) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1432) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.BaseTranslogReader.read(BaseTranslogReader.java:103) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.TranslogSnapshot.readOperation(TranslogSnapshot.java:73) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.TranslogSnapshot.next(TranslogSnapshot.java:64) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:70) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.PrimaryReplicaSyncer$2.next(PrimaryReplicaSyncer.java:134) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.PrimaryReplicaSyncer$SnapshotSender.doRun(PrimaryReplicaSyncer.java:234) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.PrimaryReplicaSyncer$SnapshotSender.onResponse(PrimaryReplicaSyncer.java:212) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.PrimaryReplicaSyncer$SnapshotSender.onResponse(PrimaryReplicaSyncer.java:180) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.resync.TransportResyncReplicationAction$1.handleResponse(TransportResyncReplicationAction.java:172) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.resync.TransportResyncReplicationAction$1.handleResponse(TransportResyncReplicationAction.java:150) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1091) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1160) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1150) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1139) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$2.onResponse(TransportReplicationAction.java:401) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$2.onResponse(TransportReplicationAction.java:379) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryResult.respond(TransportReplicationAction.java:466) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.respondIfPossible(TransportWriteAction.java:176) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.respond(TransportWriteAction.java:167) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$onResponse$0(TransportReplicationAction.java:357) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$$Lambda$2723/139779834.accept(Unknown Source) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.ReplicationOperation.finish(ReplicationOperation.java:267) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.ReplicationOperation.decPendingAndFinishIfNeeded(ReplicationOperation.java:248) ~[elasticsearch-6.2.4.jar:6.2.4]

Right now the rolling upgrade is complete on the recently affected cluster and we have a shard with a large translog directory (~1 GB). Is there any logging we can enable to debug further why we have so many translog files lying around?

We would like to know how we could have landed in this state, and whether there are ways to prevent it from happening again.
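(For reference, the translog directory size mentioned above was measured roughly like this; the data path and node directory are the defaults for a package install and may differ on other setups.)

# Per-shard translog directory sizes, largest last. /var/lib/elasticsearch is
# an assumed default data path; the wildcards cover index UUID and shard number.
du -sh /var/lib/elasticsearch/nodes/0/indices/*/*/translog | sort -h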

6.2.4 is quite an old release in the 6.x series. I suspect you might be hitting something like #30244. I suggest you upgrade to a more recent version and report back if this does not resolve it.

@DavidTurner We do not perform any mapping updates on the index that is hitting this issue, so I don't think the issue you mentioned applies to this scenario.

Ok, perhaps it's something else. Regardless, I recommend upgrading to a newer version.

Were you indexing heavily during your rolling upgrade? Elasticsearch will retain the translog while recoveries are ongoing, so it's not that surprising that extra translog was retained during this process.
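As general background (not necessarily the cause here), 6.x also keeps translog around for operations-based recovery, governed by index.translog.retention.size (default 512mb) and index.translog.retention.age (default 12h). You can check the values in effect with something like the following, where INDEXNAME is a placeholder:

# Show the index settings including defaults, so the translog retention and
# flush threshold values in effect are visible even if not set explicitly.
curl -s "localhost:9200/INDEXNAME/_settings?include_defaults=true&pretty"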

When you hit the OOM error Elasticsearch will have written a heap dump. Does this give you any information about why it went OOM? Retaining a lot of translog doesn't normally result in excessive heap usage.
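(The heap-dump-on-OOM flag is enabled in the bundled jvm.options by default; something like this confirms it and shows whether a custom dump path is set. The config path assumes a default package install.)

# Check whether -XX:+HeapDumpOnOutOfMemoryError is enabled and whether a
# custom -XX:HeapDumpPath is configured; adjust the path to your install.
grep -i heapdump /etc/elasticsearch/jvm.options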

How long has the cluster been green since the completion of the rolling upgrade? Can you share the output of GET /INDEXNAME/_stats?level=shards for the affected index?
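For example, something along these lines should do, with INDEXNAME as a placeholder; the filter_path is optional and just trims the response down to the per-shard routing and translog sections:

# Per-shard stats for the affected index, narrowed to the routing info and the
# translog section (size in bytes, operation counts) of each shard copy.
curl -s "localhost:9200/INDEXNAME/_stats?level=shards&filter_path=indices.*.shards.*.routing,indices.*.shards.*.translog&pretty"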

This looks to be related to https://github.com/elastic/elasticsearch/pull/40433, and in particular an issue that we discovered a few days ago. When upgrading from 5.x to 6.x in a rolling fashion, and a 5.x node holding a primary is taken down, a replica on one of the 6.x nodes gets promoted to primary. This promotion currently triggers an asynchronous resend of translog content from the new primary to the replicas, called the primary-replica resync; the goal of this functionality in 6.x is to fully realign replicas with the primary in case of a primary failover. If this resync is happening for many shards in parallel, it might take a while to complete, and if there is ongoing indexing activity, the translog can grow in size, as it holds on to all newly indexed requests until the resync is completed. The bug is that, in a rolling upgrade from 5.x, ES resyncs the full translog of the primary instead of just the portion that's needed to realign the replica with the primary.
We're currently looking at fixes for this. /cc @nhat

In the meantime, the way to reduce the likelihood of this problem is, as @DavidTurner pointed out, to reduce indexing activity during the rolling upgrade.
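If pausing the ingest pipeline itself is not practical, one blunt option, sketched below purely as an illustration, is to temporarily reject writes on the heaviest indices while each node is being restarted; clients will get errors rather than having their requests queued, so this only makes sense if they can retry later. INDEXNAME is a placeholder.

# Reject indexing on the index while a node restart / resync is in progress;
# writes fail fast instead of piling up in the translog.
curl -X PUT "localhost:9200/INDEXNAME/_settings" -H 'Content-Type: application/json' -d'
{ "index.blocks.write": true }
'

# Re-enable writes once the cluster is green and the resyncs have completed.
curl -X PUT "localhost:9200/INDEXNAME/_settings" -H 'Content-Type: application/json' -d'
{ "index.blocks.write": false }
'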


Thank you @ywelsch for your response. It does seem like we have hit the issue you mentioned. We are done upgrading all our 5.x clusters to 6.x. We are currently on 6.2.4. Will we hit this issue again when we upgrade to a minor version of Elasticsearch 6.x?

Will we hit this issue again when we upgrade to a minor version of Elasticsearch 6.x?

No, this should not be an issue for upgrades within 6.x. It only affects rolling upgrades from 5.x to 6.x.
