After upgrading all our production clusters from ES 5.6.10 to ES 6.2.4, we have twice seen the translog files for a single shard grow far beyond the configured 512 MB, on two separate clusters (a sketch of how we check this is included after the stack trace). Interestingly, this has only happened while we were performing rolling upgrades to update the host machines, never while the cluster was operating normally. On both occasions one shard ended up with more than 40 GB of translog data! This resulted in OOM errors and eventually a RED cluster state. We had to manually delete the translog folder to get back to a sane state. One such error message is shown below.
[es-d56-rm] fatal error in thread [Thread-205], exiting
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:145) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:120) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Index.<init>(Translog.java:954) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Index.<init>(Translog.java:931) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog$Operation.readOperation(Translog.java:883) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1432) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.BaseTranslogReader.read(BaseTranslogReader.java:103) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.TranslogSnapshot.readOperation(TranslogSnapshot.java:73) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.TranslogSnapshot.next(TranslogSnapshot.java:64) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:70) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.PrimaryReplicaSyncer$2.next(PrimaryReplicaSyncer.java:134) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.PrimaryReplicaSyncer$SnapshotSender.doRun(PrimaryReplicaSyncer.java:234) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.PrimaryReplicaSyncer$SnapshotSender.onResponse(PrimaryReplicaSyncer.java:212) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.PrimaryReplicaSyncer$SnapshotSender.onResponse(PrimaryReplicaSyncer.java:180) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.resync.TransportResyncReplicationAction$1.handleResponse(TransportResyncReplicationAction.java:172) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.resync.TransportResyncReplicationAction$1.handleResponse(TransportResyncReplicationAction.java:150) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1091) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1160) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1150) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1139) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$2.onResponse(TransportReplicationAction.java:401) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$2.onResponse(TransportReplicationAction.java:379) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryResult.respond(TransportReplicationAction.java:466) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.respondIfPossible(TransportWriteAction.java:176) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.respond(TransportWriteAction.java:167) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$onResponse$0(TransportReplicationAction.java:357) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$$Lambda$2723/139779834.accept(Unknown Source) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.ReplicationOperation.finish(ReplicationOperation.java:267) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.ReplicationOperation.decPendingAndFinishIfNeeded(ReplicationOperation.java:248) ~[elasticsearch-6.2.4.jar:6.2.4]
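For reference, this is roughly how we monitor per-shard translog size and the index-level translog settings: a quick sketch over the REST API (the host, the index name, and our assumption that the 512 MB figure comes from the translog retention/flush-threshold settings are specific to our setup, so treat them as placeholders).

```python
import requests

ES = "http://localhost:9200"   # placeholder: one of our data nodes
INDEX = "my-index"             # placeholder: the affected index

# Per-shard translog stats: number of operations and size on disk for each shard copy.
stats = requests.get(f"{ES}/{INDEX}/_stats/translog", params={"level": "shards"}).json()
for shard_id, copies in stats["indices"][INDEX]["shards"].items():
    for copy in copies:
        tl = copy["translog"]
        print(f"shard {shard_id} primary={copy['routing']['primary']} "
              f"ops={tl['operations']} size_bytes={tl['size_in_bytes']}")

# Index-level translog settings, including defaults (retention size/age, flush threshold).
settings = requests.get(
    f"{ES}/{INDEX}/_settings/index.translog.*",
    params={"include_defaults": "true", "flat_settings": "true"},
).json()
print(settings[INDEX].get("settings", {}))
print(settings[INDEX].get("defaults", {}))
```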
Right now the rolling upgrade has completed on the most recently affected cluster, and we still have a shard with a large translog directory (~1 GB). Is there any logging we can enable to debug why so many translog files are lying around?
We would like to understand how we could have landed in this state and whether there are ways to prevent it from happening again.
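In the meantime, this is the sort of temporary logging bump we had in mind: a sketch using the dynamic `logger.*` keys in the cluster settings API. The choice of packages and levels is our guess at what would surface translog rollover/trimming activity, not something we have confirmed is the right knob.

```python
import requests

ES = "http://localhost:9200"  # placeholder: any node we can reach

# Transiently raise logging for the translog and shard packages via the dynamic
# logger settings in the cluster settings API; PUTting the same keys with null
# values later restores the default levels.
payload = {
    "transient": {
        "logger.org.elasticsearch.index.translog": "TRACE",
        "logger.org.elasticsearch.index.shard": "DEBUG",
    }
}
resp = requests.put(f"{ES}/_cluster/settings", json=payload)
resp.raise_for_status()
print(resp.json())
```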