Extremely large translog files per shard in Elasticsearch 6.2.4

ywelsch · March 28, 2019, 8:35am

This looks to be related to https://github.com/elastic/elasticsearch/pull/40433, and in particular to an issue that we discovered a few days ago. During a rolling upgrade from 5.x to 6.x, when a 5.x node holding a primary is taken down, a replica on one of the 6.x nodes gets promoted to primary. This currently triggers an asynchronous resend of the full content of the translog from the new primary to the replicas. This is called the primary-replica resync, and its goal in 6.x is to fully realign replicas with the primary after a primary failover. If the resync is happening for many shards in parallel, it might take a while to complete, and if there is ongoing indexing activity, the translog can grow in size because it holds on to all newly indexed requests until the resync is completed. The bug is that in a rolling upgrade from 5.x, ES resyncs the full translog of the primary instead of just the portion that's needed to realign the replica with the primary.
We're currently looking at fixes for this /cc: @nhat

In the meantime, as @DavidTurner pointed out, reducing indexing activity during the rolling upgrade lowers the likelihood of hitting this problem.
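To watch for this while the upgrade is in progress, one option is to poll the index stats API (for example `GET /_stats/translog?level=shards`) and flag shard copies whose translog is unusually large. The snippet below is only a rough sketch and not an official tool: the function name `large_translogs` and the threshold are made up for illustration, and it assumes the shard-level response nests as `indices -> <index> -> shards -> <shard id> -> [copies]`, with `translog.size_in_bytes` and `routing.primary` on each copy. Check that against the actual response from your cluster version.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical helper: scans an already-fetched, already-parsed JSON body
# from the index stats API and reports shard copies with a big translog.
def large_translogs(
    stats: Dict[str, Any], threshold_bytes: int
) -> List[Tuple[str, int, bool, int]]:
    """Return (index, shard, is_primary, size_in_bytes), largest first."""
    hits = []
    for index_name, index_stats in stats.get("indices", {}).items():
        # Assumed layout: each shard id maps to a list of copies
        # (the primary plus any replicas known to the queried nodes).
        for shard_id, copies in index_stats.get("shards", {}).items():
            for copy in copies:
                size = copy.get("translog", {}).get("size_in_bytes", 0)
                if size >= threshold_bytes:
                    is_primary = copy.get("routing", {}).get("primary", False)
                    hits.append((index_name, int(shard_id), is_primary, size))
    # Largest translogs first, so the worst offenders are at the top.
    return sorted(hits, key=lambda hit: hit[3], reverse=True)
```

Calling it with the parsed response and a threshold such as `1024 ** 3` (1 GiB) returns the shard copies to look at first. If the same shards keep growing while indexing continues, that is consistent with a resync that has not yet completed.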
