Replica Shards Much Larger Than Primaries

I see two (possibly related) things in this excerpt from the shard-level stats. The inflated shards have significantly higher flush counts, and a local checkpoint that lags the maximum sequence number by a long way. The lagging local checkpoint means Elasticsearch keeps hold of a lot of translog, and indeed the translog.uncommitted_size_in_bytes statistics correlate well with the difference between the sizes of primary and replica. The total size of the translog is much larger on the problematic shards too, on both primary and replica, because the lagging local checkpoint prevents the global checkpoint from advancing, which in turn prevents old parts of the translog from being cleaned up. I think this goes some way to explaining the imbalance in sizes between the different primaries as well, particularly since their document counts are so close.
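
If you want to check other indices for the same symptom, a jq query along these lines over the same kind of shard-level stats dump should surface it. The stats fields (seq_no.max_seq_no, seq_no.local_checkpoint, translog.uncommitted_size_in_bytes and flush.total) are the ones discussed above; output keys like checkpoint_lag are just labels chosen here for illustration:

$ cat 20180407_index_shards_stats.json | jq '.indices["applicationlogs-2018.04.07"].shards
    | map_values(map({primary: .routing.primary,
                      max_seq_no: .seq_no.max_seq_no,
                      local_checkpoint: .seq_no.local_checkpoint,
                      checkpoint_lag: (.seq_no.max_seq_no - .seq_no.local_checkpoint),
                      uncommitted_translog_bytes: .translog.uncommitted_size_in_bytes,
                      flushes: .flush.total}))'

On a healthy shard copy the local checkpoint tracks the maximum sequence number closely, so a persistently large checkpoint_lag is the thing to look for.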

It is possible that this is fixed by #29125: the high flush counts point in that direction, but I'm not sure that explains the lagging local checkpoint.

I am not going to be available to look at this in more depth for the next week or so, but I will raise it with the team. In the meantime, it looks like rebuilding the replicas should get everything back on track.
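
One way to rebuild the replicas, assuming Elasticsearch is reachable on localhost:9200 and you can tolerate this index briefly running without replicas, is to drop number_of_replicas to 0 and then put it back:

$ curl -XPUT 'localhost:9200/applicationlogs-2018.04.07/_settings' -H 'Content-Type: application/json' -d '{"index.number_of_replicas": 0}'
$ curl -XPUT 'localhost:9200/applicationlogs-2018.04.07/_settings' -H 'Content-Type: application/json' -d '{"index.number_of_replicas": 1}'

The first call discards the existing (bloated) replica copies, and the second causes fresh replicas to be built by copying the current primaries.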

The excerpt from the stats was obtained as follows:

$ cat 20180407_index_shards_stats.json | jq '.indices["applicationlogs-2018.04.07"].shards | map_values(map({primary:.routing.primary,translog,seq_no,flush}))'
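
For reference, if you want to capture an equivalent dump from a live cluster, something like this should produce it (again assuming the default host and port, and adjusting the index name as needed):

$ curl -s 'localhost:9200/applicationlogs-2018.04.07/_stats?level=shards' > 20180407_index_shards_stats.json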