Replica Shards Much Larger Than Primaries

Hi,

I've been seeing some strange behaviour in 6.2.3 with replica shards being 2-4x larger than their corresponding primaries (number_of_replicas: 1):

An example from /_cat/shards:

applicationlogs-2018.04.04 2 p STARTED 21977491 29.5gb 10.187.13.6 elasticsearch-data-hot-3
applicationlogs-2018.04.04 2 r STARTED 21985312 55.3gb 10.187.8.6  elasticsearch-data-hot-4
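For reference, the columns are the default _cat/shards output (index, shard, prirep, state, docs, store, ip, node), from a request along the lines of:

GET _cat/shards/applicationlogs-2018.04.04?v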

When this happens, the replicas continue to grow at a faster rate than their primaries, which results in bloated disks.

Link to the stats for an example index: https://gist.github.com/Evesy/fab61e87cd277c56e8d5bfa62a3b3feb -- Doc counts look about right; I'm not sure if there's anything else in there that points towards the issue. This is happening on more than one index. All the indices are ingest-only (this example is around 2k docs/sec) and nothing is ever deleted.

Also, the output of /_segments: https://gist.github.com/Evesy/14db575e6ec7c39a627de386633189ff (worth noting this is also occurring on indices without best_compression enabled).

Looks like a similar issue is outlined here: Strange size difference in shards since upgrading to 6.2

Any help would be greatly appreciated!

Cheers,
Mike


Could you provide the output of the following?

GET /applicationlogs-2018.04.04/_stats?level=shards

Also, if you rebuild the replica (by setting number_of_replicas to 0 and then back to 1), is the rebuilt version also twice as large?
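For reference, dropping and rebuilding the replicas is just a couple of settings updates, something like the following (adjust the index name as appropriate):

PUT /applicationlogs-2018.04.04/_settings
{
  "index": { "number_of_replicas": 0 }
}

PUT /applicationlogs-2018.04.04/_settings
{
  "index": { "number_of_replicas": 1 }
}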

Hi David,

When we've rebuilt the replicas they do appear to come back at roughly the same size as the primary shards. However, we've only been rebuilding after the index has stopped ingesting, with a force merge performed shortly after, so it's difficult to say for sure whether the issue would come back.
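(The force merge is nothing special -- just the standard call against the finished index, something along the lines of:

POST /applicationlogs-2018.04.04/_forcemerge?max_num_segments=1
)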

I'll try to get the shard stats output for you when I'm back online. However, since previous indices have been merged down to 1 segment per shard, I'm guessing the output for those will be unhelpful? If so, I'll gather stats from a newer index that's currently experiencing the issue.

Yes, if you've cleaned up the problem on that index and the issue recurs on newer indices then please grab the stats from one exhibiting the problem instead.

I've collected some new stats for an index that's showing the problem:

applicationlogs-2018.04.07 5 r STARTED 20439597 14.7gb 10.187.9.25  elasticsearch-data-hot-1
applicationlogs-2018.04.07 5 p STARTED 20441468 14.5gb 10.187.21.7  elasticsearch-data-hot-5
applicationlogs-2018.04.07 2 p STARTED 20442814 31.1gb 10.187.9.25  elasticsearch-data-hot-1
applicationlogs-2018.04.07 2 r STARTED 20444535 64.5gb 10.187.11.27 elasticsearch-data-hot-2
applicationlogs-2018.04.07 4 p STARTED 20437839 14.5gb 10.187.13.19 elasticsearch-data-hot-3
applicationlogs-2018.04.07 4 r STARTED 20441808 14.2gb 10.187.21.7  elasticsearch-data-hot-5
applicationlogs-2018.04.07 3 p STARTED 20436107 24.5gb 10.187.11.27 elasticsearch-data-hot-2
applicationlogs-2018.04.07 3 r STARTED 20440543 57.3gb 10.187.13.19 elasticsearch-data-hot-3
applicationlogs-2018.04.07 1 p STARTED 20440523   26gb 10.187.10.25 elasticsearch-data-hot-0
applicationlogs-2018.04.07 1 r STARTED 20444270 60.5gb 10.187.8.21  elasticsearch-data-hot-4
applicationlogs-2018.04.07 0 r STARTED 20433938 14.6gb 10.187.10.25 elasticsearch-data-hot-0
applicationlogs-2018.04.07 0 p STARTED 20433435 13.6gb 10.187.8.21  elasticsearch-data-hot-4

I've created some gists:

Index Stats
Segments
Index Shard Stats

It might be worth mentioning that we're not using any custom document routing, since I notice the primary shards themselves have somewhat varying sizes.

Thanks,
Mike

I see two (possibly related) things in this excerpt from the shard-level stats. The inflated shards have significantly higher flush counts, and a local checkpoint that lags the maximum sequence number by a long way. The lagging local checkpoint means Elasticsearch keeps hold of a lot of translog, and indeed the translog.uncommitted_size_in_bytes statistics correlate well with the difference between the sizes of primary and replica. The total size of the translog is much larger on the problematic shards too, on both primary and replica, because the lagging local checkpoint prevents the global checkpoint from advancing, which in turn prevents old parts of the translog from being cleaned up. I think this goes some way to explaining the imbalance in sizes between the different primaries, particularly since their document counts are so close.

It is possible that this is fixed by #29125: the high flush counts point in that direction, but I'm not sure that explains the lagging local checkpoint.

I am not going to be available to look at this in more depth for the next week or so, but I will raise it with the team. In the meantime, it looks like rebuilding the replicas should get everything back on track.

The excerpt from the stats was obtained as follows:

$ cat 20180407_index_shards_stats.json | jq '.indices["applicationlogs-2018.04.07"].shards | map_values (map({primary:.routing.primary,translog,seq_no,flush}))'
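Roughly the same excerpt should also be obtainable straight from Elasticsearch with filter_path, if that's easier than post-processing with jq -- something like this (field paths written from memory, so adjust if they don't quite match):

GET /applicationlogs-2018.04.07/_stats?level=shards&filter_path=indices.*.shards.*.routing.primary,indices.*.shards.*.translog,indices.*.shards.*.seq_no,indices.*.shards.*.flush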

Thanks for looking into this!

Interestingly, it was because of #28350 that we upgraded from 6.1.2 in the first place, as we believed we were suffering from the symptoms described in another post somewhere on this forum.

I can confirm that rebuilding the replicas is a suitable workaround for the time being.

Cheers,
Mike

