It looks like, sometimes, one primary shard in an index is reported as larger than the others. This happens both before and after replication is enabled. Let me explain.
We create a new index every hour. Each index has 13 primary shards, with number_of_replicas initially set to 0.
These shards are fairly large, say up to 25g in size, depending on the time of day.
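Roughly, the hourly index creation looks like the sketch below (the host, the naming pattern, and the use of Python's requests library are just for illustration, not our exact tooling):

import datetime
import requests

ES = "http://192.168.180.157:9200"  # any node in the cluster; host/port are assumptions

def create_hourly_index(now=None):
    now = now or datetime.datetime.utcnow()
    name = now.strftime("index-%Y.%m.%d.%H")   # e.g. index-2016.11.02.00
    body = {
        "settings": {
            "number_of_shards": 13,    # 13 primaries per hourly index
            "number_of_replicas": 0,   # replicas are enabled later, after the flush
        }
    }
    resp = requests.put(f"{ES}/{name}", json=body)
    resp.raise_for_status()
    return name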
Using the _cat/shards API, the primary shards for any one index are mostly reported to be about the same size - something like 22.8gb. However, sometimes one of them will be 29.3gb. How can that be?
Our system flushes each index sometime after the top of the hour, once indexing has moved on to the new index. It then enables replication on the just-flushed index. Once replicated, the replicas all have the same "correct" size - even the replica of the primary that is "too large". In the example above, the replica of the 29.3gb primary shard is the expected 22.8gb in size.
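That flush-then-replicate step is essentially the following (again just a sketch; index name, host, and the requests library are placeholders):

import requests

ES = "http://192.168.180.157:9200"  # host/port are assumptions

def flush_and_replicate(index_name):
    # Force a flush so buffered operations are committed to disk.
    requests.post(f"{ES}/{index_name}/_flush").raise_for_status()
    # Turn replication on; each primary then gets copied to one replica.
    requests.put(
        f"{ES}/{index_name}/_settings",
        json={"index": {"number_of_replicas": 1}},
    ).raise_for_status()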
The top lines of the _cat/shards output look like this:
index-2016.11.02.00 0 p STARTED 15479815 21.6gb 192.168.180.157 node1
index-2016.11.02.00 0 r STARTED 15479815 21.6gb 192.168.180.150 node2
index-2016.11.02.00 1 p STARTED 15493333 21.6gb 192.168.180.140 node3
index-2016.11.02.00 1 r STARTED 15493333 21.6gb 192.168.180.131 node4
index-2016.11.02.00 2 p STARTED 15489454 29.3gb 192.168.180.130 node5
index-2016.11.02.00 2 r STARTED 15489454 21.4gb 192.168.180.148 node6
index-2016.11.02.00 3 p STARTED 15496716 21.7gb 192.168.180.152 node7
index-2016.11.02.00 3 r STARTED 15496716 21.7gb 192.168.180.138 node8
The doc counts match between primary and replica, but the size of that one primary (shard 2 on node5) does not.
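For anyone who wants to reproduce the comparison, here is a rough way to pull _cat/shards as JSON and flag primaries noticeably larger than their replicas (host and ratio threshold are arbitrary choices):

import requests

ES = "http://192.168.180.157:9200"  # host/port are assumptions

def find_oversized_primaries(index, ratio=1.2):
    resp = requests.get(
        f"{ES}/_cat/shards/{index}",
        params={"format": "json", "bytes": "b"},  # report store sizes in raw bytes
    )
    resp.raise_for_status()
    sizes = {}  # shard number -> {"p": primary bytes, "r": replica bytes}
    for row in resp.json():
        if row["state"] == "STARTED" and row["store"] is not None:
            sizes.setdefault(row["shard"], {})[row["prirep"]] = int(row["store"])
    for shard, s in sorted(sizes.items(), key=lambda kv: int(kv[0])):
        if "p" in s and "r" in s and s["p"] > ratio * s["r"]:
            print(f"shard {shard}: primary {s['p']} bytes, replica {s['r']} bytes")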
One guess is that compression is not enabled, or not working, on node5. However, we have hundreds of shards on each node, not all primaries on node5 are too large, and none of the replicas on that node are too large either.
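If it helps rule that guess out: store compression is controlled by the index-level index.codec setting rather than anything per-node, so it can be read back per index, for example like this (sketch, host assumed):

import requests

ES = "http://192.168.180.157:9200"  # host/port are assumptions

def get_codec(index):
    resp = requests.get(f"{ES}/{index}/_settings")
    resp.raise_for_status()
    index_settings = resp.json()[index]["settings"]["index"]
    # If "codec" is not set explicitly, the index uses the default codec
    # (LZ4-compressed stored fields); "best_compression" switches to DEFLATE.
    return index_settings.get("codec", "default")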