We create one index per hour. At some point during the current hour, we enable replication on the previous hour's index. Before enabling replication, we wait for all merges to complete and then flush the index. We never update or delete documents; we only bulk ingest.
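For concreteness, the sequence described above could be sketched roughly as the ordered list of REST calls below. This is only an illustration under assumptions: the helper name `replication_plan` is hypothetical, the replica count of 1 is a guess, and the calls are only built here, not sent to a cluster.

```python
import json


def replication_plan(index, replicas=1):
    """Build the REST calls for one index as (method, path, body) tuples, in order.

    `replicas=1` is an assumed value, not something stated in this post.
    """
    return [
        # Poll merge stats until `merges.current` reports 0 in-flight merges.
        ("GET", f"/{index}/_stats/merge", None),
        # Flush so segments are committed and the translog can be trimmed.
        ("POST", f"/{index}/_flush?wait_if_ongoing=true", None),
        # Read per-shard translog stats to confirm the translog is near-empty.
        ("GET", f"/{index}/_stats/translog?level=shards", None),
        # Enable replication by raising the replica count.
        ("PUT", f"/{index}/_settings",
         json.dumps({"index": {"number_of_replicas": replicas}})),
    ]


if __name__ == "__main__":
    for method, path, body in replication_plan("lognado-2017.01.06.11"):
        print(method, path, body or "")
```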
Normally, once replication completes, every replica shard matches its primary in size. Sometimes, however, the replicas do NOT match the primaries in size.
For these indices, I see that the translog is large. However, before we enable replication, we flush and verify that the translog is effectively empty (normally 559 bytes remain). After replication completes, we check the translog again, and it's the same size.
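A translog check of this kind could look something like the sketch below. It assumes the input is the parsed JSON from `GET /<index>/_stats/translog?level=shards`, with one entry per shard copy carrying `routing.primary` and `translog.size_in_bytes`. The function names and the `limit_bytes` threshold are hypothetical.

```python
def translog_sizes(stats):
    """Return sorted (index, shard, 'p' or 'r', size_in_bytes) rows, one per shard copy.

    `stats` is assumed to be the parsed body of a shard-level translog stats response.
    """
    rows = []
    for index, index_stats in stats.get("indices", {}).items():
        for shard, copies in index_stats.get("shards", {}).items():
            for copy in copies:
                role = "p" if copy["routing"]["primary"] else "r"
                size = copy["translog"]["size_in_bytes"]
                rows.append((index, int(shard), role, size))
    return sorted(rows)


def oversized_translogs(stats, limit_bytes):
    """Keep only the shard copies whose translog is larger than `limit_bytes`."""
    return [row for row in translog_sizes(stats) if row[3] > limit_bytes]
```

Running this once right after the flush and again after replication finishes would show which shard copy, primary or replica, is holding the large translog.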
Yet at some later point, after replication has been enabled and has completed, we see that the primary shard is unusually large and the replica does NOT match its size.
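The size comparison itself can be sketched as below. This assumes the input is the plain-text output of `GET /_cat/shards?h=index,shard,prirep,state,store&bytes=b`, so each line carries the index, shard number, `p` or `r`, state, and store size in bytes. The function name and the `tolerance_bytes` parameter are hypothetical.

```python
from collections import defaultdict


def find_size_mismatches(cat_shards_text, tolerance_bytes=0):
    """Return (index, shard, primary_bytes, replica_bytes) for mismatched shard copies."""
    sizes = defaultdict(lambda: {"p": [], "r": []})
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 5:
            # Copies that are still initializing or unassigned may have no store size.
            continue
        index, shard, prirep, state, store = fields[:5]
        if state != "STARTED":
            # Only compare copies that have finished recovery.
            continue
        sizes[(index, int(shard))][prirep].append(int(store))

    mismatches = []
    for (index, shard), group in sorted(sizes.items()):
        for primary in group["p"]:
            for replica in group["r"]:
                if abs(primary - replica) > tolerance_bytes:
                    mismatches.append((index, shard, primary, replica))
    return mismatches
```

Skipping copies that are not yet `STARTED` keeps a replica that is still recovering from being reported as a mismatch.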
We do see one error in the logs on the replica node:
[2017-01-06 12:26:22,244][DEBUG][action.admin.indices.stats] [pryor] [indices:monitor/stats] failed to execute operation for shard [[lognado-2017.01.06.11], node[iCo2w8dCTXmaHgwymNT-mg], [R], v, s[INITIALIZING], a[id=a8kBDs3YQ6OL2S_OLXd1eQ], unassigned_info[[reason=REPLICA_ADDED], at[2017-01-06T12:16:12.920Z]]]
[lognado-2017.01.06.11][[lognado-2017.01.06.11]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
That is the only message regarding that shard on any of the primary, replica, or master nodes.
Any ideas would be great.