We have seen very occasional reports of this, and have been investigating, but it has proved extremely tricky for us to reproduce. We need help from someone like you who sees this problem regularly enough to be useful in diagnosis.
Please could you tell us more about this cluster and the environment in which it lives? For instance: what version are you running exactly? What is it running on? How frequently are you doing the bulk-delete-and-insert that you describe? What other activity does the cluster see?
Would you be willing to run the support diagnostics tool on your cluster and share the results? Don't post them here: I'll get you an email address to use if you can run this.
Would you be able to enable the following very verbose logging, and then toggle the replica count to 0 and back to 1 to make sure everything is in sync? To repeat, this logging is very verbose: it will cause extra I/O and may fill up your disks, so proceed with caution.
"logger.org.elasticsearch.action.bulk": "TRACE",
"logger.org.elasticsearch.cluster.service": "DEBUG",
"logger.org.elasticsearch.indices.recovery": "TRACE",
"logger.org.elasticsearch.index.shard": "TRACE"
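If it helps, here is a rough sketch of one way to apply the logger levels listed above and toggle the replica count, using the cluster settings API (`PUT /_cluster/settings`) and the index settings API (`PUT /<index>/_settings`). Some of this is my assumption rather than anything specific to your setup: the `http://localhost:9200` address and the `my-index` name are placeholders, I've used `persistent` settings, the helper names are made up for illustration, and it assumes no authentication or TLS. Please adapt it before running anything.

```python
import json
import urllib.request

# Logger levels copied from the list above.
VERBOSE_LOGGERS = {
    "logger.org.elasticsearch.action.bulk": "TRACE",
    "logger.org.elasticsearch.cluster.service": "DEBUG",
    "logger.org.elasticsearch.indices.recovery": "TRACE",
    "logger.org.elasticsearch.index.shard": "TRACE",
}


def logging_body(enable=True):
    """Body for PUT /_cluster/settings.

    Assumption: settings are applied as "persistent". Sending null (None)
    for a logger resets it to its default level.
    """
    return {
        "persistent": {
            name: (level if enable else None)
            for name, level in VERBOSE_LOGGERS.items()
        }
    }


def replica_body(count):
    """Body for PUT /<index>/_settings that sets the replica count."""
    return {"index": {"number_of_replicas": count}}


def plan(index):
    """Ordered (path, body) pairs: enable logging, then replicas 0, then 1."""
    return [
        ("/_cluster/settings", logging_body(True)),
        (f"/{index}/_settings", replica_body(0)),
        (f"/{index}/_settings", replica_body(1)),
    ]


def put(base_url, path, body):
    """Send one JSON PUT request and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run(base_url, index):
    """Apply every step in order. base_url and index are placeholders."""
    return [put(base_url, path, body) for path, body in plan(index)]
```

You would call something like `run("http://localhost:9200", "my-index")`. Note that this sketch does not wait for the replicas to finish recovering, and once you have captured the logs, `put(base_url, "/_cluster/settings", logging_body(False))` should reset the loggers to their defaults.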
In case it helps, so far we've only been able to reproduce anything like this by simulating some very strange networking failures that coincide with shards being reallocated, and even then it's very sporadic.