We are on ES 2.4.4. We use delete by query to delete all docs of a given type in an index, and then immediately index the data again. Sometimes the doc count on the replica is lower than on the primary. I am using the "_primary"/"_replica" preference to get the counts. If we delete the entire index and index the data again, things are fine.
In Pre-Prod we have a 2-node cluster and in Prod a 6-node cluster. The issue happens in both environments. Each index has 2 shards and 1 replica. Can you please suggest what the root cause could be and how to either troubleshoot or fix the issue?
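For reference, the primary/replica count comparison described above can be sketched roughly as follows. This is a minimal sketch, not the exact script in use: the helper names (`count_url`, `http_get_json`, `compare_counts`) are hypothetical, and it assumes the `_count` endpoint is queried once with `preference=_primary` and once with `preference=_replica`, as in the description.

```python
import json
from urllib.request import urlopen


def count_url(base_url, index, doc_type, preference):
    # Build the _count URL for one document type, pinned to primary or
    # replica shard copies via the preference parameter.
    return "{}/{}/{}/_count?preference={}".format(
        base_url.rstrip("/"), index, doc_type, preference
    )


def http_get_json(url):
    # Plain HTTP GET that parses the JSON body (no auth or retries; assumption).
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def compare_counts(base_url, index, doc_type, fetch=http_get_json):
    # Ask for the count on primaries and on replicas, then report both
    # values and whether they agree.
    counts = {}
    for pref in ("_primary", "_replica"):
        body = fetch(count_url(base_url, index, doc_type, pref))
        counts[pref] = body["count"]
    counts["match"] = counts["_primary"] == counts["_replica"]
    return counts
```

Calling `compare_counts("http://localhost:9200", "myindex", "mytype")` (host, index and type names are placeholders) returns both counts plus a `match` flag, which makes it easy to log the mismatch right after re-indexing.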
Elasticsearch 2.4 is quite old and a lot of effort has gone into improving resiliency and durability in later versions. If I recall correctly, replication in Elasticsearch 2.x was asynchronous, so it could be more susceptible to network issues. Is your cluster deployed within a single DC with fast and reliable connections between the nodes?
To be sure that you are indeed waiting for the job to complete and replication of the changes to finish, can you run the steps manually (verifying that they all have completed before continuing) and verify you see the same problem then?
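One way to script that kind of step-by-step check is sketched below. It is only an illustration under assumptions: `wait_until` and `run_steps` are hypothetical helpers, and the actual actions (delete by query, re-index, refresh) and verifications (for example, comparing counts on primary and replica) are passed in as callables rather than shown here.

```python
import time


def wait_until(check, timeout_s=60.0, interval_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    # Poll check() until it returns True or the timeout expires.
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def run_steps(steps, **wait_kwargs):
    # steps is a list of (name, action, verify) tuples. Each action runs
    # once, then verify is polled; the next step starts only after the
    # previous one has been confirmed complete.
    completed = []
    for name, action, verify in steps:
        action()
        if not wait_until(verify, **wait_kwargs):
            raise RuntimeError("step did not complete: " + name)
        completed.append(name)
    return completed
```

For example, a "delete" step could verify that the count for the type is zero on both primary and replica before the "index" step begins, and the "index" step could verify that both counts equal the expected number of documents.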
All our machines are AWS EC2 instances in a single region but in different availability zones. As I mentioned above, we don't have this issue when the index is deleted completely and indexed again. The issue only happens when we delete the data for a few types and add it back. So I am assuming network connectivity is not the issue.