My team has been running Elasticsearch on AWS EC2 for search for about two years, and we are regularly bothered by an out-of-sync issue between primary and replica shards. Our cluster has two main indices, each with 6 primary shards and 2 replicas per shard (6 primaries plus 12 replicas, 18 shards per index in total). One index is used for searching, so it holds only partial data but has more fielddata in its mapping. The other holds the full data but is only queried by ID. Every Monday our elasticsearch-consumer bulk loads the same dataset into both indices.
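For context, the weekly load goes through the standard _bulk NDJSON API. This is only a minimal sketch of how a consumer might build one request per index, not our actual code; the index names, document IDs, and fields are made up for illustration:

```python
import json

def build_bulk_body(index, docs):
    """Build an NDJSON _bulk request body: one action/metadata line plus
    one source line per document, terminated by a trailing newline."""
    lines = []
    for doc in docs:
        # Action line: index this document into the given index under its id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# The same dataset goes to both indices (names here are hypothetical).
docs = [{"id": "1", "publishDate": "2016-01-04"}]
for index in ("search-index", "full-index"):
    body = build_bulk_body(index, docs)
    # POST `body` to http://<host>:9200/_bulk (one request per batch).
```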
The issue is this: after bulk loading the latest data, we also bulk delete the documents that were not updated before the bulk load's start timestamp. After this has been running for a while, a query such as search-index/search-type/_search?sort=publishDate still shows a few documents published one or two months ago as live in the index. Hitting the stats API with _stats?level=shards shows that the primary and replica shards report different document counts.
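The comparison I do by eye on the _stats?level=shards output can be automated. A sketch, assuming the shard-level stats JSON shape (each shard number maps to a list of copies, each with a "routing" block and a "docs" block):

```python
def shard_doc_mismatches(stats):
    """Given parsed JSON from GET /_stats?level=shards, return
    (index, shard, primary_count, replica_counts) tuples for every
    shard whose replica doc counts differ from the primary's."""
    mismatches = []
    for index, data in stats["indices"].items():
        for shard_id, copies in data["shards"].items():
            primaries = [c for c in copies if c["routing"]["primary"]]
            replicas = [c for c in copies if not c["routing"]["primary"]]
            if not primaries:
                continue  # primary not assigned; nothing to compare against
            p_count = primaries[0]["docs"]["count"]
            r_counts = [r["docs"]["count"] for r in replicas]
            if any(rc != p_count for rc in r_counts):
                mismatches.append((index, shard_id, p_count, r_counts))
    return mismatches
```

Feeding it the parsed response from the stats endpoint lists exactly which shards have drifted, which is handy when only a few of the 18 copies are off.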
Also, if I run the same timestamp query repeatedly, Elasticsearch returns different results on different tries. Sometimes the total is 0, but sometimes it is 6, 8, or more. If I set the preference to _primary, the total is always 0, which is what we expect; correspondingly, if I change the preference to _replica, the total is greater than 0.
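That flip-flopping can be turned into a deterministic check: run the identical query once per shard-copy preference and compare the hit totals. A sketch, where `search` stands in for the HTTP call to _search and its response shape (both are assumptions, not a real client API):

```python
def totals_by_preference(search, preferences=("_primary", "_replica")):
    """Run the same query once per preference and collect hit totals."""
    return {p: search(preference=p)["hits"]["total"] for p in preferences}

def copies_in_sync(search):
    """True only if every shard-copy preference reports the same total."""
    return len(set(totals_by_preference(search).values())) == 1
```

If `copies_in_sync` returns False, the randomness of the default round-robin preference explains why the plain query sometimes sees the stale documents and sometimes does not.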
All the findings above suggest that after our consumer's bulk load and bulk deletion (with a 15-minute interval between the two operations), Elasticsearch does not successfully sync the shard copies. I tried running _flush/synced, but it fails because we keep indexing data in the meantime, and pausing indexing to do the flush is not an option for us.
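Since synced flush only fails for shards with in-flight indexing, one workaround I have considered is retrying it and accepting success only when no shard copy reports a failure. A sketch under that assumption, with `do_flush` standing in for POST /_flush/synced returning the parsed JSON (the helper and its parameters are hypothetical):

```python
import time

def synced_flush_until_clean(do_flush, retries=5, wait=30):
    """Retry a synced flush until the response reports zero failed
    shard copies, sleeping `wait` seconds between attempts. Returns
    True on a fully clean flush, False if every attempt had failures."""
    for _ in range(retries):
        resp = do_flush()
        if resp["_shards"]["failed"] == 0:
            return True
        time.sleep(wait)  # hope for a quieter indexing moment next try
    return False
```

This does not fix the divergence itself; it only gives the flush a better chance of landing between indexing bursts.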
Does anyone have any thoughts about solving this issue? Thanks in advance.