Elasticsearch primary and replica shards not in sync after bulk load

Hi Guys,

My team has been running Elasticsearch on AWS EC2 for search for about two years, and we keep being bothered by an out-of-sync issue between primary and replica shards. Our cluster has mainly two indices, each with 6 primary shards and 2 replicas per shard (so 6 primary shards and 12 replica shards, 18 shards in total per index). One index is used for searching, so it holds only partial data but has more fielddata in its mapping. The other holds the full data but is only queried by ID. Every Monday our elasticsearch-consumer bulk loads the same dataset into both indices.
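For context, the weekly load goes through the standard _bulk API; a minimal sketch of one batch (the document IDs and body fields here are illustrative, not our real data):

```
POST _bulk
{ "index": { "_index": "search-index", "_type": "search-type", "_id": "42" } }
{ "publishDate": "2018-04-02T00:00:00Z", "title": "..." }
{ "index": { "_index": "search-index", "_type": "search-type", "_id": "43" } }
{ "publishDate": "2018-04-02T00:00:00Z", "title": "..." }
```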

The current issue is this: after the bulk load of the latest data, we also run a bulk delete of any documents that were not updated after the timestamp at which the bulk load began. After this has been running for a while, a query against search-index/search-type/_search?sort=publishDate still shows a few documents published one or two months ago live in the index. When I hit the stats API with _stats?level=shards, the results show that primary and replica shards have different document counts.
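For anyone trying to reproduce the check: in the shard-level stats response, each shard number lists all of its copies, and routing.primary distinguishes primary from replica, so a mismatch in docs.count between copies of the same shard is what shows the divergence. Roughly:

```
GET search-index/_stats?level=shards

// in the response, compare for each copy of shard 0:
//   indices.search-index.shards.0[*].docs.count
//   indices.search-index.shards.0[*].routing.primary
```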

Also, if I run the timestamp query repeatedly, Elasticsearch returns different results on different tries. Sometimes the total is 0, but sometimes it is 6, 8, or more. If I set the preference to _primary, the result is always 0, which is what we want. Correspondingly, if I change the preference to _replica, I see more than 0 results.
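The check I am running is essentially this (the cutoff in the range query is illustrative; publishDate is the real field):

```
GET search-index/search-type/_search?preference=_primary
{
  "size": 0,
  "query": {
    "range": { "publishDate": { "lt": "now-4w" } }
  }
}
```

Swapping the preference to _replica on the same request returns a non-zero total.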

All the findings above indicate that, in our consumer, after the bulk load and bulk delete (there is a 15-minute interval between the two operations), Elasticsearch does not successfully sync up the shard copies. I tried running _flush/synced, but it fails because we keep indexing data in the meantime. It is not possible for us to pause indexing and do the flush.
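For reference, the call that fails is the synced flush; when there are ongoing operations on a shard, the response reports those shard copies in a failures section rather than marking them synced:

```
POST search-index/_flush/synced
```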

Does anyone have any thoughts about solving this issue? Thanks in advance.

What version are you on?

The current version is 5.1.1, but we are planning to upgrade to 6.2.3.

Does anyone have any thoughts? Could it be a bug?

Are you changing the refresh interval during bulk load? If so, do you run a manual refresh once the bulk upload has completed?
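For reference, the pattern being asked about is disabling the refresh interval for the duration of the load and restoring it afterwards, followed by an explicit refresh (the "1s" value restored here is the default):

```
PUT search-index/_settings
{ "index": { "refresh_interval": "-1" } }

// ... run the bulk load ...

PUT search-index/_settings
{ "index": { "refresh_interval": "1s" } }

POST search-index/_refresh
```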

We did not change the refresh interval during the bulk load. Yes, we manually refresh the index and then run _flush/synced, but it fails because of some pending operations. The thing is, I can still see documents that were published a few weeks ago. I am not sure why Elasticsearch fails to remove them in the delete pass that follows the bulk load.

Do you have any cluster or index settings that are not standard? How many nodes do you have in the cluster? What does your elasticsearch.yml file look like?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.