Missing data from replica shards after delete by query and index


#1

Hi All,

We are on ES 2.4.4. We use delete by query to delete all docs of a given type in an index, and immediately afterwards we index the data again. Sometimes the doc count on the replica is lower than on the primary. I am using the "_primary"/"_replica" search preference to get the counts. If we delete the entire index and index the data again, things are fine.

In Pre-Prod we have a 2-node cluster and in Prod a 6-node cluster. The issue happens in both environments. Each index has 2 shards and 1 replica. Can you please suggest what the root cause could be and how to either troubleshoot or fix the issue?
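For reference, the count comparison looks roughly like the sketch below. It only builds the two search URLs (with `size=0`, so just the hit total comes back) using the `preference` parameter. The host, index, and type names are placeholders, and `ShardCountUrls` / `searchUri` are hypothetical helper names, not part of any Elasticsearch client.

```java
import java.net.URI;

class ShardCountUrls {
    // Builds a search URI that returns only the hit total (size=0) and is
    // routed by the given preference, e.g. "_primary" or "_replica".
    static URI searchUri(String baseUrl, String index, String type, String preference) {
        return URI.create(baseUrl + "/" + index + "/" + type
                + "/_search?size=0&preference=" + preference);
    }

    public static void main(String[] args) {
        String base = "http://localhost:9200"; // assumed host, adjust as needed
        // "my_index" and "my_type" are placeholder names.
        System.out.println(searchUri(base, "my_index", "my_type", "_primary"));
        System.out.println(searchUri(base, "my_index", "my_type", "_replica"));
    }
}
```

Comparing `hits.total` from the two responses gives the doc count seen on the primary and on the replica copies.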

Thank you
Ashok


(Christian Dahlqvist) #2

Have you waited for the operation to complete and run a refresh before getting the count?


#3

Yes, we do wait and refresh. Here is the exact code we are using:

    // Delete every document of TYPE in INDEX (delete-by-query plugin API in ES 2.x)
    DeleteByQueryResponse rsp = new DeleteByQueryRequestBuilder(client, DeleteByQueryAction.INSTANCE)
            .setIndices(INDEX)
            .setTypes(TYPE)
            .setSource(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()).size(5000).toString())
            .execute()
            .actionGet();
    // Refresh so the deletes are visible to subsequent searches
    RefreshResponse refreshResponse = client.admin().indices().refresh(new RefreshRequest(INDEX)).actionGet();

#4

A couple of observations:

  1. Instead of delete by query, I scrolled through all docs and deleted them with bulk requests, and the same issue is still seen
  2. We are not relying on auto-generated doc IDs
  3. If we completely delete the index and re-index all the data, there are no issues

Could you please suggest what could be going wrong for us? Thank you so much.


(Christian Dahlqvist) #5

Elasticsearch 2.4 is quite old and a lot of effort has gone into improving resiliency and durability in later versions. If I recall correctly, replication in Elasticsearch 2.x was asynchronous, so it could be more susceptible to network issues. Is your cluster deployed within a single DC with fast and reliable connections between the nodes?

To be sure that you are indeed waiting for the job to complete and replication of the changes to finish, can you run the steps manually (verifying that they all have completed before continuing) and verify you see the same problem then?
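A minimal sketch of that kind of check, assuming you supply your own functions that refresh the index and return the doc count from the primary and from the replica copies. The class and method names are made up for illustration, and the suppliers in `main` are stand-ins rather than real cluster calls.

```java
import java.util.function.LongSupplier;

class ReplicaCountCheck {
    // Polls both counts until they match or the attempts run out.
    // Returns true if the counts agreed within the allowed attempts.
    static boolean waitForMatch(LongSupplier primaryCount, LongSupplier replicaCount,
                                int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long p = primaryCount.getAsLong();
            long r = replicaCount.getAsLong();
            if (p == r) {
                return true;
            }
            System.out.printf("attempt %d: primary=%d replica=%d%n", attempt, p, r);
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag set
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in suppliers: a real check would refresh and query the cluster.
        long[] replicaValues = {90, 95, 100};
        int[] calls = {0};
        boolean ok = waitForMatch(() -> 100L,
                () -> replicaValues[Math.min(calls[0]++, 2)], 5, 10);
        System.out.println("counts match: " + ok);
    }
}
```

If the counts converge after a few attempts, the difference is only a timing effect. If they never converge, the replica really is missing documents.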


#6

Thank you, Christian, for your response.

All our machines are AWS EC2 instances in a single region but in different availability zones. As I mentioned above, we don't have this issue when the index is deleted completely and indexed again. The issue only happens when we delete the data for a few types and add it back. So I am assuming network connectivity is not the issue.

I will explore the manual option.

