Documents not being replicated to replica shards


I have a problem where the same query, executed multiple times against some indices in my cluster, returns alternating results: two different result sets alternate with each subsequent request, with some documents appearing in one set but not the other. This started happening at some point and now seems to affect every new document I index into the problematic indices. Other indices still work as expected, but I haven't been able to identify any difference that could explain the problem.

In our setup we have 2 nodes with number_of_shards: 6 and number_of_replicas: 1 for each index. Elasticsearch version: 6.5.2.
For some reason all primary shards are currently allocated on the master node, while all replicas are on the second node. I find that strange, but I'm not sure whether it's related to the main problem.

When I set the preference parameter on the GET /_search request, the non-master node returns only documents created up to a certain date and the new documents are missing, while the master node returns the newly indexed documents as expected.
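The node-by-node comparison above can be reproduced with something like the following, assuming a hypothetical index name `my-index`; the node ID placeholder comes from `GET /_cat/nodes?v&h=id,name`:

```
GET /my-index/_search?preference=_only_nodes:<node-id>
{
  "query": { "match_all": {} },
  "sort": [ { "date": "desc" } ]
}
```

Running this once with each node's ID and comparing the hit counts and the newest `date` values shows whether one copy of the data is stale.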

When I index a document into the problematic index and execute GET /_cat/shards before and after, I see the docs value and the seq_no.max value (the maximum sequence number) increase only for the primary shard, not for the replica. On a different index where I don't observe this problem, the docs and seq_no.max values are identical between the primary and replica of each shard number, which is not the case for any shard of the problematic index.
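For reference, the per-shard comparison can be done with a request along these lines (`my-index` is a placeholder for the actual index name):

```
GET /_cat/shards/my-index?v&h=index,shard,prirep,state,docs,seq_no.max,node
```

On a healthy index, the `docs` and `seq_no.max` columns match between the `p` and `r` rows of each shard number; on the problematic index only the primary's values advance after indexing.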

I also tried creating a new test index with the same shard/replica settings, where I indexed a couple of documents; they were all searchable from both nodes and the results were consistent without setting the preference parameter.

My conclusion is that new documents don't get replicated to the replica shards, even though the indexing response reports 2 successful shards. This seemed like it might be the so-called split-brain problem, but when I execute GET /_cat/master on both nodes I get the same result.

I was able to solve the problem by setting the number of replicas to 0 and then back to 1 in order to recreate the replicas. The primary shards also get distributed evenly across the nodes during this process. Nevertheless, I would like to get to the root of the problem and avoid it in the future, so I don't have to recreate the replicas every time it happens.
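Concretely, the workaround looks roughly like this (`my-index` is a placeholder; the second request is sent after the first has been acknowledged):

```
PUT /my-index/_settings
{
  "index": { "number_of_replicas": 0 }
}

PUT /my-index/_settings
{
  "index": { "number_of_replicas": 1 }
}
```

Dropping to 0 deletes the stale replica copies, and raising the value back to 1 forces a fresh peer recovery from the primaries.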

Please let me know if there is anything I can try in order to discover the root cause of this problem.

Are both nodes master eligible? If so, have you set minimum_master_nodes to 2 to avoid split brain scenarios? Do you have a refresh interval set that could affect when documents become searchable?

Both nodes are master eligible, minimum_master_nodes has the default value (I believe 1). I can try to set it to 2, but I'm not sure if split brain was the problem, because I see using GET /_cat/master that they agree on which one is the master at the moment when I index a new document, plus the issue doesn't exist if I create a new index or recreate the replicas.
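If I go ahead with that change, my understanding is that with two master-eligible nodes the recommended value is 2 (a majority of 2), set either in elasticsearch.yml or dynamically:

```
PUT /_cluster/settings
{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}
```

The trade-off is that with only two nodes this means no master can be elected if either node is down, which is why a dedicated third (tiebreaker) node is usually recommended.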

The refresh_interval was set to 1s by default. The problem is that when I set the preference to the node holding the replicas and sort the documents by date, I don't get any documents from roughly the last month, while with the preference set to the node holding the primary shards I do get the new documents, so I don't think it's a 1s/30s refresh difference.
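For completeness, the effective refresh interval (including the default when none is set explicitly) can be checked with something like this, where `my-index` is a placeholder:

```
GET /my-index/_settings?include_defaults=true&filter_path=**.refresh_interval
```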

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.