I have a problem where the same query, executed multiple times against some indices in my cluster, returns two different result sets that alternate with each subsequent request: some documents appear in one result set and the others in the other one. This started happening at some point and now seems to affect every new document that I index into the problematic indices. Some indices still work as expected, but I haven't been able to identify what distinguishes them from the problematic ones.
In our setup we have 2 nodes, with number_of_shards: 6 and number_of_replicas: 1 for each index. The Elasticsearch version number is "6.5.2".
For some reason all primary shards are currently allocated on the master node, while the replicas are on the second node, which I find strange, but I'm not sure if it's related to the main problem.
When I set the preference parameter on the GET /_search request to target the non-master node, I get only the documents created up to a certain date and the new documents are missing, whereas targeting the master node returns the newly inserted documents as expected.
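A minimal sketch of how this per-node comparison could be scripted, assuming the preference value `_only_nodes:<node-id>` is used to pin the search to one node. The base URL, index name and node ID are placeholders, and `search_url` and `diff_hits` are hypothetical helper names, not part of any client library.

```python
from urllib.parse import urlencode


def search_url(base_url, index, node_id):
    """Build a _search URL that is served only by shard copies on one node."""
    # Assumption: the node is selected with the `_only_nodes:<node-id>` preference.
    query = urlencode({"preference": "_only_nodes:" + node_id})
    return "{}/{}/_search?{}".format(base_url.rstrip("/"), index, query)


def diff_hits(ids_node_a, ids_node_b):
    """Return the document IDs that only one of the two nodes returned."""
    a, b = set(ids_node_a), set(ids_node_b)
    return {"only_a": sorted(a - b), "only_b": sorted(b - a)}
```

Running the same query against both URLs and passing the returned `_id` values to `diff_hits` would list exactly which documents are missing on one side.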
When I index a document into the problematic index and execute GET /_cat/shards before and after, I see the docs value and the seq_no.max value (the maximum sequence number) increase only for the primary shard and not for the replica. On a different index where I don't observe this problem, the seq_no.max values are identical between the primary and replica shards for the same shard number, which is not the case for any of the shards on the problematic index.
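The primary/replica comparison above can be automated. The following is only a sketch: it assumes the rows come from GET /_cat/shards/<index>?format=json&h=index,shard,prirep,node,seq_no.max (where values are returned as strings), and `lagging_replicas` is a hypothetical helper name.

```python
from collections import defaultdict


def lagging_replicas(rows):
    """Return (index, shard, node, gap) for replicas behind their primary.

    `rows` is the parsed JSON list from the _cat/shards call described above.
    """
    copies = defaultdict(list)
    for row in rows:
        copies[(row["index"], row["shard"])].append(row)

    lagging = []
    for (index, shard), group in sorted(copies.items()):
        primary = next((r for r in group if r["prirep"] == "p"), None)
        # Unassigned shard copies report no sequence number; skip them.
        if primary is None or primary.get("seq_no.max") is None:
            continue
        primary_seq = int(primary["seq_no.max"])
        for r in group:
            if r["prirep"] != "r" or r.get("seq_no.max") is None:
                continue
            gap = primary_seq - int(r["seq_no.max"])
            if gap > 0:
                lagging.append((index, shard, r["node"], gap))
    return lagging
```

An empty result means every assigned replica has caught up with its primary; any entry names a shard copy that is missing operations.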
I also tried creating a new test index with the same shard/replica settings and indexed a couple of documents into it. They were all searchable on both nodes, and searches without the preference parameter returned consistent results.
My conclusion is that new documents don't get replicated to the replica shards, even though I get 2 successful shards in the response when indexing. This seemed like it might be the so-called split-brain problem, but when I execute GET /_cat/master on both nodes I get the same result.
I was able to solve the problem by setting the number of replicas to 0 and then back to 1 in order to recreate the replicas. The primary shards also get distributed evenly across the nodes during this process. Nevertheless, I would like to get to the root of the problem and avoid it in the future, so that I don't have to recreate the replicas every time it happens.
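For reference, a sketch of the replica reset as two PUT /<index>/_settings requests. The base URL and index name are placeholders, the function names are hypothetical, and the requests are only built here, not sent.

```python
import json
import urllib.request


def replica_settings_request(base_url, index, replicas):
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url="{}/{}/_settings".format(base_url.rstrip("/"), index),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def recreate_replicas_requests(base_url, index, replicas=1):
    """Return the two requests: drop replicas to 0, then restore them."""
    return [
        replica_settings_request(base_url, index, 0),
        replica_settings_request(base_url, index, replicas),
    ]
```

Each request could be sent with `urllib.request.urlopen(req)`. This only works around the symptom by rebuilding the replicas from the primaries; it does not explain why they fell behind.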
Please let me know if there is anything I can try in order to discover the root cause of this problem.