Nodes Out of Sync

Hi,

We have several clusters running ES5.02, most with two nodes. One of them has a large index with three shards and one replica. What I've noticed is that I get different search results depending on which node ES decides to draw the data from. This prompted me to look at the other clusters, and I found the same behavior there. I think it was less noticeable because the datasets were smaller.

So, I've been researching how to keep nodes in sync, and what might cause them to go out of sync. All I've found so far is that ES automagically keeps nodes in sync. This obviously isn't happening.

To try to solve this, I thought maybe there was a damaged replica, so I set replicas to 0, then set it back to 1. The nodes were in sync for barely minutes before falling out of sync again.
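For readers who want to reproduce the replica reset described above, here is a minimal sketch using the index settings API (`PUT /{index}/_settings` with `index.number_of_replicas`). The base URL and index name are placeholders, not values from this thread, and the helper names are made up for illustration.

```python
import json
from urllib.request import Request, urlopen


def replica_settings_request(base_url, index, replicas):
    """Build a PUT /{index}/_settings request that sets number_of_replicas."""
    body = {"index": {"number_of_replicas": replicas}}
    return Request(
        f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def rebuild_replicas(base_url, index, replicas=1):
    """Drop all replicas, then restore them so they are copied from the primaries."""
    for count in (0, replicas):
        with urlopen(replica_settings_request(base_url, index, count)) as resp:
            print(count, json.load(resp))


# Hypothetical usage (placeholder host and index name):
# rebuild_replicas("http://localhost:9200", "products")
```

As the post says, this only helped for a few minutes, so it is a diagnostic step at best, not a fix for whatever keeps causing the divergence.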

The whole point of having multiple nodes is redundancy, and we don't have that if one node fails and the surviving node holds an incomplete or incorrect data set.

What is failing here? How do I fix it? Is there a manual command that I could put in a cron to keep the things synced?

Thanks in advance for all ideas and explanations. Feel free to ask for any info I've left out here.

You are right. That is the reason why ES works so well. :slight_smile:

You should never need to do that :slight_smile:

When that happened, was there anything in the logs of any of the nodes that might be relevant? If so, could you please post that as well :slight_smile:

Could you please post your elasticsearch.yml (with private info edited out, of course)? Also, what environment are you running the clusters in?

I get that, but assuming I DID need to do that, as in the instance I'm reporting, how could I?

Environment: Azure, Ubuntu 16.04

It seems there was an issue that was causing ES to restart, and it was happening so quickly that our monitoring wasn't catching it. I believe this is what caused the nodes to go out of sync, as described here: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

It says that "This is caused by operations that were in-flight when the primary shard failed and may not have been processed on all replica shards. Currently, the discrepancies are not repaired on primary promotion but instead would be repaired if replica shards are relocated"

So, since I'm not relocating shards, I'd need a way to manually trigger a resync.

Never do this. It'll break things as you are working against how Elasticsearch is designed to work.

What is out of sync exactly? How are you measuring this?
What sort of data is it? What does your query and results look like?
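One way to put a number on "out of sync" is to compare the document counts of the primary and replica copies of each shard. The sketch below reads `GET /_cat/shards/{index}?format=json` and reports shards whose copies disagree. The host and index name are placeholders, the helper names are made up for illustration, and the counts can differ briefly while indexing is in progress, so treat a mismatch as a hint rather than proof.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_shard_rows(base_url, index):
    """Fetch one row per shard copy from the cat shards API as JSON."""
    url = (f"{base_url}/_cat/shards/{index}"
           "?format=json&h=index,shard,prirep,state,docs,node")
    with urlopen(url) as resp:
        return json.load(resp)


def find_mismatched_shards(rows):
    """Return {(index, shard): {copy: docs}} for shards whose copies disagree."""
    counts = defaultdict(dict)
    for row in rows:
        # Skip copies that are unassigned or still initializing.
        if row.get("state") != "STARTED" or row.get("docs") is None:
            continue
        copy = f'{row["prirep"]}@{row["node"]}'  # e.g. "p@node-1"
        counts[(row["index"], row["shard"])][copy] = int(row["docs"])
    return {key: copies for key, copies in counts.items()
            if len(set(copies.values())) > 1}


# Hypothetical usage (placeholder host and index name):
# print(find_mismatched_shards(fetch_shard_rows("http://localhost:9200", "products")))
```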

As to the queries, they're created by a different team. The data is Product info.

When I do a search on the site, either with criteria or an "empty search", the results change back and forth. It's harder to notice on a high traffic site, but on a newer one, it's really easy to spot. It goes back and forth between two distinct sets of products displayed, depending on which node ES has decided to route the traffic to.
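The flip-flopping described above can be made reproducible with the search `preference` parameter, which controls which shard copies serve a request. As far as I understand the 5.x documentation, the values `_primary` and `_replica` pin a search to one kind of copy (they were removed in later major versions), and any custom string such as a session ID keeps repeated searches on the same copies. A minimal sketch, with placeholder host, index, and query, and helper names made up for illustration:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def search_request(base_url, index, query, preference):
    """Build a search request pinned to shard copies via `preference`."""
    params = urlencode({"preference": preference})
    return Request(
        f"{base_url}/{index}/_search?{params}",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def hit_ids(response_body):
    """Extract the document IDs, in ranked order, from a search response."""
    return [hit["_id"] for hit in response_body["hits"]["hits"]]


def run(request):
    """Send the request and decode the JSON response."""
    with urlopen(request) as resp:
        return json.load(resp)


# Hypothetical comparison of primaries against replicas:
# query = {"query": {"match_all": {}}}
# base = "http://localhost:9200"
# on_primary = hit_ids(run(search_request(base, "products", query, "_primary")))
# on_replica = hit_ids(run(search_request(base, "products", query, "_replica")))
# print(on_primary == on_replica)
```

Posting the two ID lists side by side would show concretely how the result sets differ.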

I've seen docs about flush and Synced Flush. Assuming I ran a synced flush at a time when we know we're not importing new info, would this help to actually sync the nodes, or does it only deal with flushing memory to disk?

You need to show exactly what you mean by out of sync. Providing your queries and responses is important, because search is relevance-based and both of these affect the results. It could be something as simple as differences in document counts across shards, which is not a fundamental problem for the operation of Elasticsearch, but will affect relevance and can be countered.
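The reply does not say which countermeasure is meant. One option that is commonly suggested for per-shard scoring differences is the `dfs_query_then_fetch` search type, which gathers term statistics from all shards before scoring. Assuming that is the intent, a request URL could be built like this (host and index name are placeholders, and the function name is made up for illustration):

```python
from urllib.parse import urlencode


def search_url(base_url, index, search_type="dfs_query_then_fetch"):
    """Build a search URL that asks for globally computed term statistics."""
    params = urlencode({"search_type": search_type})
    return f"{base_url}/{index}/_search?{params}"


# Hypothetical usage:
# print(search_url("http://localhost:9200", "products"))
```

This costs an extra round trip per search, and it would not help if the shard copies really do hold different documents.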

It's kinda hard to help here unless we can get direct answers. Flushing won't help.

I'll talk to one of the guys in Search and get back here asap. Likely Monday unfortunately.
