Replica synchronization

Right now we're able to recreate a situation with ES 1.6 where a short connectivity problem (not long enough for a node to be disconnected from the cluster) leads to temporary replica desynchronization. For a short period of time it is possible for users to get different results for the same search query. In our project we need to be absolutely sure that no such divergence appears.

To prevent this kind of situation, we need a way to tell that all replicas are in sync.

We considered using the document version, but we couldn't find a way to get a document's version from a specific (or from every) shard replica. Another idea was using shard versions, but we use the NEST client, which hasn't been released for ES 2.0 yet. As far as I know, the shard version is only available via the REST API since ES 2.0. And I'm not even sure this approach works, since the version is not incremented per operation.

So the question remains: is there a way to confirm replica synchronization, even at the cost of some performance?

Additionally, is there a way to get a document from a specific replica?

Two potential solutions:

If your indexing is relatively light, it'd probably be easiest to check the sync_id marker on the indices: Synced flush API | Elasticsearch Guide [8.11] | Elastic

Synced flush was added in 1.6 (although I don't remember in which patch version; you'd have to check whether your 1.6 supports it). When indexing halts for 5 minutes, the shards synchronize and set a new "commit point", which is represented by the sync_id marker. If you compare the primary's sync_id against each replica's, you can determine whether they are synced up.

However, this only works if your indexing isn't continuous, so that ES has a chance to execute a synced flush.
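As a rough sketch (with my_index standing in for your real index), you could trigger a synced flush manually and then look at the sync_id of each shard copy in the commit section of the shard-level index stats:

POST /my_index/_flush/synced

GET /my_index/_stats/commit?level=shards

If the primary and all its replicas report the same sync_id in their commit user_data, their contents are identical; a copy that is missing the marker (or carries a different one) has taken writes since the last synced flush.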

Alternatively, you can ask ES for the "authoritative copy" of the document by always going to the primary shard. This can be done using the preference parameter: Request body search | Elasticsearch Guide [8.11] | Elastic

You can use the preference parameter on all read requests (search, GET, etc.). To always go to the primary, you'd do something like:

GET /_search?preference=_primary
{
    "query": {
        "match": {
            "title": "elasticsearch"
        }
    }
}
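The same parameter works on single-document GETs too; for example, with a hypothetical index, type and ID:

GET /my_index/my_type/1?preference=_primary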

The downside is fairly obvious: you'll always be hitting the primary shard to get the data. So it is best to use this only when necessary, and allow your non-critical queries to hit the replicas. But this is really the only way to guarantee that a client will see the most up-to-date copy, since the primary will always represent the authoritative, most recent version of the data.

Yep, using the same preference param, you can specify ?preference=_shards:2 to read from shard 2, or something like ?preference=_only_node:xyz to read the shard copy that lives on node xyz, and so on.
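For example (again with a hypothetical index name, and xyz standing in for a real node ID):

GET /my_index/_search?preference=_shards:2

GET /my_index/_search?preference=_only_node:xyz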

Thank you very much for your answer!

We use synced flush heavily for our cluster restarts. Thank you for that feature; without it, restarting our 20 TB cluster would be a total pain. But I totally missed the sync_id marker.

Synced flush may do the trick; we rarely write directly to ES, as most of the data gets there via a continuously running indexer that sends data in large bulks. I will try to fire a synced flush request between two bulks and see if it works, and if it works fast enough.
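Something like this, fired between two bulks (with my_index standing in for our real index); if the response reports success for every shard copy, they all share the same sync_id:

POST /my_index/_flush/synced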

Totally missed that preference too! So ashamed of myself right now; thank you a lot for clearing that up :slight_smile: Next time I'll totally re-check all the documentation before asking a question.

Great to see Elasticsearch grow and spread, you guys really do an amazing job!

By the way, right now ES doesn't persist the cluster state, and that may lead to a situation where, after a restart, data is restored from stale replicas because the copies with the newer version haven't started yet. Are there any plans to fix this somehow, or is it just the way ES is? Because obviously, the absence of a persistent cluster state has its advantages, like flexibility. Since ES is all about being elastic, keeping it that way would be understandable.

https://www.elastic.co/guide/en/elasticsearch/guide/current/_important_configuration_changes.html#_recovery_settings
Maybe I missed something in the docs again? Right now we use the settings from the page above, all kinds of recover_after..., but those wouldn't help us in the case when ES has started and an index request has already been fired. That would increment the versions of the running shards and automatically make the not-yet-started ones outdated, even if they hold data the current shards do not.

Woo, glad to hear it! Restarts have always been painful, so we're excited to hear that synced flush is helping a lot of people. Previously we'd hear of clusters taking literally a week to perform a full rolling restart due to reshuffling shards.

No worries! ES is a complex piece of software, it's unrealistic for users to know all the nooks and crannies (especially stuff like preference, which has a pretty niche use). That's why we hang out on the forums :smile:

Yep, good question. Currently the only way to guarantee that is to set recover_after_nodes to the size of your entire cluster, so you can be sure all the existing primaries come online at the same time.
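As a sketch, for a hypothetical 20-node cluster, that would look something like this in elasticsearch.yml:

# don't start recovery until every node in the cluster is present
gateway.recover_after_nodes: 20
gateway.expected_nodes: 20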

You may also want to block operations until the cluster health has gone green (e.g. via a _cluster/health?wait_for_status=green call or similar) to ensure that all shards have started completely, although that shouldn't strictly be necessary (if the primary is "online" but still INITIALIZING, the indexing command will block anyway).
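For example (the 60s timeout is just an illustration):

GET /_cluster/health?wait_for_status=green&timeout=60s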

Work is ongoing to persist the "active" shards to the cluster state, just like you suggested. That'd allow the cluster to automatically use the last active primaries, instead of potentially promoting stale replicas.

Looks like it is slated for 3.0... you can follow the progress here:

Big thanks for that :slight_smile: It feels amazing to have you guys here ready to help!

Wow, great! Thanks for the link; I will definitely follow the progress carefully. As always, it's great to hear that ES is getting better and better every day!