Replica synchronization

Right now we're able to recreate a situation with ES 1.6 where a short connectivity problem (not long enough for a node to be disconnected from the cluster) leads to temporary replica desynchronization. For a short period of time it is possible for users to get different results for the same search query. In our project we need to be absolutely sure that no such divergence appears.

To prevent this kind of situation, we need a way to tell that all replicas are in sync.

We considered using the document version, but we couldn't find a way to get a document's version from a specific (or from every) shard replica. Another idea was using shard versions, but we use the NEST client, which hasn't been released for ES 2.0 yet. As far as I know, the shard version is only available via the REST API since ES 2.0. And I'm not even sure this approach works, since the version is not incremented per operation.

So the question remains: is there a way to confirm replica synchronization, even at the cost of some performance?

Additionally, is there a way to get a document from a specific replica?

Two potential solutions:

If your indexing is relatively light, it'd probably be easiest to check the sync_id marker on the indices: Synced flush API | Elasticsearch Guide [8.11] | Elastic

Synced flush was added in 1.6 (although I don't remember in which patch version; you'd have to check whether your 1.6 supports it). When indexing halts for 5 minutes, the shards synchronize and set a new "commit point", which is represented by the sync_id marker. If you compare the primary's sync_id against each replica's, you can determine whether they are synced up.

However, this only works if your indexing isn't continuous, so that ES has a chance to execute a synced flush.
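As a rough sketch (with my_index standing in for your real index), you could trigger a synced flush manually and then look at the sync_id of each shard copy in the commit section of the shard-level index stats:

POST /my_index/_flush/synced

GET /my_index/_stats/commit?level=shards

If the primary and all its replicas report the same sync_id in their commit user_data, their contents are identical; a copy that is missing the marker (or carries a different one) has taken writes since the last synced flush.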

Alternatively, you can ask ES for the "authoritative copy" of the document by always going to the primary shard. This can be done using the preference parameter: Request body search | Elasticsearch Guide [8.11] | Elastic

You can use the preference parameter on all read requests (search, GET, etc.). To always go to the primary, you'd do something like:

GET /_search?preference=_primary
{
    "query": {
        "match": {
            "title": "elasticsearch"
        }
    }
}
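The same parameter works on single-document GETs too; for example, with a hypothetical index, type and ID:

GET /my_index/my_type/1?preference=_primary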

The downside is fairly obvious: you'll always be hitting the primary shard to get the data. So it is best to use this only when necessary, and allow your non-critical queries to hit the replicas. But this is really the only way to guarantee that a client will see the most up-to-date copy, since the primary will always represent the authoritative, most recent version of the data.

Yep, using the same preference param, you can specify ?preference=_shards:2 to read from shard 2, or something like ?preference=_only_node:xyz to read the shard copy that lives on node xyz, and so on.
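For example (again with a hypothetical index name, and xyz standing in for a real node ID):

GET /my_index/_search?preference=_shards:2

GET /my_index/_search?preference=_only_node:xyz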

Thank you very much for your answer!

We use synced flush heavily for our cluster restarts. Thank you for that feature; without it, restarting our 20 TB cluster would be a total pain. But I totally missed the sync_id marker.

Synced flush may do the trick; we rarely write directly to ES, as most of the data gets there via a continuously running indexer that sends data in large bulks. I will try to fire a synced flush request between two bulks and see if it works, and if it works fast enough.
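Something like this, fired between two bulks (with my_index standing in for our real index); if the response reports success for every shard copy, they all share the same sync_id:

POST /my_index/_flush/synced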

Totally missed that preference too! So ashamed of myself right now; thank you a lot for clearing that up :slight_smile: Next time I'll totally re-check all the documentation before asking a question.

Great to see Elasticsearch grow and spread, you guys really do an amazing job!

By the way, right now ES doesn't persist the cluster state, and that may lead to a situation where, after a restart, data is restored from stale replicas because the copies with the newer version haven't started yet. Are there any plans to fix this somehow, or is it just the way ES is? Because obviously, the absence of a persistent cluster state has its advantages, like flexibility. Since ES is all about being elastic, keeping it that way would be understandable.

https://www.elastic.co/guide/en/elasticsearch/guide/current/_important_configuration_changes.html#_recovery_settings
Maybe I missed something in the docs again? Right now we use the settings from the page above, all kinds of recover_after..., but those wouldn't help us in the case when ES has started and an index request has already been fired. That would increment the versions of the running shards and automatically make the not-yet-started ones outdated, even if they hold data the current shards do not.

Woo, glad to hear it! Restarts have always been painful, so we're excited to hear that synced flush is helping a lot of people. Previously we'd hear of clusters taking literally a week to perform a full rolling restart due to reshuffling shards.

No worries! ES is a complex piece of software, it's unrealistic for users to know all the nooks and crannies (especially stuff like preference, which has a pretty niche use). That's why we hang out on the forums :smile:

Yep, good question. Currently the only way to guarantee that is to set recover_after_nodes to the size of your entire cluster, so you can be sure all the existing primaries come online at the same time.
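As a sketch, for a hypothetical 20-node cluster, that would look something like this in elasticsearch.yml:

# don't start recovery until every node in the cluster is present
gateway.recover_after_nodes: 20
gateway.expected_nodes: 20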

You may also want to block operations until the cluster health has gone green (e.g. via a _cluster/health?wait_for_status=green call or similar) to ensure that all shards have started completely, although that shouldn't strictly be necessary (if the primary is "online" but still INITIALIZING, the indexing command will block anyway).
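For example (the 60s timeout is just an illustration):

GET /_cluster/health?wait_for_status=green&timeout=60s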

Work is ongoing to persist the "active" shards to the cluster state, just like you suggested. That'd allow the cluster to automatically use the last active primaries, instead of potentially promoting stale replicas.

Looks like it is slated for 3.0... you can follow the progress here:

Big thanks for that :slight_smile: It feels amazing to have you guys here ready to help!

Wow, great! Thanks for the link; I will definitely follow the progress carefully. As always, it's great to hear that ES is getting better and better every day!