We are building an application where we want consistency over availability (in the Elasticsearch store). By consistency we mean that when we perform a search, the result should contain all the data that was indexed before the search. If the search result contains partial or stale data, the application should throw an error.
When a shard fails to respond to a read request, the coordinating node will select another copy from the same replication group and send the shard level search request to that copy instead. Repetitive failures can result in no shard copies being available. In some cases, such as _search, Elasticsearch will prefer to respond fast, albeit with partial results, instead of waiting for the issue to be resolved (partial results are indicated in the shards header of the response).
Currently, we have taken the following steps -
Issue a refresh after indexing. Does a refresh bring all the replicas in sync with the primary? We also check the response of the refresh request to ensure the request has not failed for any shard, and we only proceed if the refresh response reports "_shards":{"failed":0}.
When a search is performed, we check the response to ensure that there are no failed shards, again by checking that "_shards":{"failed":0}.
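The shard-header check in the two steps above can be sketched as a small helper that we run on both the refresh and the search responses. This is a minimal illustration; the response dicts are made up, but the "_shards" field names follow the standard header format:

```python
def assert_no_failed_shards(response: dict) -> None:
    """Raise if the _shards header reports any failed shard copies."""
    shards = response.get("_shards", {})
    if shards.get("failed", 0) != 0:
        raise RuntimeError(f"{shards['failed']} shard(s) failed: {shards}")

# Illustrative responses, as returned by POST /my-index/_refresh
# and GET /my-index/_search respectively.
refresh_response = {"_shards": {"total": 10, "successful": 10, "failed": 0}}
search_response = {"_shards": {"total": 5, "successful": 5, "failed": 0}}

assert_no_failed_shards(refresh_response)  # passes silently
assert_no_failed_shards(search_response)   # passes silently
```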
I have the following questions -
The above steps will guarantee consistency only if ES always reads from an in-sync replica. If the response comes from a replica that is out of sync (stale), is there a way to know? Or does search always happen on a replica that is in sync?
Should we set 'wait_for_active_shards' to 'all'? Or is this overkill, and are the above steps enough to ensure consistency?
A refresh acts on all shard copies (primary and replicas) but only ensures that all copies have the same sets of visible documents if there is no ongoing indexing. It sounds like in your case this is true.
Elasticsearch may theoretically return results from a failed replica if both the coordinating node and the replica are unaware that the shard has been failed. This is independent of the wait_for_active_shards setting, because this is checked on the coordinating node too and may therefore also be stale.
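For reference, wait_for_active_shards is a query parameter on write operations (e.g. the index API), not on search. A hedged sketch of how such a request URL could be built, assuming a hypothetical host and index name:

```python
from urllib.parse import urlencode

def index_url(host: str, index: str, doc_id: str,
              wait_for_active_shards: str = "all") -> str:
    """Build the URL for an index request that asks the coordinating node
    to wait until all shard copies are active before proceeding with the
    write. Host and index names here are hypothetical."""
    params = urlencode({"wait_for_active_shards": wait_for_active_shards})
    return f"{host}/{index}/_doc/{doc_id}?{params}"

print(index_url("http://localhost:9200", "my-index", "1"))
# http://localhost:9200/my-index/_doc/1?wait_for_active_shards=all
```

As noted above, this check is performed against the coordinating node's view of the cluster state, so even with wait_for_active_shards=all the guarantee is not absolute if that view is stale.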
What does "failed replica" mean here? Does it mean a shard that is unavailable at the time of search, or does it mean the shard has partial data?
Is there a way to know when this happens? We are checking the search response to ensure that we only use the result when the number of failed shards is 0 ("_shards":{"total":10,"successful":5,"failed":0}). Is this enough to ensure we are always seeing fresh data?
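Note that failed == 0 alone may not prove a search was complete: the header quoted above reports total: 10 but successful: 5. A minimal sketch of a stricter completeness check, assuming the standard search _shards header (skipped defaults to 0 when absent):

```python
def search_is_complete(response: dict) -> bool:
    """True only if every targeted shard either succeeded or was
    legitimately skipped, and none failed."""
    shards = response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    skipped = shards.get("skipped", 0)
    failed = shards.get("failed", 0)
    return failed == 0 and successful + skipped == total

# The header quoted above: only 5 of 10 shards responded successfully,
# so a strict check would treat the result as partial.
print(search_is_complete(
    {"_shards": {"total": 10, "successful": 5, "failed": 0}}))  # False
```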
Good question, "failed" is an overloaded term here.
Searches are served by started shards, and the set of such shards is stored in the cluster state. The act of removing a shard from this set is called "failing" it and this was what I meant: a shard that was started but which since has been failed.
The issue is that the coordinating node is not guaranteed to have the latest cluster state to hand when serving a search. It's very likely that it does in a healthy cluster, because failing a shard waits for an acknowledgement from every node before returning, but there is a timeout to this wait so if some nodes are slow to acknowledge then the absolute guarantee is lost.