Our API uses Elasticsearch to return results based on scoring (relevance). In our use case, it's important we consistently return documents in the same sort order. We currently see that queries with larger result sets (approx. > 500 documents) return inconsistent scoring on successive runs despite no changes being made to the Elasticsearch indices.
The Elasticsearch documentation notes that scores are not reproducible and states: "The recommended way to work around this issue is to use a string that identifies the user that is logged in (a user id or session id for instance) as a preference. This ensures that all queries of a given user are always going to hit the same shards, so scores remain more consistent across queries."
However, despite using something like preference: foo and search_type: dfs_query_then_fetch in the query, we're still receiving inconsistent scoring, and because of this our API results are not ordered deterministically from request to request.
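For reference, here is a minimal sketch of how a request like ours is put together, written in Python with only the standard library. The host, the index name my_index, and the match query on a title field are placeholders for illustration, not our real values; only the preference and search_type URL parameters reflect what we actually send.

```python
import json
from urllib.parse import urlencode


def build_search_request(base_url, index, query, preference,
                         search_type="dfs_query_then_fetch"):
    """Return the (url, body) pair for an Elasticsearch _search request.

    preference and search_type are passed as URL query parameters.
    """
    params = urlencode({"preference": preference, "search_type": search_type})
    url = f"{base_url}/{index}/_search?{params}"
    body = json.dumps({"query": query})
    return url, body


# Hypothetical host, index, and query, used only to show the request shape.
url, body = build_search_request(
    "http://localhost:9200",
    "my_index",
    {"match": {"title": "example"}},
    preference="foo",
)
print(url)
print(body)
```

The same preference string is sent on every request, which is what we understood the documentation to be recommending.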
The cluster we're working with is relatively simple. It has two nodes: the primary shard for the index in question lives on node A and the replica lives on node B. When we specify a _prefer_nodes setting or the now-deprecated _primary_first in the preference parameter, we seem to receive the consistent scoring/sort-ordering we're looking for.
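As a rough sketch of the node-pinning variant, the preference value takes the form _prefer_nodes: followed by a comma-separated list of node IDs. The node IDs, host, and index name below are made up for illustration, and the helper names are ours rather than part of any client library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def prefer_nodes(node_ids):
    """Build a preference value that asks Elasticsearch to favor the given nodes."""
    if not node_ids:
        raise ValueError("at least one node ID is required")
    return "_prefer_nodes:" + ",".join(node_ids)


def search_url(base_url, index, preference):
    """Build a _search URL with the preference passed as a query parameter."""
    return f"{base_url}/{index}/_search?" + urlencode({"preference": preference})


# Hypothetical node ID; real IDs would come from the cluster's node listing.
url = search_url("http://localhost:9200", "my_index", prefer_nodes(["node_a_id"]))
print(url)
```

Note that urlencode percent-encodes the colon and comma in the preference value, so the printed URL will not show them literally.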
We would expect that using the documentation-prescribed approach of preference: <arbitrary_string> would resolve the scoring inconsistency. We'd prefer not to layer on application-level logic that detects which nodes have serviced queries with particular parameters and then uses _prefer_nodes to target the node that has historically served a request, just to get a consistent sort order.
Can someone help us better understand why preference isn't working for us the way we expect, and whether there's a more generally accepted way of achieving consistent sort order via query definition or cluster configuration?