Results from different nodes with custom (string) value for preference=

Hi,

When passing a custom (string) value as the preference parameter of a search request, we get results from different nodes when repeating the same query (with the same custom value). The custom value seems to work as documented on some of our clusters, but not on others.

Here is a minimal test case that shows this behavior on one of our clusters:

    PUT preference_test
    PUT preference_test/_doc/test
    {
      "test":"test"
    }
    GET preference_test/_search?preference=abc
    {
      "explain": true
    }

The _search request returns result sets from different nodes, as can be seen with explain set to true. When we run the query multiple times, we sometimes get response A (node bg7QeyVvTcWN4SnqKbtKng) and sometimes response B (node FS4KODZJSG2lPq3tza8i0Q).

A:

    {
      "took": 0,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
          {
            "_shard": "[preference_test][4]",
            "_node": "bg7QeyVvTcWN4SnqKbtKng",
            "_index": "preference_test",
            "_type": "_doc",
            "_id": "test",
            "_score": 1,
            "_source": {
              "test": "test"
            },
            "_explanation": {
              "value": 1,
              "description": "*:*",
              "details": []
            }
          }
        ]
      }
    }

B:

    {
      "took": 0,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
          {
            "_shard": "[preference_test][4]",
            "_node": "FS4KODZJSG2lPq3tza8i0Q",
            "_index": "preference_test",
            "_type": "_doc",
            "_id": "test",
            "_score": 1,
            "_source": {
              "test": "test"
            },
            "_explanation": {
              "value": 1,
              "description": "*:*",
              "details": []
            }
          }
        ]
      }
    }

This is our cluster setup:

    {
      "name": "bg7QeyV",
      "cluster_name": "049947771584:test",
      "cluster_uuid": "9MxgRNC0Sk6MwMy1iOkulA",
      "version": {
        "number": "6.3.1",
        "build_flavor": "oss",
        "build_type": "zip",
        "build_hash": "e6df130",
        "build_date": "2018-08-01T07:23:36.254471Z",
        "build_snapshot": false,
        "lucene_version": "7.3.1",
        "minimum_wire_compatibility_version": "5.6.0",
        "minimum_index_compatibility_version": "5.0.0"
      },
      "tagline": "You Know, for Search"
    }

Is there any cluster setting that influences this behavior?
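For what it's worth, any explicitly configured cluster-level settings should be visible with a plain settings request (assuming the API is reachable on a managed cluster):

    GET _cluster/settings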

That is unexpected, but I have tried and failed to reproduce it on a 5-node 6.3.1 cluster with replicas on each node:

    $ curl 'http://localhost:9200/_cat/shards'
    i 0 p STARTED 1 3.3kb 127.0.0.1 node-3
    i 0 r STARTED 1 3.3kb 127.0.0.1 node-1
    i 0 r STARTED 1 3.3kb 127.0.0.1 node-4
    i 0 r STARTED 1 3.3kb 127.0.0.1 node-2
    i 0 r STARTED 1 3.3kb 127.0.0.1 node-0
    $ for n in $(seq 3); do for i in 9200 9201 9202 9203 9204; do curl -s http://localhost:$i'/i/_search?preference=abc' -H 'Content-type: application/json' --data-binary $'{"explain":true}'; echo; done; done
    {"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}
    {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_shard":"[i][0]","_node":"hVgSuGbWTfaR566eWnA2Bw","_index":"i","_type":"_doc","_id":"test","_score":1.0,"_source":{"test":"test"},"_explanation":{"value":1.0,"description":"*:*","details":[]}}]}}

The ?preference option really fixes a preference order over the copies of each shard, rather than pinning the search to a single copy, because nodes may reject searches (e.g. if they are overloaded or disconnected from the coordinating node). The coordinating node starts with the first copy in the list, but if that copy rejects the request it moves on to the later ones. Are you only seeing this effect on busier clusters?
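To make that concrete, here is a toy sketch (in Python, and emphatically not Elasticsearch's actual implementation; the hash function and names are purely illustrative) of how a fixed preference string can define a stable order over the copies of a shard while still allowing fallback when a copy rejects the search:

    # Toy model of the behaviour described above, NOT Elasticsearch's real code:
    # a preference string fixes an order over the copies of a shard, and the
    # coordinator falls back to later copies only when an earlier one rejects.
    import hashlib

    class Rejected(Exception):
        """Stands in for a copy rejecting the search (overloaded, disconnected, ...)."""

    def ordered_copies(preference, copies):
        # Hash the preference string to a stable offset and rotate the copy list,
        # so the same string always yields the same order.
        offset = int(hashlib.md5(preference.encode()).hexdigest(), 16) % len(copies)
        return copies[offset:] + copies[:offset]

    def search_shard(preference, copies, execute):
        # Try the copies in their fixed preference order.
        for copy in ordered_copies(preference, copies):
            try:
                return execute(copy)
            except Rejected:
                continue  # same preference, next copy in the fixed order
        raise RuntimeError("all copies rejected the search")

    # With two copies and preference "abc", the first choice is always the same
    # copy; a rejection would make the result come from the other one instead.
    copies = ["bg7QeyVvTcWN4SnqKbtKng", "FS4KODZJSG2lPq3tza8i0Q"]
    print(search_shard("abc", copies, lambda node: node))

Under a model like this, the same preference string only ends up on a different copy if an earlier copy in the order rejects the search, or if the set of copies itself changes.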

It's not too easy to observe this happening, but you could try setting this on the coordinating node:

    logger.org.elasticsearch.action.search.TransportSearchAction: TRACE

Note that this will generate a lot of output and may affect performance if the coordinating node is serving a lot of other searches.
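
On a self-managed cluster the same logger can usually be switched on dynamically through the cluster settings API rather than via the configuration file, along these lines:

    PUT _cluster/settings
    {
      "transient": {
        "logger.org.elasticsearch.action.search.TransportSearchAction": "TRACE"
      }
    }

Setting the value back to null afterwards restores the default log level.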

Thanks, David, for trying to reproduce this. I'm not surprised you did not observe it: we only see it on one of our clusters as well, and that is not an extremely busy cluster either.

We'll see if we can get some more detail using the TRACE setting you suggested.

Just to check, are you observing that the search routing depends on which node is coordinating the search, or are you seeing inconsistent routing on a single coordinating node? Not that I can explain either case yet, but it'd help to narrow it down.

Unfortunately, we couldn't capture the TRACE logs for this, since this cluster runs on the AWS Elasticsearch service, and I don't think we can enable TRACE logging there.

Also, I'm not sure how to check your last question. If I understand correctly, any node can act as the coordinating node. Is there a way to target a specific coordinating node, or to see which node coordinated a specific request? On the AWS cluster we do not control all index settings, so it would have to be some sort of switch on the query itself.

I would love to reproduce this on a local machine where we control more of the settings and the logging, but as you already saw, this issue is hard to reproduce on other hardware.

I see, yes, that does make it much harder to investigate. It's also worth noting that you aren't running a genuine build of Elasticsearch 6.3.1 there, so we can't know how it differs from the code in front of me. A genuine build should report something like this:

    GET /

    # 200 OK
    # {
    #   "cluster_name": "elasticsearch",
    #   "tagline": "You Know, for Search",
    #   "name": "937UgCY",
    #   "version": {
    #     "build_snapshot": false,
    #     "build_hash": "eb782d0",
    #     "minimum_index_compatibility_version": "5.0.0",
    #     "build_flavor": "oss",
    #     "minimum_wire_compatibility_version": "5.6.0",
    #     "build_date": "2018-06-29T21:59:26.107521Z",
    #     "number": "6.3.1",
    #     "build_type": "tar",
    #     "lucene_version": "7.3.1"
    #   },
    #   "cluster_uuid": "Z2not9iuSVy1FpNjrJTBiA"
    # }

We are indeed running a slightly different build; see my initial post for the full version description. Compared with yours, the fields that differ are build_type (zip vs tar), build_hash (e6df130 vs eb782d0) and build_date (2018-08-01 vs 2018-06-29).
