FilterAllocationDecider routing exclude setting behavior

Elasticsearch version: verified on 6.8 and 7.4.2
Description of the problem including expected versus actual behavior:
I was exploring the FilterAllocationDecider behavior with the cluster.routing.allocation.exclude settings, specifically how updates to this settings group work. I found that the behavior does not match my expectation and I think it is a bug in ES.

Let's say there are two attributes, "attr_1" and "attr_2", for which we want to set exclusion filters for FilterAllocationDecider to consider. FilterAllocationDecider has a consumer registered for the settings in the cluster.routing.allocation.exclude group. The expectation is that with each update-settings request for this group, the clusterExcludeFilters field in FilterAllocationDecider will be updated such that it keeps all the exclusions applied so far. However, it looks like it only keeps the difference between the previous and current settings for this group.

Steps to reproduce:
Case 1:

  • Request 1: set the value of only "attr_1" to be "1"
    • Verified clusterExcludeFilter in FilterAllocationDecider has key: "attr_1" with value: "1"
  • Request 2: set the value of only "attr_2" to be "abc"
    • The expectation is that clusterExcludeFilter in FilterAllocationDecider will have two keys, "attr_1" and "attr_2", with values "1" and "abc" respectively. However, the second request removes the first key "attr_1" and only keeps the second key "attr_2" in the clusterExcludeFilter

Case 2:

  • Request 1: set the value of "attr_1" to be "1" and "attr_2" to be "abc"
    • Verified clusterExcludeFilter in FilterAllocationDecider has key: "attr_1" with value: "1" and key "attr_2" with value: "abc"
  • Request 2: set the value of "attr_1" to be "1" and "attr_2" to be "def"
    • The expectation is that clusterExcludeFilter in FilterAllocationDecider will have two keys, "attr_1" and "attr_2", with values "1" and "def" respectively. However, the second request removes the first key "attr_1" and only keeps the second key "attr_2" with value "def" in the clusterExcludeFilter (the equivalent REST requests are sketched below)
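
(For reference, at the REST layer the two requests in Case 2 would correspond to something like the following; "attr_1" and "attr_2" stand for custom node attributes.)

PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.exclude.attr_1" : "1",
    "cluster.routing.allocation.exclude.attr_2" : "abc"
  }
}

PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.exclude.attr_1" : "1",
    "cluster.routing.allocation.exclude.attr_2" : "def"
  }
}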

I have created a test here covering various scenarios that show the behavior:

Details:
For the cluster.routing.allocation.exclude settings, FilterAllocationDecider registers a settings updater created using newAffixMapUpdater. This updater computes the difference between the current and previous settings for this group and calls the consumer with only that diff. There is a local result map which is populated with the changed settings and passed to the consumer here.
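
To make this concrete, below is a minimal, self-contained sketch of a diff-only updater. The class and method names are invented for illustration; this is not the actual newAffixMapUpdater code, just a simulation of the behavior described above.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch only: simulates an updater that forwards just the keys that changed
// between the previous and the current settings to its consumer.
public class DiffOnlyUpdaterSketch {

    private Map<String, String> previous = new HashMap<>();

    void apply(Map<String, String> current, Consumer<Map<String, String>> consumer) {
        Map<String, String> result = new HashMap<>();            // local map, not cumulative
        for (Map.Entry<String, String> e : current.entrySet()) {
            if (!e.getValue().equals(previous.get(e.getKey()))) {
                result.put(e.getKey(), e.getValue());             // only the diff survives
            }
        }
        previous = new HashMap<>(current);
        consumer.accept(result);
    }

    public static void main(String[] args) {
        DiffOnlyUpdaterSketch updater = new DiffOnlyUpdaterSketch();
        Consumer<Map<String, String>> decider =
            filters -> System.out.println("exclude filters rebuilt from: " + filters);

        Map<String, String> settings = new HashMap<>();
        settings.put("attr_1", "1");
        updater.apply(settings, decider);   // prints {attr_1=1}

        settings.put("attr_2", "abc");      // cluster settings now hold both attr_1 and attr_2
        updater.apply(settings, decider);   // prints {attr_2=abc} only; a consumer that rebuilds
                                            // its filters from this map loses attr_1
    }
}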

Proposal:

  • The result map in newAffixMapUpdater should be a member variable rather than a local variable, so that it keeps all the settings changed so far and always passes the full map to the consumer.
  • Alternatively, FilterAllocationDecider could keep adding/updating settings as they change instead of reinitializing a new DiscoveryNodeFilters on each settings update.

I think your observations are correct but your expectations are wrong, and this code is behaving as expected. Can you demonstrate that this is a bug using the REST API rather than using the handful of internal APIs that you are focussing on here?

@DavidTurner - Thanks for your reply. I created the test to easily show the behavior I observed while trying the REST API. I can try to share all the information by simulating the same test via the REST API as well. However, before doing that: are you saying that the exclusion filters should behave in such a way that already existing filters are removed and only the diff between the previous and current filters is used on any update? That would mean the exclusion settings update is not idempotent, right? Because if I have an exclusion for node attribute "attr_1" with value "1" and another update-settings request again puts an exclusion for the same "attr_1" with value "1", this would remove the exclusion completely.

Ok, if you're seeing something strange with the REST API then please share the details of that instead. No need for a formal test yet, just a sequence of commands with a surprising outcome. It's quite some effort to translate this kind of low-level test back to the REST layer.

Replying to your comment on GitHub:

I am new to the community, so I just wanted to know: is the process that, based on the discussion in the forum, it will be decided whether this is a bug or not?

Not exactly, but we'll try and help you to understand what you're seeing and maybe to come up with a clearer way of describing the issue for a bug report.

The problem with a REST-layer test is that I cannot show the internals of FilterAllocationDecider and how it computes the exclusion filter settings it uses to decide whether a node is eligible for shard allocation. That's why I shared the formal test. Steps to reproduce are as follows:

  1. Cluster has 2 nodes: node1 with attribute "_ip" set to "10.10.10.1" and node2 with "_ip" set to "10.10.10.2".
  2. Put the exclusion setting at the cluster level so that FilterAllocationDecider does not allocate any shard on node1, for example with the command below:
PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "10.10.10.1"
  }
}
  3. Create an index with a large shard count, e.g. 12, and verify that the shards are allocated only on node2 because of the exclusion in step 2.
  4. Apply the same setting again using the REST API:
PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "10.10.10.1"
  }
}
  5. BalancedShardsAllocator rebalances based on weights and moves some shards from node2 to node1. But since the exclusion setting is applied, node1 should not be eligible to hold any shards.

Reason:
Internally, FilterAllocationDecider removes all exclusions after the update in step 4, since there is no diff between the previous exclusion from step 2 and the new exclusion request in step 4. But when you query the cluster settings, the exclusion can still be seen.
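
One way to observe the decider's decision from the REST layer, without touching any internal APIs, would be the cluster allocation explain API. The index name below is just a placeholder for the index created in step 3; with include_yes_decisions=true the response lists each allocation decider's decision per node, so it shows whether FilterAllocationDecider still considers node1 excluded:

GET _cluster/allocation/explain?include_yes_decisions=true
{
  "index": "test_index",
  "shard": 0,
  "primary": true
}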

Also, I can see another GitHub issue reported here which describes the same behavior in a different scenario:

I'm still really struggling to understand exactly how to reproduce what you're seeing. I tried the sequence of steps that you described but everything works as expected for me. The only difference is that I'm using the node name rather than its _ip since that's much easier for me to test. I share a detailed transcript below. Can you please provide the exact commands you are using rather than trying to describe them in words? Words don't seem to be precise enough.

GET /_cat/allocation?v

# 200 OK
# shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
#      0           0b   370.7gb     94.8gb    465.6gb           79 127.0.0.1 127.0.0.1 node-1
#      0           0b   370.7gb     94.8gb    465.6gb           79 127.0.0.1 127.0.0.1 node-0
# 

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude.name": "node-0"
  }
}

# 200 OK
# {
#   "persistent": {},
#   "acknowledged": true,
#   "transient": {
#     "cluster": {
#       "routing": {
#         "allocation": {
#           "exclude": {
#             "name": "node-0"
#           }
#         }
#       }
#     }
#   }
# }

GET /_cluster/health?wait_for_events=languid&wait_for_no_relocating_shards

# 200 OK
# {
#   "number_of_pending_tasks": 1,
#   "status": "green",
#   "relocating_shards": 0,
#   "cluster_name": "elasticsearch",
#   "initializing_shards": 0,
#   "task_max_waiting_in_queue_millis": 0,
#   "delayed_unassigned_shards": 0,
#   "timed_out": false,
#   "unassigned_shards": 0,
#   "number_of_in_flight_fetch": 0,
#   "number_of_nodes": 2,
#   "active_shards": 0,
#   "active_primary_shards": 0,
#   "number_of_data_nodes": 2,
#   "active_shards_percent_as_number": 100.0
# }

PUT /i
{
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 0
  }
}

# 200 OK
# {
#   "shards_acknowledged": true,
#   "acknowledged": true,
#   "index": "i"
# }

GET /_cluster/health?wait_for_events=languid&wait_for_no_relocating_shards

# 200 OK
# {
#   "number_of_pending_tasks": 1,
#   "status": "green",
#   "relocating_shards": 0,
#   "cluster_name": "elasticsearch",
#   "initializing_shards": 0,
#   "task_max_waiting_in_queue_millis": 0,
#   "delayed_unassigned_shards": 0,
#   "timed_out": false,
#   "unassigned_shards": 0,
#   "number_of_in_flight_fetch": 0,
#   "number_of_nodes": 2,
#   "active_shards": 12,
#   "active_primary_shards": 12,
#   "number_of_data_nodes": 2,
#   "active_shards_percent_as_number": 100.0
# }

GET /_cat/allocation?v

# 200 OK
# shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
#      0           0b   370.7gb     94.8gb    465.6gb           79 127.0.0.1 127.0.0.1 node-0
#     12        2.6kb   370.7gb     94.8gb    465.6gb           79 127.0.0.1 127.0.0.1 node-1
# 

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude.name": "node-0"
  }
}

# 200 OK
# {
#   "persistent": {},
#   "acknowledged": true,
#   "transient": {
#     "cluster": {
#       "routing": {
#         "allocation": {
#           "exclude": {
#             "name": "node-0"
#           }
#         }
#       }
#     }
#   }
# }

GET /_cluster/health?wait_for_events=languid&wait_for_no_relocating_shards

# 200 OK
# {
#   "number_of_pending_tasks": 1,
#   "status": "green",
#   "relocating_shards": 0,
#   "cluster_name": "elasticsearch",
#   "initializing_shards": 0,
#   "task_max_waiting_in_queue_millis": 0,
#   "delayed_unassigned_shards": 0,
#   "timed_out": false,
#   "unassigned_shards": 0,
#   "number_of_in_flight_fetch": 0,
#   "number_of_nodes": 2,
#   "active_shards": 12,
#   "active_primary_shards": 12,
#   "number_of_data_nodes": 2,
#   "active_shards_percent_as_number": 100.0
# }

GET /_cat/allocation?v

# 200 OK
# shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
#     12        2.6kb   370.7gb     94.8gb    465.6gb           79 127.0.0.1 127.0.0.1 node-1
#      0           0b   370.7gb     94.8gb    465.6gb           79 127.0.0.1 127.0.0.1 node-0
# 

Sorry for the confusion. There was an issue in the experiment steps. Here is the updated one:

Setup:
Created a 4-node cluster and defined a custom attribute named "testAttr". Below are the values of this attribute for each node (a sketch of the per-node configuration follows the list):
Node 1 - "0"
Node 2 - "0"
Node 3 - "2"
Node 4 - "3"
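
For reference, a custom attribute like this is normally defined per node in elasticsearch.yml (or with -E on the command line); a sketch of the configuration matching the values above:

# node-1 and node-2
node.attr.testAttr: 0

# node-3
node.attr.testAttr: 2

# node-4
node.attr.testAttr: 3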

This experiment is simulating Case 1 in the very first post.

Get all node attrs:

==================
$ curl localhost:9200/_cat/nodeattrs?v

node   host      ip        attr     value
node-1 127.0.0.1 127.0.0.1 testAttr 0
node-4 127.0.0.1 127.0.0.1 testAttr 3
node-2 127.0.0.1 127.0.0.1 testAttr 0
node-3 127.0.0.1 127.0.0.1 testAttr 2

Get current cluster allocation

============================
$ curl localhost:9200/_cat/allocation?v

shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
     0           0b   190.8gb    274.8gb    465.6gb           40 127.0.0.1 127.0.0.1 node-1
     0           0b   190.8gb    274.8gb    465.6gb           40 127.0.0.1 127.0.0.1 node-2
     0           0b   190.8gb    274.8gb    465.6gb           40 127.0.0.1 127.0.0.1 node-4
     0           0b   190.8gb    274.8gb    465.6gb           40 127.0.0.1 127.0.0.1 node-3

Apply the transient setting to exclude node-3 and node-4 using the name attribute

=================================================================

curl -XPUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "cluster.routing.allocation.exclude.name" : "node-3,node-4"
  }
}
'

Get the current transient settings of the cluster:

============================================
$ curl -XGET localhost:9200/_cluster/settings?pretty

{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "name" : "node-3,node-4"
          }
        }
      }
    }
  }
}

Create an index with 12 shards and 1 replica:

=========================================

curl -XPUT "localhost:9200/test_index" -H 'Content-Type: application/json' -d'
 {
     "settings":{
         "number_of_shards": 12,
         "number_of_replicas": 1
     }
 }'
{"acknowledged":true,"shards_acknowledged":true,"index":"test_index"}

Get the allocation again. As expected, shards are allocated only on node-1 and node-2

========================================================================
$ curl localhost:9200/_cat/allocation?v

shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
     0           0b   191.8gb    273.8gb    465.6gb           41 127.0.0.1 127.0.0.1 node-3
    12        2.6kb   191.8gb    273.8gb    465.6gb           41 127.0.0.1 127.0.0.1 node-1
    12        2.6kb   191.8gb    273.8gb    465.6gb           41 127.0.0.1 127.0.0.1 node-2
     0           0b   191.8gb    273.8gb    465.6gb           41 127.0.0.1 127.0.0.1 node-4

Now apply a transient setting to exclude only node-3, but using the custom attribute. Since, based on the previous exclusion, both node-3 and node-4 are already excluded, this should be a no-op in terms of shard allocation. The expectation is that shards should still be only on node-1 and node-2

====================================================================

curl -XPUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "cluster.routing.allocation.exclude.testAttr" : "2"
  }
}
'

Get the cluster transient settings again. They show both the name-based and the custom-attribute-based exclusions:

=====================================================
$ curl -XGET localhost:9200/_cluster/settings?pretty

{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "name" : "node-3,node-4",
            "testAttr" : "2"
          }
        }
      }
    }
  }
}

But now when we look at the shard allocation, some shards have been rebalanced to node-4. Based on the transient settings above, the expectation is that node-4 should not be eligible for shard allocation.

========================================================================
$ curl localhost:9200/_cat/allocation?v

shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
     8        1.7kb   190.8gb    274.8gb    465.6gb           40 127.0.0.1 127.0.0.1 node-4
     0           0b   190.8gb    274.8gb    465.6gb           40 127.0.0.1 127.0.0.1 node-3
     8        1.7kb   190.8gb    274.8gb    465.6gb           40 127.0.0.1 127.0.0.1 node-2
     8        1.7kb   190.8gb    274.8gb    465.6gb           40 127.0.0.1 127.0.0.1 node-1

Aha, gotcha, I see the problem now. It needed to be two different exclude settings, whereas your post above set the _ip setting twice.

This is a bug for sure, but I think it's a bug in the settings infrastructure and not in the FilterAllocationDecider at all. It dates back to 6.1.0 and I suspect (but have not confirmed) it may have been introduced by #26819. How odd that two people noticed it in such quick succession more than two years later...

FilterAllocationDecider uses this affixMapUpdateConsumer. From the implementation of AffixMapUpdateConsumer we can see that it passes only the diff between the current and previous settings for the registered group, ref here. Then FilterAllocationDecider takes this diff settings map and creates a new DiscoveryNodeFilters instance from it, ref here.

Proposal:
I think FilterAllocationDecider should not create a new instance each time; instead it should update the previously created DiscoveryNodeFilters.
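
A rough sketch of this idea (with made-up names; not the actual FilterAllocationDecider code): keep a member map with every exclude setting seen so far, merge each incoming diff into it, and rebuild the filters from the merged view rather than from the diff alone.

import java.util.HashMap;
import java.util.Map;

// Sketch only: a consumer that accumulates exclude settings across updates
// instead of rebuilding its filters from the latest diff map alone.
public class CumulativeExcludeFiltersSketch {

    // Member map holding every exclude attribute seen so far.
    private final Map<String, String> clusterExcludeSettings = new HashMap<>();

    // Called by the settings infrastructure with only the changed entries.
    void onExcludeSettingsChanged(Map<String, String> changedEntries) {
        for (Map.Entry<String, String> e : changedEntries.entrySet()) {
            if (e.getValue() == null || e.getValue().isEmpty()) {
                clusterExcludeSettings.remove(e.getKey());   // an emptied value clears that filter
            } else {
                clusterExcludeSettings.put(e.getKey(), e.getValue());
            }
        }
        rebuildFilters(clusterExcludeSettings);               // always rebuild from the full view
    }

    void rebuildFilters(Map<String, String> allExcludes) {
        // In the real decider this is where something like
        // DiscoveryNodeFilters.buildFromKeyValue(...) would be called.
        System.out.println("rebuilding exclude filters from: " + allExcludes);
    }

    public static void main(String[] args) {
        CumulativeExcludeFiltersSketch decider = new CumulativeExcludeFiltersSketch();
        decider.onExcludeSettingsChanged(Map.of("name", "node-3,node-4")); // request 1
        decider.onExcludeSettingsChanged(Map.of("testAttr", "2"));         // diff from request 2
        // the second rebuild now sees both "name" and "testAttr"
    }
}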

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.