How to completely reset forced shard allocation awareness?

While teaching shard allocation awareness and forced shard allocation awareness (Elasticsearch Engineer II course), one of the students made a typo in his request to configure forced shard allocation awareness, which broke the Elasticsearch cluster. I could not clean things up properly; I only found a workaround to make the cluster operational again.

What happened?

  1. Start up a 4-node cluster, in which
  • node1 & node2 are tagged with attribute my_rack with value rack1, and
  • node3 & node4 are tagged with attribute my_rack with value rack2,
  • all but node4 are master-eligible
  2. Configure shard allocation awareness (still all fine):
PUT _cluster/settings
{
  "transient": {
    "cluster": {
      "routing": {
        "allocation.awareness.attributes": "my_rack"
      }
    }
  }
}
  3. Shut down node3 & node4 and get the shards reallocated

  4. Now configure forced shard allocation awareness with the following command:

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.awareness.attributes": "my_rack",
        "allocation.awareness.force.my_rack_values": "rack1,rack2"
      }
    }
  }
}

Note the typo: the student accidentally typed an underscore (instead of a dot) between the attribute name my_rack and the values keyword. Although the request is accepted, it renders the cluster non-functional: nodes seem to drop out and can no longer join because the setting does not contain an expected ".".
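
For comparison, the request the student presumably meant to send differs only in the separator between my_rack and values. This is a sketch based on the description above, reusing the same attribute name and rack values; it was not re-run as part of this thread:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.awareness.attributes": "my_rack",
        "allocation.awareness.force.my_rack.values": "rack1,rack2"
      }
    }
  }
}
```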

As a workaround I stopped all nodes, added a second node attribute my_rack_values, and also put the shard allocation awareness configuration using this bogus attribute name into the yml file. This allowed all nodes to start up again and made the cluster operational, no longer constantly throwing exceptions that nodes cannot join and that there is no master node. Obviously, this leaves the cluster with a wrong, unwanted attribute name.

After the successful startup I tried to change the shard allocation awareness configuration back to the proper attribute name, but it seems that the values for forced shard allocation awareness just get added, not replaced. Initially everything worked, but after I shut down the nodes and commented out the bogus attribute name and shard allocation rules, the nodes again would not start up properly.

I tried to reset forced shard allocation awareness by setting the property to null, but I haven't managed to come up with a syntax that worked. In the end I was not able to clean up / recover from this bogus setting:

"cluster.routing.allocation.awareness.force.my_rack_values": "rack1,rack2"

This happened with Elasticsearch 7.8.1. Any ideas how to get such a "typo" fixed? I was a bit surprised to see how fatal such a typo can be...
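
For reference, the general pattern for clearing a persistent cluster setting is to assign null to its full, flat key. Applied to the bogus key, that would be a request along the following lines. This is only a sketch of the usual syntax; nothing in this thread confirms that the cluster accepts it in the state described above:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.force.my_rack_values": null
  }
}
```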

That appears to be a nasty bug. Would you open an issue for it on GitHub? Please include the logs in the report, including the exception. The invalid setting update should be rejected.

You can technically recover by shutting the whole cluster down and running elasticsearch-node remove-settings cluster.routing.allocation.awareness.force.my_rack_values on every node, but we don't encourage that sort of thing nor do we think it's a good long-term solution. Best to report it as a bug.

FWIW this is rather simple to reproduce, you only need one node with basically any config, and you just have to do this one thing to destroy it:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.force.nonsense": ""
  }
}

:boom:

Thanks @DavidTurner

Before raising it as a bug I wanted to raise it here for clarification. Unfortunately I can't get hold of the logs any longer as the instance has already been wiped.

Elasticsearch issue: Wrong forced shard allocation setting crashing the cluster · Issue #72524 · elastic/elasticsearch · GitHub

Daniel
