While teaching shard allocation awareness and forced shard allocation awareness (Elasticsearch Engineer II course), one of the students made a typo in his request to configure forced shard allocation awareness, which broke the Elasticsearch cluster. I could not clean things up properly; I only found a workaround to make the cluster operational again.
What happened?
- Start up a 4-node cluster, in which
  - node1 & node2 are tagged with attribute my_rack with value rack1, and
  - node3 & node4 are tagged with attribute my_rack with value rack2,
  - all but node4 are master-eligible
- Configure shard allocation awareness (still all fine):
PUT _cluster/settings
{
  "transient": {
    "cluster": {
      "routing": {
        "allocation.awareness.attributes": "my_rack"
      }
    }
  }
}
- Shut down node3 & node4 and wait for the shards to be reallocated
- Now configure forced shard allocation awareness with the following command:
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.awareness.attributes": "my_rack",
        "allocation.awareness.force.my_rack_values": "rack1,rack2"
      }
    }
  }
}
Note the typo: the student accidentally typed an underscore (instead of a dot) to separate the attribute name my_rack and the values keyword. The request is accepted, but it renders the cluster non-functional: nodes seem to drop out and are no longer able to join because the setting does not contain an expected ".".
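For comparison, the request as it was presumably intended, with a dot between the attribute name and the values keyword, would look roughly like this. It is a sketch that reuses the attribute and rack names from the steps above and was not re-run on the broken cluster:

```console
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.awareness.attributes": "my_rack",
        "allocation.awareness.force.my_rack.values": "rack1,rack2"
      }
    }
  }
}
```

The only difference from the request that broke the cluster is force.my_rack.values versus force.my_rack_values.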
As a workaround I stopped all nodes, added a second node attribute my_rack_values, and also put the shard allocation awareness configuration using this bogus attribute name into the yml file. This allowed all nodes to start up again and made the cluster operational again, no longer constantly throwing exceptions that nodes cannot join and that there is no master node. Obviously, though, with a wrong, unwanted attribute name.
After the successful startup I tried to change the shard allocation awareness configuration back to the proper attribute name, but it seems that the values for forced shard allocation awareness just get added, not replaced. Initially everything worked, but after shutting down the nodes and commenting out the bogus attribute name and shard allocation rules, the nodes again would not start up properly.
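One way to check which awareness settings are actually stored at this point is to list the cluster settings with flat keys. This is only a minimal sketch; no output is shown because none was captured:

```console
GET _cluster/settings?flat_settings=true
```

With flat_settings=true each setting is returned as a single dotted key, which should make it easier to see whether both cluster.routing.allocation.awareness.force.my_rack.values and cluster.routing.allocation.awareness.force.my_rack_values are present, and whether they sit under persistent or transient.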
I tried to reset forced shard allocation awareness by setting the property to null, but I haven't managed to come up with a syntax that would have allowed me to do so. In the end I was not able to clean up / recover from this bogus statement:
"cluster.routing.allocation.awareness.force.my_rack_values": "rack1,rack2"
This happened with Elasticsearch 7.8.1. Any ideas how to get such a typo fixed? I was a bit surprised to see how fatal such a typo can be...
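For reference, the documented pattern for resetting a cluster setting is to assign it null. Applied to the bogus key above, a sketch would look like the following. Whether Elasticsearch 7.8.1 accepts this for the malformed key is not confirmed here:

```console
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.force.my_rack_values": null
  }
}
```

The cluster update settings API also accepts simple wildcards for resets, so "cluster.routing.allocation.awareness.force.*": null might be an alternative worth trying. That would remove every forced awareness value, including a correct one, which would then have to be set again.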