_cat/nodes API returning stale data?

ES version: 7.7.1

In our startup/shutdown scripts we call _cat/nodes to get the number of running nodes and decide whether we need to change the shard allocation (only the first stop should change it). The script is run sequentially, hitting one node after another; if fewer than the expected number of nodes are running, it should not touch shard allocation.
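
For illustration, the check looks roughly like this (a minimal sketch, assuming a node reachable on localhost:9200 without authentication and a hypothetical 3-node cluster):

```bash
#!/usr/bin/env bash
# Count the nodes the cluster currently knows about via _cat/nodes.
EXPECTED_NODES=3   # example cluster size

# h=name prints one line per node, so wc -l gives the node count.
running_nodes=$(curl -s 'http://localhost:9200/_cat/nodes?h=name' | wc -l)

if [ "$running_nodes" -eq "$EXPECTED_NODES" ]; then
  echo "all $EXPECTED_NODES nodes present - first stop, disable shard allocation"
else
  echo "only $running_nodes of $EXPECTED_NODES nodes present - leave allocation alone"
fi
```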

It seems like in some cases, after one node has been successfully shut down (we send a SIGTERM and wait for the PID to disappear, so we're sure it's really gone), the call to _cat/nodes still includes the node that was just shut down.

Before I dig deeper and try to reproduce this reliably: is that expected behavior, i.e. does it take some time for the node's departure to propagate between the nodes? The gap between a completed shutdown and the API call might be sub-second, but I still expected the API to return the actual current state.

Why are you doing this in the first place? Why not let Elasticsearch control shard allocation automatically?

Yes that's the expected behaviour. Nodes do not wait to be removed from the cluster before the process exits.

You're probably looking for GET _cluster/health?wait_for_nodes=<n>.
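
For example (assuming a node on localhost:9200; wait_for_nodes also accepts expressions like >=2 or <3):

```bash
# Block until the cluster reports 2 nodes, or give up after 30 seconds.
curl -s 'http://localhost:9200/_cluster/health?wait_for_nodes=2&timeout=30s&pretty'
```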

We consider this best practice; otherwise the process would take much longer, cause a lot of unnecessary I/O, and increase the risk of timeouts, etc. See rolling-upgrades for reference.
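
For reference, the allocation change in question looks roughly like this (a sketch along the lines of the rolling-upgrade docs, assuming localhost:9200; the docs use "primaries", some setups use "none"):

```bash
# Restrict shard allocation before stopping a node.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'

# ...stop/patch/restart the node, then reset the setting to its default ("all").
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": null}}'
```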

Thanks for the confirmation.

I didn't know about these parameters to the health API; they are indeed useful.
That would probably work, with the small downside that we would always wait the full timeout when it isn't necessary (once, on the first node being touched), and that I don't know what a good timeout would be (i.e. how long it can take, at most, for a node to be removed from the cluster).

My other idea is to query the shard allocation setting from _cluster/settings beforehand and only change it if it is set to all.
Or could I run into the same situation, with the settings returning stale data too?
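
The setting could be read back like this (a minimal example, assuming localhost:9200; include_defaults makes the default value visible even if the setting was never changed explicitly):

```bash
# Read the cluster-wide allocation setting, defaults included.
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
  | grep 'cluster.routing.allocation.enable'
```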

I don't follow. GET _cluster/health?wait_for_nodes=<n> returns immediately if there are <n> nodes in the cluster. It only waits if the wait conditions aren't satisfied, which sounds like exactly what you want.

I don't see how that would help; cluster settings are not updated when a node leaves the cluster.

(Also you can set timeout=0s and get an immediate response either way)
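
So a non-blocking check could look like this (assuming localhost:9200 and an expected node count of 3):

```bash
# timed_out=false means the wanted node count is already there; no waiting involved.
curl -s 'http://localhost:9200/_cluster/health?wait_for_nodes=3&timeout=0s' \
  | grep -o '"timed_out":[a-z]*'
```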

> I don't follow. GET _cluster/health?wait_for_nodes=<n> returns immediately if there are <n> nodes in the cluster. It only waits if the wait conditions aren't satisfied, which sounds like exactly what you want.

Maybe I skipped too much context or simplified too much. Some background:

We use Ansible to call a local elasticsearch-control-script (bash) on each of the nodes sequentially (in this example with the parameter "stop"). All the logic for changing shard allocation lives inside this control script, which allows manual execution of the script or execution by a framework other than Ansible in the future. The same script must also support a bulletproof, automated rolling stop/start as well as a complete shutdown and restart (which is a bit challenging). All of this runs completely automated for upgrades, OS patching, downtimes, etc.

Let's say the cluster has 3 nodes. So far the script has counted the nodes returned by _cat/nodes: if it's 3, disable shard allocation; if it's fewer than 3, don't touch shard allocation (assume that already happened on the first node). We do the reverse on start: if 3 nodes are running, re-enable shard allocation.

Scenario with GET _cluster/health?wait_for_nodes=<n>
The condition in this case would be wait_for_nodes=<3 if we stick with the logic from above. The condition is not satisfied when Ansible hits the first node, because all nodes are still up and running. After the timeout the script realizes that all nodes are up (that's how it knows it is the first node) and disables shard allocation. Then it shuts down the Elasticsearch process on that node.
Ansible then calls the same command on the second node. GET _cluster/health?wait_for_nodes=<3 returns immediately, or at least quickly (once the first node is really gone), and the script does not change shard allocation because the assumption is that this already happened on the first node.
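
As a sketch, the stop part of the control script would then do something like this (assuming localhost:9200, a 3-node cluster, systemd, and "primaries" as the restricted allocation value; all of these are placeholders for our actual setup):

```bash
#!/usr/bin/env bash
ES='http://localhost:9200'

# wait_for_nodes=<3 only times out while all 3 nodes are still up, i.e. a timeout
# here means this is the first node in the rolling stop. %3C is a URL-encoded "<".
health=$(curl -s "$ES/_cluster/health?wait_for_nodes=%3C3&timeout=30s")

if echo "$health" | grep -q '"timed_out":true'; then
  echo "all nodes still up - first node, disabling shard allocation"
  curl -s -X PUT "$ES/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'
else
  echo "a node is already gone - allocation was handled on an earlier node"
fi

# Finally, stop the local Elasticsearch process.
sudo systemctl stop elasticsearch
```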

> My other idea is to query the shard allocation setting from _cluster/settings beforehand

> I don't see how that would help; cluster settings are not updated when a node leaves the cluster.

No, but I don't need that information in this case. Instead of counting the nodes to determine "am I the first node in the procedure?", I could just query the shard allocation setting. If it is already disabled, I won't disable it again (doing so returned a 503 before for some reason, which is how I noticed this issue in the first place). This only helps for the stop case though: when starting the nodes back up, I want to be sure all of them are up and running before I re-enable shard allocation.
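
Roughly like this (a sketch, assuming localhost:9200 and that any value other than all means an earlier node already restricted allocation):

```bash
# "Am I the first node?" based on the allocation setting instead of the node count.
setting=$(curl -s 'http://localhost:9200/_cluster/settings?flat_settings=true&include_defaults=true' \
  | grep -o '"cluster.routing.allocation.enable":"[a-z_]*"' | head -n1)

case "$setting" in
  *'"all"') echo "allocation still enabled - first node, disable it now" ;;
  *)        echo "allocation already restricted - nothing to change here" ;;
esac
```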

Using local heuristics to try to work out whether this is the first execution of your script is a distributed coordination problem (i.e. hard). It also seems unnecessary: if you are running the scripts in sequence then you surely already know whether you're running it for the first time or not. I'd recommend factoring out the bit that only runs once and doing it separately, before the loop.

I partially agree; this comes from the (not so distant) past when manual intervention and restarts were needed more often and were sometimes done by hand (at least on non-prod environments).

But on the other hand, there are still rare cases where the scripts are not executed in sequence and just one node has to be restarted, and then it would be easy to forget about the shard allocation (which is not a catastrophe, but it can cause pretty long delays).

Even inside the loop there would need to be some logic to determine the cluster state before flipping shard allocation back on (making sure all nodes are up), plus error handling in case steps in between fail. For now I'd rather have that logic in bash than in Ansible, but that's probably just me :slightly_smiling_face:
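
For the start side, the sketch I have in mind is roughly this (assuming localhost:9200, a 3-node cluster and an arbitrary 5-minute limit):

```bash
# Only re-enable allocation once all nodes have rejoined the cluster.
if curl -s 'http://localhost:9200/_cluster/health?wait_for_nodes=3&timeout=300s' \
     | grep -q '"timed_out":false'; then
  # Reset the setting to its default ("all").
  curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
    -H 'Content-Type: application/json' \
    -d '{"persistent": {"cluster.routing.allocation.enable": null}}'
else
  echo "not all nodes rejoined within the timeout - leaving allocation as it is" >&2
fi
```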

If you run just one or two clusters this might seem like a small problem, but we run around 10 of different sizes and update them regularly with the tools that we have and know.

Next up is k8s and operators, which might make some things easier and other things worse, but we'll see :smiley:

I think I will test the approach of checking the status of shard allocation in the cluster settings (that logic already exists for the startup phase), and if it runs stably I'll stick with it for the time being.

Thanks for the discussion!
