Restarting a Single Node in a Cluster Always Forces a Reallocation


#1

Question
An ES 6.4 node restart does not persist the UUID and causes shard reallocations and re-syncs when the node rejoins the cluster. Is there another configuration, other than path.data, that needs to be set to persist the UUID?

Issue
When we restart a single ES Docker node, the indices are yellow until a reallocation takes place.
However, I would expect the restarted node to rejoin and prevent a reallocation. The node rejoins
in a matter of seconds, but the cluster remains unhealthy (yellow) until the departed-node timeout expires.

GET _cluster/allocation/explain
...
cannot allocate because the cluster is still waiting 9.1m for the departed node holding a replica to rejoin, despite being allowed to allocate the shard to at least one other node
...

^ The node holding the shard(s) has already rejoined, yet the cluster remains yellow.

The ES Docker container mounts a volume for path.data at the same place on the host so that the data persists across restarts. I would expect the UUID to persist across reboots if path.data is set correctly, but I see that after a reboot the UUID for the node changes:
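For reference, the mount looks roughly like this (the image tag, container name, and host path below are placeholders, not our exact setup):

```shell
# Hypothetical sketch: bind-mount a host directory onto the container's
# path.data (/usr/share/elasticsearch/data in the official image) so node
# state, including the node UUID stored under path.data, survives restarts.
docker run -d --name es-node \
  -v /srv/elasticsearch/data:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:6.4.0
```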

/_cat/nodes?v&h=id,ip,node.role,master,name
id ip node.role master name
jEmZ 172.27.1.73 mdi - 172.27.1.73
YL_h 172.27.1.48 mdi - 172.27.1.48
-dST 172.27.1.124 mdi * 172.27.1.124
cMXK 172.27.1.244 mdi - 172.27.1.244
j1qr 172.27.1.36 mdi - 172.27.1.36

I've tried to follow the guide for ES Rolling Upgrades but that doesn't seem to persist the UUID.

Is there something I'm missing?

Most configurations are left at the Elasticsearch 6.4 defaults. This question is similar to what is described in Delayed unassigned shards.

GET _cluster/health
{
"active_primary_shards": 30,
"active_shards": 72,
"active_shards_percent_as_number": 80.0,
"cluster_name": "test-cluster",
"delayed_unassigned_shards": 18,
"initializing_shards": 0,
"number_of_data_nodes": 5,
"number_of_in_flight_fetch": 0,
"number_of_nodes": 5,
"number_of_pending_tasks": 0,
"relocating_shards": 0,
"status": "yellow",
"task_max_waiting_in_queue_millis": 0,
"timed_out": false,
"unassigned_shards": 18
}
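The yellow window above lines up with the per-index delayed-allocation setting. If the wait itself (rather than the UUID change) were the concern, that delay can be tuned; a sketch, assuming a localhost cluster and an arbitrary 5m value:

```shell
# Sketch (host and timeout value are assumptions): adjust how long the
# cluster waits for a departed node before reallocating its replicas.
curl -s -H 'Content-Type: application/json' \
  -X PUT 'http://localhost:9200/_all/_settings' -d '
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'
```

This only shortens or lengthens the wait; it does not fix a node rejoining under a new UUID.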


#2

Turns out this was an issue with our configuration of Mesos (with the Marathon framework) and not Elasticsearch.

We had been running the Mesos agent inside a Docker container. However, this does not work if we want to use Mesos's Persistent Volume feature, since the Mesos process inside Docker is unable to mount on the host filesystem.

This issue was resolved after we moved Mesos to run directly on the host, managed by systemd. The volumes are persisted and ES is able to restart without generating a new UUID.
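One way to verify the fix, sketched here (host, port, and container name are placeholders): the id column in _cat/nodes is derived from the node UUID and should be identical before and after a restart when path.data is persisted correctly.

```shell
# Capture node ids, restart one node, and compare the id column.
curl -s 'http://localhost:9200/_cat/nodes?v&h=id,ip,name'
docker restart es-node
curl -s 'http://localhost:9200/_cat/nodes?v&h=id,ip,name'
```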


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.