I've set up a 6-node Elasticsearch 7.17.0 cluster that has to serve a single read-only ~1TB index.
The nodes have 2TB of storage, so I've set that index to be a single shard, with 5 replicas (so each node gets a full copy of the index).
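For reference, this is roughly how the index was created (a reconstruction, not the exact original request):
$ curl -X PUT 'localhost:9200/the_target_index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 5
    }
  }
}'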
Both the cluster and the index were green.
Today, I've changed the port on which one of the nodes (cluster-node-24) listens. Due to some early tests, the node was listening on port 9201, and I've just moved it back to 9200.
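Concretely, the change was just the HTTP port line in that node's elasticsearch.yml (as far as I can tell, the only setting that differed from the other nodes), plus a restart of the node:
# elasticsearch.yml on cluster-node-24 - it was http.port: 9201 during the early tests
http.port: 9200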
The thing is, the cluster is now yellow. The allocation explain API says that it's trying to allocate a shard because its node left, but it can't, since no node has enough free storage:
$ curl 'localhost:9200/_cluster/allocation/explain?pretty'
{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index" : "the_target_index",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2022-06-16T19:20:42.459Z",
    "details" : "node_left [d1KGHbo6RQaHLpBNOU6GnQ]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "HQhro8j3ReCHv3LAgUF-Bw",
      "node_name" : "cluster-node-26",
      "transport_address" : "192.168.244.126:9300",
      "node_attributes" : {
        "ml.machine_memory" : "135088414720",
        "ml.max_open_jobs" : "512",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "33285996544",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[the_target_index][0], node[HQhro8j3ReCHv3LAgUF-Bw], [P], s[STARTED], a[id=f5yp09RHSYKNTHa1SQjKSg]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "allocating the shard to this node will bring the node above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=90%] and cause it to have less than the minimum required [0b] of free space (free: [509.4gb], estimated shard size: [1.1tb])"
        }
      ]
    },
    {
      "node_id" : "d1KGHbo6RQaHLpBNOU6GnQ",
      "node_name" : "cluster-node-24",
      "transport_address" : "192.168.244.124:9300",
      "node_attributes" : {
        "ml.machine_memory" : "135082483712",
        "ml.max_open_jobs" : "512",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "33285996544",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "allocating the shard to this node will bring the node above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=90%] and cause it to have less than the minimum required [0b] of free space (free: [511.2gb], estimated shard size: [1.1tb])"
        }
      ]
    },
    {
      "node_id" : "iDzDo9hfSUW3N2uJo_14Qg",
      "node_name" : "cluster-node-21",
      "transport_address" : "192.168.244.121:9300",
      "node_attributes" : {
        "ml.machine_memory" : "135082483712",
        "ml.max_open_jobs" : "512",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "33285996544",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[the_target_index][0], node[iDzDo9hfSUW3N2uJo_14Qg], [R], s[STARTED], a[id=4g1YYexaSeKXTTibGDyRhQ]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "allocating the shard to this node will bring the node above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=90%] and cause it to have less than the minimum required [0b] of free space (free: [511.7gb], estimated shard size: [1.1tb])"
        }
      ]
    },
    {
      "node_id" : "lwB2cT2JRfe9CTv2KF6CNw",
      "node_name" : "cluster-node-23",
      "transport_address" : "192.168.244.123:9300",
      "node_attributes" : {
        "ml.machine_memory" : "135090417664",
        "ml.max_open_jobs" : "512",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "33285996544",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[the_target_index][0], node[lwB2cT2JRfe9CTv2KF6CNw], [R], s[STARTED], a[id=lzitYi11SESnWl4IY6awsA]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "allocating the shard to this node will bring the node above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=90%] and cause it to have less than the minimum required [0b] of free space (free: [507.9gb], estimated shard size: [1.1tb])"
        }
      ]
    },
    {
      "node_id" : "wDCIUfW_QqSJbiz8FwFSOg",
      "node_name" : "cluster-node-22",
      "transport_address" : "192.168.244.122:9300",
      "node_attributes" : {
        "ml.machine_memory" : "135082483712",
        "ml.max_open_jobs" : "512",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "33285996544",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[the_target_index][0], node[wDCIUfW_QqSJbiz8FwFSOg], [R], s[STARTED], a[id=_HbW6DPYRBm9FrHvgQk5vA]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "allocating the shard to this node will bring the node above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=90%] and cause it to have less than the minimum required [0b] of free space (free: [511.3gb], estimated shard size: [1.1tb])"
        }
      ]
    },
    {
      "node_id" : "wa1qscSTTiCPgUgw-O8TxA",
      "node_name" : "cluster-node-25",
      "transport_address" : "192.168.244.125:9300",
      "node_attributes" : {
        "ml.machine_memory" : "135082483712",
        "ml.max_open_jobs" : "512",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "33285996544",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[the_target_index][0], node[wa1qscSTTiCPgUgw-O8TxA], [R], s[STARTED], a[id=6p2bttpzRl-PHfCifwIdiA]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "allocating the shard to this node will bring the node above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=90%] and cause it to have less than the minimum required [0b] of free space (free: [513gb], estimated shard size: [1.1tb])"
        }
      ]
    }
  ]
}
This is the list of current allocations:
$ curl -X GET "localhost:9200/_cat/allocation?v=true&h=node,shards,disk.*&pretty"
node            shards disk.indices disk.used disk.avail disk.total disk.percent
cluster-node-22      5        1.1tb     1.2tb    511.3gb      1.7tb           70
cluster-node-25      5        1.1tb     1.2tb      513gb      1.7tb           70
cluster-node-26      5        1.1tb     1.2tb    509.4gb      1.7tb           70
cluster-node-21      5        1.1tb     1.2tb    511.7gb      1.7tb           70
cluster-node-24      0           0b     1.2tb    511.2gb      1.7tb           70
cluster-node-23      5        1.1tb     1.2tb    507.9gb      1.7tb           71
UNASSIGNED           1
cluster-node-24 still has the same ID as before (compare with the node ID in the NODE_LEFT details in the explain output above):
$ curl -X GET "localhost:9200/_cat/nodes?v&h=name,id&full_id=true"
name id
cluster-node-21 iDzDo9hfSUW3N2uJo_14Qg
cluster-node-23 lwB2cT2JRfe9CTv2KF6CNw
cluster-node-25 wa1qscSTTiCPgUgw-O8TxA
cluster-node-26 HQhro8j3ReCHv3LAgUF-Bw
cluster-node-24 d1KGHbo6RQaHLpBNOU6GnQ
cluster-node-22 wDCIUfW_QqSJbiz8FwFSOg
Here's the list of indices I have:
$ curl -X GET "localhost:9200/_cat/indices?v"
health status index            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .geoip_databases wqmlJW5zQ6ePLv5Noc8Zdw   1   1         41           41     77.8mb         38.9mb
yellow open   the_target_index Mb2ZIRXbQO6RTeeMT4KVeQ   1   5  270722174            0      5.5tb          1.1tb
green  open   other_index      kpzkVtJcSviNxsytfgYaTw   1   1     277223            0     15.8gb          7.9gb
green  open   other_index_2    Lxl58z9WSb2ooHuTM3hdpA   1   1          2            0    195.1kb         97.5kb
green  open   .tasks           Ry1PYtMQS7CK1qyaz4Oeww   1   1         11            0    140.6kb         70.3kb
And I can clearly see on cluster-node-24 that it still has the index data stored:
$ sudo du -h /usr/share/elasticsearch/data/nodes/0/indices/Mb2ZIRXbQO6RTeeMT4KVeQ | tail -n1
1.2T /usr/share/elasticsearch/data/nodes/0/indices/Mb2ZIRXbQO6RTeeMT4KVeQ
I'd expect the node to keep being "the same" (it has the same ID, so nothing should need to be reallocated), or to somehow notice that the shard being allocated is the one it already has on disk (so it could say "I'll take this one" and be ready almost instantly), or at the very least to notice that it's holding a lot of data for shards it isn't assigned (since it thinks it has none), free that disk space, and then sync the shard again.
But none of this seems to be happening.
Am I missing something? How do I fix this?
My current best alternative is to decrease the replica count to 4, so the index goes green again, and then increase it back to 5 so that cluster-node-24 gets a replica again. But this plan kind of assumes that the node will eventually notice it's holding a lot of data it isn't using - and I'm not sure that will happen at all.
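In practice that would be something along these lines, followed by the same call with 5 once the cluster is green again:
$ curl -X PUT 'localhost:9200/the_target_index/_settings' -H 'Content-Type: application/json' -d '
{
  "index" : {
    "number_of_replicas" : 4
  }
}'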
Another alternative would be to manually delete that data directory on the node - but I don't know what issues that could cause.
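That would be roughly the following on cluster-node-24 (assuming the node is stopped first and runs as a systemd service; the path is the one from the du output above):
$ sudo systemctl stop elasticsearch
$ sudo rm -rf /usr/share/elasticsearch/data/nodes/0/indices/Mb2ZIRXbQO6RTeeMT4KVeQ
$ sudo systemctl start elasticsearch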
Moving the node back to port 9201 didn't fix the issue either - all of these commands still show basically the same output.
(I've redacted the node and index names, but none of the IDs.)