UNASSIGNED shards after removing a node

ApurvG · July 29, 2016, 11:00am

I had a 4 node cluster, 2 indices (5 and 10 shards respectively) each with 2 replicas.

I removed one of the nodes from the cluster by shutting it down - it's gone. This was done by first calling _cluster/nodes//shutdown and then killing the elasticsearch process.

But now I see UNASSIGNED shards sitting around and not being distributed to the remaining 3 nodes. enable_allocation is true.

Is there something wrong in how I remove the node from the cluster?

$ curl "localhost:25700/_cat/allocation?v"
shards disk.used disk.avail disk.total disk.percent host ip node
11 3.5tb 3.6tb 7.1tb 49 10.2.34.185 10.2.34.185 130593668838
11 3.4tb 3.6tb 7.1tb 48 10.2.34.181 10.2.34.181 130593666698
12 3.4tb 3.6tb 7.1tb 48 10.2.34.183 10.2.34.183 130593666690
11 UNASSIGNED

$ curl "localhost:25700/_cat/nodes?v"
host ip heap.percent ram.percent load node.role master name
10.2.34.181 10.2.34.181 17 99 15.48 d m 130593666698
10.2.34.185 10.2.34.185 19 99 19.44 d m 130593668838
10.2.34.183 10.2.34.183 14 99 12.65 d * 130593666690

I don't see any pending tasks - I'd assume ES would be moving some shards, but it's not.

$ curl "localhost:25700/_cat/pending_tasks?v"
insertOrder timeInQueue priority source

$ curl "localhost:25700/_cluster/settings?pretty"
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  },
  "transient" : { }
}
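
As a sanity check on the numbers above, here is a rough Python sketch of the shard-copy arithmetic. The primary counts, replica count and per-node totals are taken from the post; the index names come from the `_cat/shards` output later in the thread. The variable names are made up for illustration. The only outside fact it relies on is that Elasticsearch does not place two copies of the same shard on one node.

```python
# Primary shard counts per index, as described in the post.
primaries = {"objindex": 5, "cfileindex": 10}
replicas = 2                      # "each with 2 replicas"
copies_per_shard = 1 + replicas   # one primary plus its replicas

total_copies = sum(primaries.values()) * copies_per_shard

# Per-node shard counts and the UNASSIGNED count from _cat/allocation.
assigned_per_node = [11, 11, 12]
unassigned = 11

# Every shard copy should be either assigned or unassigned.
accounted_for = sum(assigned_per_node) + unassigned

# Copies of the same shard must sit on different nodes, so a full
# allocation needs at least as many nodes as copies per shard.
remaining_nodes = 3
can_fully_allocate = copies_per_shard <= remaining_nodes

print(total_copies, accounted_for, can_fully_allocate)
```

So three nodes are enough to hold every copy (15 per node), and the unassigned replicas are not explained by running out of nodes.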

ywelsch · July 29, 2016, 12:18pm

Can you attach the full cluster state here (or put it on pastebin)? You can get it from /_cluster/state?pretty
If it contains confidential information, please message me a private link.

ApurvG · July 29, 2016, 2:03pm

Sorry, I had to fix the cluster and did the following:

$ curl -XPUT localhost:25700/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "10.2.34.179"
  }
}'
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"exclude":{"_ip":"10.2.34.179"}}}}}}

$ curl 'localhost:25700/_cat/allocation?v'
shards disk.used disk.avail disk.total disk.percent host ip node
14 3.5tb 3.5tb 7.1tb 49 10.2.34.181 10.2.34.181 130593666698
13 3.5tb 3.5tb 7.1tb 49 10.2.34.185 10.2.34.185 130593668838
14 3.5tb 3.5tb 7.1tb 49 10.2.34.183 10.2.34.183 130593666690
4 UNASSIGNED
$ curl 'localhost:25700/_cat/allocation?v'
shards disk.used disk.avail disk.total disk.percent host ip node
15 3.5tb 3.5tb 7.1tb 49 10.2.34.183 10.2.34.183 130593666690
15 3.5tb 3.5tb 7.1tb 49 10.2.34.185 10.2.34.185 130593668838
15 3.5tb 3.5tb 7.1tb 49 10.2.34.181 10.2.34.181 130593666698

I do still have the shard state in my terminal. [_cluster/state was unfortunately piped to less]
I also realised that the _shutdown API has been removed, so the node removal was basically done by stopping the ES process on 10.2.34.179.

$ curl 'localhost:25700/_cat/shards?v'
index shard prirep state docs store ip node
cfileindex 9 p STARTED 886853 106.4mb 10.2.34.183 130593666690
cfileindex 9 r STARTED 886853 106.4mb 10.2.34.185 130593668838
cfileindex 9 r STARTED 886853 106.4mb 10.2.34.181 130593666698
cfileindex 2 p STARTED 886407 106.2mb 10.2.34.185 130593668838
cfileindex 2 r STARTED 886407 106.2mb 10.2.34.181 130593666698
cfileindex 2 r UNASSIGNED
cfileindex 5 p STARTED 886642 106.7mb 10.2.34.185 130593668838
cfileindex 5 r STARTED 886642 106.7mb 10.2.34.181 130593666698
cfileindex 5 r UNASSIGNED
cfileindex 8 p STARTED 885850 106.6mb 10.2.34.183 130593666690
cfileindex 8 r STARTED 885850 106.6mb 10.2.34.185 130593668838
cfileindex 8 r STARTED 885850 106.6mb 10.2.34.181 130593666698
cfileindex 7 p STARTED 885387 106.5mb 10.2.34.183 130593666690
cfileindex 7 r STARTED 885387 106.5mb 10.2.34.185 130593668838
cfileindex 7 r STARTED 885387 106.5mb 10.2.34.181 130593666698
cfileindex 6 p STARTED 886446 106.3mb 10.2.34.183 130593666690
cfileindex 6 r STARTED 886446 106.3mb 10.2.34.185 130593668838
cfileindex 6 r UNASSIGNED
cfileindex 1 p STARTED 886537 106.3mb 10.2.34.183 130593666690
cfileindex 1 r STARTED 886537 106.3mb 10.2.34.181 130593666698
cfileindex 1 r UNASSIGNED
cfileindex 3 p STARTED 884795 106.3mb 10.2.34.183 130593666690
cfileindex 3 r STARTED 884795 106.3mb 10.2.34.185 130593668838
cfileindex 3 r UNASSIGNED
cfileindex 4 p STARTED 885088 106.2mb 10.2.34.183 130593666690
cfileindex 4 r STARTED 885088 106.2mb 10.2.34.181 130593666698
cfileindex 4 r UNASSIGNED
cfileindex 0 p STARTED 885674 106.3mb 10.2.34.183 130593666690
cfileindex 0 r STARTED 885674 106.3mb 10.2.34.185 130593668838
cfileindex 0 r UNASSIGNED
objindex 4 p STARTED 6 55.7kb 10.2.34.183 130593666690
objindex 4 r STARTED 6 55.7kb 10.2.34.185 130593668838
objindex 4 r STARTED 6 55.7kb 10.2.34.181 130593666698
objindex 3 p STARTED 10 111.9kb 10.2.34.183 130593666690
objindex 3 r STARTED 10 111.9kb 10.2.34.181 130593666698
objindex 3 r UNASSIGNED
objindex 1 p STARTED 3 52kb 10.2.34.183 130593666690
objindex 1 r STARTED 3 52kb 10.2.34.185 130593668838
objindex 1 r UNASSIGNED
objindex 2 p STARTED 9 134.2kb 10.2.34.183 130593666690
objindex 2 r STARTED 9 134.2kb 10.2.34.181 130593666698
objindex 2 r UNASSIGNED
objindex 0 p STARTED 8 59.5kb 10.2.34.185 130593668838
objindex 0 r STARTED 8 59.5kb 10.2.34.181 130593666698
objindex 0 r UNASSIGNED
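
A listing like this is easier to read once it is tallied. The following is a minimal Python sketch, assuming the plain-text `_cat/shards?v` column layout shown above, that counts UNASSIGNED copies per index. The function name `count_unassigned` and the shortened sample are made up for illustration and are not part of any Elasticsearch client.

```python
from collections import Counter

def count_unassigned(cat_shards_text):
    """Count UNASSIGNED shard copies per index in `_cat/shards?v` text output."""
    counts = Counter()
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a shard row.
        if len(fields) < 4 or fields[0] == "index":
            continue
        index, _shard, _prirep, state = fields[:4]
        if state == "UNASSIGNED":
            counts[index] += 1
    return dict(counts)

# A few rows copied from the output above (shortened sample).
sample = """\
index shard prirep state docs store ip node
cfileindex 9 p STARTED 886853 106.4mb 10.2.34.183 130593666690
cfileindex 2 r UNASSIGNED
cfileindex 5 r UNASSIGNED
objindex 4 p STARTED 6 55.7kb 10.2.34.183 130593666690
objindex 3 r UNASSIGNED
"""

print(count_unassigned(sample))
```

Run against the full listing, this kind of tally should report 7 unassigned replicas for cfileindex and 4 for objindex, which matches the 11 UNASSIGNED reported by `_cat/allocation` in the first post.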

ywelsch · July 29, 2016, 2:12pm

Can you post the output of

curl -XPOST 'localhost:25700/_cluster/reroute?explain' -d '{
    "commands" : [
        {
          "allocate" : {
              "index" : "cfileindex", "shard" : 2, "node" : "130593666690"
          }
        }
    ]
}'

Which ES version are you using? Please also make the cluster state available as requested above.

ApurvG · July 29, 2016, 7:00pm

Note that I have removed some confidential mappings/analysis sections.
The ES version is: 2.0.0

cluster state: http://pastebin.com/c9MnAETf
[Note that I have removed confidential mappings and analyzer sections]

Output of explain: http://pastebin.com/HFfax5j4

I'll update the thread in case this reproduces - please let me know if there's anything more I can capture at this stage. It looks like a bug to me that shards were not being assigned.

ywelsch · July 29, 2016, 7:21pm

Once it reproduces, capture the cluster state and try the reroute command with "explain" on an unassigned shard by trying to allocate it to a node that does not have that shard.
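
To make that last step concrete, here is a rough Python sketch of picking a node that does not already hold a copy of the shard and building the request body for the reroute command. The body has the same shape as the `allocate` command posted earlier in the thread. The helper names (`nodes_without_copy`, `build_allocate_body`) and the hand-written row dictionaries are assumptions for illustration. The sketch only builds the JSON; it does not call the cluster.

```python
import json

def nodes_without_copy(shard_rows, index, shard, all_nodes):
    """Return the nodes that hold no copy of the given shard."""
    holders = {
        row["node"]
        for row in shard_rows
        if row["index"] == index and row["shard"] == shard and row["node"]
    }
    return sorted(set(all_nodes) - holders)

def build_allocate_body(index, shard, node):
    """Build a reroute body with one allocate command, as used in this thread."""
    return {
        "commands": [
            {"allocate": {"index": index, "shard": shard, "node": node}}
        ]
    }

# Rows for cfileindex shard 2, copied from the _cat/shards output in this thread.
rows = [
    {"index": "cfileindex", "shard": 2, "state": "STARTED", "node": "130593668838"},
    {"index": "cfileindex", "shard": 2, "state": "STARTED", "node": "130593666698"},
    {"index": "cfileindex", "shard": 2, "state": "UNASSIGNED", "node": None},
]
all_nodes = ["130593666698", "130593668838", "130593666690"]

candidates = nodes_without_copy(rows, "cfileindex", 2, all_nodes)
body = build_allocate_body("cfileindex", 2, candidates[0])
print(json.dumps(body, indent=2))
```

The printed JSON can then be passed with `-d` to `curl -XPOST 'localhost:25700/_cluster/reroute?explain'`, as in the earlier post.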
