Shards unassigned after node restarts - reason: NODE_LEFT

rkrombho · November 29, 2016, 1:47pm

Hi All,

after we were forced to restart multiple ES 2.2.0 cluster nodes (becuase of massive full GCs due to a heavy aggregation) we now experience ~30 shards (mixed - primaries and replicas) to be unassigned with the reason: NODE_LEFT.

Yes some nodes left but it are back now and the cluster doesn't seem to realize that. It was a kill -9 for the Java process (because it wasn't reacting at all to normal kill signals).

In such situations we always do the following:

disable shard allocation
restart the node (potentially killing the process hard)
wait until it's back
enable shard allocation
wait for recovery to finish
start again with the next node
...

I can even see the shard data directly on the filesystem being present.

All entries in /_cat/recovery are in status 'done'.

There are some shards where only the replica is unassigned and some other shards where replica and primary are unassigned.

When I manually try to allocate (using the reroute api) a shards with primary and replica unassigned, I get the following response:

[allocate] trying to allocate a primary shard [<index>][2], which is disabled

When I do the same for a replica-only unassigned shard, it works when I choose the correct node but it seem to be a temporary thing.
As soon as the node then restarts, the shard will again not recover on this node.

All those shards have the following state:

{
 "state": "UNASSIGNED",
 "primary": false,
 "node": null,
 "relocating_node": null,
 "shard": 0,
 "index": "<index>",
 "version": 32,
 "unassigned_info": {
	"reason": "NODE_LEFT",
	"at": "2016-11-29T12:29:00.089Z",
	"details": "node_left[uMsLnWVGRDijRvKv49UISA]"
 }
}

The ID inside node_left[] does not exist because it was one of the servers ID before we did the restart.
As the server id's change after every restart, this information is quite useless.

The last few messages in the logs with the above mentioned server ID were the following (may be related to heave GC):

[2016-11-29 13:29:00,160][DEBUG][action.search.type       ] [<node>] [<index>][0], node[uMsLnWVGRDijRvKv49UISA], [R], v[55], s[STARTED], a[id=TQ26pj3ZTtK8nwj1TZ7X1Q]: Failed to execute [org.elastic
search.action.search.SearchRequest@64dc9f61]
SendRequestTransportException[[<node>][<ip>:9300][indices:data/read/search[phase/query]]]; nested: TransportException[TransportService is closed stopped can't send request];
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:282)
        at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:142)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:85)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:166)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:245)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:174)
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:46)
        at org.elasticsearch.transport.TransportService$2.run(TransportService.java:198)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: TransportException[TransportService is closed stopped can't send request]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:303)
        ... 11 more

Does anyone have any idea how we can get our data back and the cluster back to green?

Cheers
Robert

eperry · November 29, 2016, 2:20pm

If all the shards are don't relocating and there is at least 1 of the shards allocated you can move the unallocated shards

https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html

I have a nice script that greps for all the unallocated shards and assigns them to a node. From there Elastic rebalances it nicely

Yah kill -9 might have cause you some of the issue

The other option would be to set replication to 0 and then back to 1. Which essentially deletes the secondary shards and creates new replica shards

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html

rkrombho · November 29, 2016, 2:23pm

Thanks for the Answer and I'm quite sure this will work work the shards where only replicas are unassigned.

But what about the shards that have replica & primary as unassigned?
Setting their replication to 0 won't really help.

eperry · November 29, 2016, 2:27pm

in the case where both primary and secondary are unassigned. Those have become "corrupt" or were deleted externally to ES and there fore the whole index is broke.

IDK if there is a way to rebuild it, don't think so, How ever, from my experiance that usually means one of the nodes did not come back up, or re-join the cluster.

Do you have kopf or Marvel or Head installed? I would double check the number of data nodes in your cluster.
It might be up but not joined.

rkrombho · November 29, 2016, 2:38pm

Yes I got kopf installed.

The amount of running nodes is correct.
We have 13 master eligible data nodes (no dedicated master) and discovery.zen.minimum_master_nodes with a value of 7.

I wonder if I ran into some sort of split brain scenario.

OK I will probably try to re-index the data (if it's still around) but I'm worried for this to happen again if I don't find the cause.
It's not just logs that I lose here. It's real production data.

rkrombho · November 29, 2016, 2:53pm

Just tried it out with an index where just one replica was unassigned.

Seting replication to 0 of course makes all unassigned shards gone.
Then I configured it back to 1 and wait for the initialization to finish.

After that I now have 2 replica shards unassigned.
Things seem to get worse when I do that.
The reason now changed to REPLICA_ADDED

rkrombho · November 29, 2016, 2:55pm

btw: "logger.indices.recovery" is configured to DEBUG and the only think I see in the logs is this:

[2016-11-29 15:50:17,937][INFO ][cluster.metadata         ] [<node>] updating number_of_replicas to [0] for indices [<index>]
[2016-11-29 15:50:43,458][INFO ][cluster.metadata         ] [<node>] updating number_of_replicas to [1] for indices [<index>]

eperry · November 29, 2016, 3:20pm

i think you have too many masters

masters hold the state of the cluster and has to resync every change,
allocation etc. you can double check me but rule of thumb is like 3 master
nodes maybe 5 if you have a huge cluster

i run with 17 nodes 3 are masters

i would recomend turning off the master on all the other nodes. or spawn 3
light weight nodes as the master does not need many resouces with data node
off 1-4 gig heap does it for me

rkrombho · November 29, 2016, 3:31pm

There is only one master in a cluster as far as I know.
Any node can potentially be a master.

In my setup all nodes can potentially be a master and exatcly one is it.

I know it would be better to have a dedicated low profile node to be the master an not hold any data but I don't think there is a general issue with having all nodes being master eligible

eperry · November 29, 2016, 3:45pm

Right but if you don't turn off master on the other nodes they all get state copies of the Primary master and any fault all the nodes need to re-elect the next master and the more master eligible nodes the longer it takes.

Any way something to think about, it might have been when you kill -9 the servers it lost the state.

rkrombho · November 29, 2016, 6:23pm

Okay I wasn't aware that state is transferred to all potential master nodes.
Will slim that down to 3 nodes.

Still I can't get those unassigned replica shards allocated. Even reseting replicating didn't help.

eperry · November 30, 2016, 4:42am

If the primary shard is allocated and only the secondary is not re-allocating.

you could check your disk space that would prevent a re-allocating shared if it is larger then disk availability (I have experienced old indexes don't totally get deleted)

here is something to read
https://www.elastic.co/guide/en/elasticsearch/reference/current/delayed-allocation.html
you could also make sure your shard routing is turned back on

What is the primary shard's status of the shards that are not allocated, you could be stuck Initializing

Short of that, and additional info, not sure what else it could be.

The only think I can still think of is that a node in your cluster is not healthy. have you restarted all nodes?

eperry · November 30, 2016, 4:47am

what about closing the indexing and re-opening it, reaching for straws

curl http://$HOSTNAME:####/index/_close

curl http://$HOSTNAME:####/index/_open

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-open-close.html

rkrombho · November 30, 2016, 8:32am

Disk space and all that stuff is okay.

Actually closing and opening the index helped. It now assigned all shards.
Will do that for all indices.

In order to hopefully avoid that in the future I will re-configure the cluster to only have 3 master eligible nodes and do a complete restart after that.

Thanks for all your help

rkrombho · November 30, 2016, 11:58am

That was a bit too early.
Closing and opening the index helped with replicy-only unassigned shards.
I deleted the indices where primary shards could not be assigned (lots of data lost).

Doing that for all indices I finally got back to green.

Suddenly I am now back to red with one primary unassigned and multiple replicas unassigned again.
No node crashed or was restarted.

The logs now show the following message (index names and hostnames replaced):

[2016-11-30 12:37:12,152][WARN ][cluster.action.shard     ] [<master_node>] [<index>][0] received shard failed for [<index>][0], node[nGIL0sSuSBiQmtfSBysQ1A], [R], v[18], s[INITIALIZING], a[id=MrZ3pYlWS6SsY8PmHt_UUQ], unassigned_info[[reason=REALLOCATED_REPLICA], at[2016-11-30T11:37:04.517Z], details[existing allocation of replica to [{<node1>}{n-5JVyqXTYGXIFbEugCS3w}{<ip1>}{<ip1>:9300}{storage_type=hot}] cancelled, sync id match found on node [{<node3>}{nGIL0sSuSBiQmtfSBysQ1A}{<ip3>}{<ip3>:9300}{storage_type=hot}]]], expected_shard_size[638492726], indexUUID [WEj1FIQwSIuPHnRGrxffBA], message [failed to create shard], failure [ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [<index>][0], timed out after 5000ms]; ]
[<index>][[<index>][0]] ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [<index>][0], timed out after 5000ms];
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:389)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:602)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:502)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:166)
        at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:600)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:762)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [<index>][0], timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:593)
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:521)
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:306)
        ... 10 more

This message appears for every 14 now unassigned shards multiple times

rkrombho · November 30, 2016, 12:28pm

Ah, the above message doesn't seem to be related to the failed recovery.
Here is the error message from the node where the recovery really initiated after the above cancelation:

[2016-11-30 12:40:26,008][WARN ][indices.cluster          ] [<node3>] [[<index>][1]] marking and sending shard failed due to [failed to create shard]
[<index>][[<index>][1]] ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [<index>][1], timed out after 5000ms];
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:389)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:602)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:502)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:166)
        at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:600)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:762)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [<index>][1], timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:593)
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:521)
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:306)
        ... 10 more

Does anyone know what would cause the Shard lock aquisition to timeout?

system · December 28, 2016, 12:29pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Why shard unassigned after cluster restart completely? Elasticsearch	1	394	May 28, 2020
Unassigned shards after node was restarted Elasticsearch	3	546	June 8, 2020
Unnassigned Shards After Node Restart Elasticsearch	3	523	July 5, 2017
Unassigned shards after node leave even when node is excluded Elasticsearch	1	903	March 2, 2019
Unassigned shards: NODE_LEFT (Cloud) Elasticsearch	4	3949	February 8, 2021

Shards unassigned after node restarts - reason: NODE_LEFT

Related topics