Hi All,
after we were forced to restart multiple ES 2.2.0 cluster nodes (because of massive full GCs caused by a heavy aggregation), we now have ~30 shards (a mix of primaries and replicas) stuck unassigned with the reason NODE_LEFT.
Yes, some nodes left, but they are back now and the cluster doesn't seem to realize that. The restart was a kill -9 of the Java process (because it wasn't reacting to normal kill signals at all).
In such situations we always do the following (see the curl sketch after the list):
- disable shard allocation
- restart the node (potentially killing the process hard)
- wait until it's back
- enable shard allocation
- wait for recovery to finish
- start again with the next node
- ...
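This is roughly what the allocation toggling looks like on our side; a minimal sketch, assuming curl against localhost:9200 and transient cluster settings (host, port and timeout are placeholders):

# disable shard allocation before restarting a node
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ... restart the node and wait until it rejoins ...

# re-enable shard allocation
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'

# wait for recovery to finish before moving on to the next node
curl 'localhost:9200/_cluster/health?wait_for_status=green&timeout=10m'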
I can even see that the shard data is present directly on the filesystem.
All entries in /_cat/recovery are in status 'done'.
For some shards only the replica is unassigned; for others both the primary and the replica are unassigned.
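For reference, this is how I list them (a plain _cat call; I'm not sure the unassigned.reason column is available in every 2.x minor, so grepping for the state works either way):

curl 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED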
When I manually try to allocate (using the reroute API) a shard whose primary and replica are both unassigned, I get the following response:
[allocate] trying to allocate a primary shard [<index>][2], which is disabled
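The reroute call I ran looked roughly like this (index, shard number and node name are placeholders). As far as I understand, the 'disabled' message means the allocate command would additionally need "allow_primary": true, which I haven't forced so far because it can throw away data if the chosen node doesn't actually hold a good copy of the shard:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    { "allocate": { "index": "<index>", "shard": 2, "node": "<node-name>" } }
  ]
}'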
When I do the same for a shard where only the replica is unassigned, it works when I choose the correct node, but it seems to be only temporary.
As soon as that node restarts, the shard again fails to recover on it.
All those shards have the following state:
{
  "state": "UNASSIGNED",
  "primary": false,
  "node": null,
  "relocating_node": null,
  "shard": 0,
  "index": "<index>",
  "version": 32,
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2016-11-29T12:29:00.089Z",
    "details": "node_left[uMsLnWVGRDijRvKv49UISA]"
  }
}
The ID inside node_left[] no longer exists because it was the node's ID before we did the restart.
As the node IDs change with every restart, this information is not very useful.
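(I compare it against the current IDs with a plain _cat call; assuming the id column is available in 2.2:)

curl 'localhost:9200/_cat/nodes?v&h=id,name,ip'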
The last few log messages mentioning that node ID were the following (possibly related to the heavy GC):
[2016-11-29 13:29:00,160][DEBUG][action.search.type ] [<node>] [<index>][0], node[uMsLnWVGRDijRvKv49UISA], [R], v[55], s[STARTED], a[id=TQ26pj3ZTtK8nwj1TZ7X1Q]: Failed to execute [org.elasticsearch.action.search.SearchRequest@64dc9f61]
SendRequestTransportException[[<node>][<ip>:9300][indices:data/read/search[phase/query]]]; nested: TransportException[TransportService is closed stopped can't send request];
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:282)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:142)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:85)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:166)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:245)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:174)
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:46)
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:198)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: TransportException[TransportService is closed stopped can't send request]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:303)
... 11 more
Does anyone have any idea how we can get our data back and the cluster back to green?
Cheers
Robert