Hey all,
Recently I had to replace one of my ES nodes, which died suddenly on DigitalOcean.
So I ran this command to remove the node from the cluster:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "10.0.0.4"
  }
}'; echo
(That isn't the real IP of the removed host; I've substituted addresses throughout this post.) I then restarted Elasticsearch on each node, and when I was done restarting the cluster I saw that it was stuck in a 'yellow' state.
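The exclusion was set as a transient setting, so I believe it should have been dropped by the full restart, but for reference this is roughly how I've been checking (and, if needed, clearing) the allocation settings. Using an empty string to drop the exclusion is my understanding of the API, not something I've verified on this cluster:

# Show the current transient/persistent cluster settings; the exclusion
# should show up under "transient" if it is still applied.
curl -XGET 'localhost:9200/_cluster/settings?pretty'

# Clearing the exclusion should look roughly like this (placeholder IP again):
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : ""
  }
}'; echo

Anyway, here is the cluster health after the restarts: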
[root@logs:~] #curl http://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 11,
  "active_shards" : 17,
  "relocating_shards" : 0,
  "initializing_shards" : 1,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
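I haven't tracked down exactly which shards are the unassigned ones, but I assume the cat shards API would show where the initializing and unassigned shards are (this is just how I'd check, not output from the cluster):

# List every shard with its index, state (STARTED / INITIALIZING / UNASSIGNED)
# and the node it is currently assigned to, if any.
curl 'localhost:9200/_cat/shards?v'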
We're just using this cluster for Logstash, and we didn't need any historical data in the logs, so I just deleted all the indices using curator, restarted all three Elasticsearch nodes, then restarted Logstash and started over. The indices are also backed up once per week in case they're ever needed.
Right now we only have two Logstash indices (plus the .kibana index) since clearing out all the old ones:
[root@logs:~] #curator show indices --all-indices
2015-11-01 15:44:26,628 INFO Job starting: show indices
2015-11-01 15:44:26,643 INFO Matching all indices. Ignoring flags other than --exclude.
2015-11-01 15:44:26,643 INFO Matching indices:
.kibana
logstash-2015.10.31
logstash-2015.11.01
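For completeness, the cleanup I mentioned above was something along these lines (I'm reconstructing this from memory, so the exact flags may differ depending on the curator version):

# Delete every index in the cluster; we only keep Logstash data here and
# it is backed up weekly, so losing it is acceptable for us.
curator delete indices --all-indices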
In the ES logs, I'm seeing these messages:
[2015-11-01 14:05:01,189][WARN ][cluster.action.shard ] [JF_ES1] [logstash-2015.11.01][4] received shard failed for [logstash-2015.11.01][4], node[tFW2k6GdQY6_cJTDf2gbsg], relocating [4-YOtkA2T5WKtEr-j7ivnA], [P], s[RELOCATING], indexUUID [XhSTppLqRWi16GNbpkL2PA], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [SendRequestTransportException[[JF_ES2][inet[/10.10.10.5:9300]][indices:data/write/bulk[s][r]]]; nested: NodeNotConnectedException[[JF_ES2][inet[/10.10.10.5:9300]] Node not connected]; ]]
[2015-11-01 14:05:01,240][WARN ][cluster.action.shard ] [JF_ES1] [logstash-2015.11.01][1] received shard failed for [logstash-2015.11.01][1], node[363aqoP9QCWoBr8vAkrnZw], relocating [4-YOtkA2T5WKtEr-j7ivnA], [R], s[RELOCATING], indexUUID [XhSTppLqRWi16GNbpkL2PA], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [SendRequestTransportException[[JF_ES2][inet[/10.10.10.5:9300]][indices:data/write/bulk[s][r]]]; nested: NodeNotConnectedException[[JF_ES2][inet[/10.10.10.5:9300]] Node not connected]; ]]
It's saying that node 2 is not connected, but I am able to telnet from the 1st ES node to the 2nd one on both ports 9300 and 9200:
[root@logs:~] #telnet es2.example.com 9300
Trying 10.10.10.5...
Connected to es2.example.com.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
[root@logs:~] #telnet es2.example.com 9200
Trying 10.10.10.5...
Connected to es2.example.com.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
I can also telnet from the 2nd ES node to the 1st ES node, though I won't demonstrate that here.
The node that the logs are complaining about is the 2nd node (JF_ES2), which is the one that just replaced the dead node.
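One thing I was planning to check is whether Elasticsearch itself sees all three nodes, which I believe the cat nodes API would show (again, just how I'd check it, not output from the cluster):

# List the nodes the cluster currently sees, with their IPs and roles;
# JF_ES2 (10.10.10.5 in this sanitized example) should appear here.
curl 'localhost:9200/_cat/nodes?v'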
What can I do to correct this problem and return the cluster to a green state?
Thanks