I configured Elasticsearch to run on two nodes in unicast mode; both nodes are master-eligible data nodes. I then wrote my program to initialize a transport client that connects to both nodes. One node failed, either because the network was slow or because the node itself died. Meanwhile, a scheduled job was indexing a large amount of data into the cluster. The transport client started to complain repeatedly that one node was unavailable, and the whole cluster ended up in a broken state. Below is a sample of the failure messages I got in the log after I bounced the cluster. What can I do to prevent this from happening?
WARNING: [Blackout] [coverage-elastic1345266122391] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [coverage-elastic1345266122391] shard allocated for local recovery (post api), should exists, but doesn't
My understanding is that Elasticsearch is built to keep this from happening: when one node dies, the other node should automatically take over the master role, and when the dead node comes back, or the whole cluster is bounced, that node should be recovered automatically from the healthy node. Am I wrong?
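For reference, this is roughly what the setup described above looks like in each node's elasticsearch.yml. It is only a sketch: the cluster name and host names are placeholders rather than my real values, and I am assuming the standard zen discovery settings for unicast.

```yaml
# Placeholder cluster name (assumption, not the real one)
cluster.name: my-cluster

# Both nodes are master-eligible and hold data
node.master: true
node.data: true

# Unicast discovery: multicast off, both nodes listed explicitly
discovery.zen.ping.multicast.enabled: false
# Placeholder host names (assumption)
discovery.zen.ping.unicast.hosts: ["es-node1:9300", "es-node2:9300"]
```

The transport client in my program is pointed at the transport addresses of the same two nodes.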