Not sure how good an idea this is, but I'm thinking of adding a few lines to the systemd unit file for Elasticsearch so the service will automatically restart if it dies.
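For illustration, the "few lines" could look something like the drop-in below. This is only a sketch under assumptions: the service is assumed to be named `elasticsearch.service` (as installed by the official packages), the drop-in path is the one `systemctl edit elasticsearch` would normally create, and the numeric values are made up for the example, not tuned recommendations. The start limit is there so that a node that keeps crashing stops being restarted after a few attempts.

```ini
# /etc/systemd/system/elasticsearch.service.d/override.conf  (assumed drop-in path)

[Unit]
# Stop trying if the service has been started 3 times within 10 minutes.
# Values are illustrative. On systemd older than v230 the equivalent
# setting is StartLimitInterval= in the [Service] section.
StartLimitIntervalSec=600
StartLimitBurst=3

[Service]
# Restart only after an unclean exit (non-zero exit code, killed by a
# signal such as the OOM killer's SIGKILL, or a timeout), not after a
# normal stop.
Restart=on-failure
# Wait 30 seconds before each restart attempt (illustrative value).
RestartSec=30
```

After creating or changing the file, `systemctl daemon-reload` is needed for systemd to pick it up. Once the start limit is hit the unit stays in the failed state until someone intervenes, for example with `systemctl reset-failed elasticsearch`.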
If a node crashes, it is because something catastrophic has happened, and it is likely to crash again if it restarts. So it's better to be notified of the incident and analyze the cause, so the node can come back online safely (instead of going into an endless restart loop that will do more harm to the cluster than good).
I thought about that. I should probably spend more time on finding the root cause of my problems. Some sort of OOM issue brought down my 20-node cluster the other day. After the nodes were started again, the cluster has been pretty stable for a few days. So a restart "fixed" the problem in the short term. But the long-term solution is to do the legwork and find the underlying issue...
Will hold off on restarting Elasticsearch automatically then.
In this case, restarting could solve the issue. But what if it makes things worse? You cannot account for every possible failure, so there is always a chance that a restart will make it worse. Because of this, it's simply better to keep monitoring the cluster and take proper action when a failure occurs.