I'm in a strange situation.
I started with a single-node cluster running Elasticsearch, Kibana and Logstash. After analysing our data volumes we decided to expand the cluster, so we added 2 nodes.
Now the configuration is:
A-node -> Elasticsearch, Kibana, Logstash
B-node -> Elasticsearch, Logstash
C-node -> Elasticsearch, Kibana
where Kibana is load-balanced via a virtual IP.
The nodes are virtual RHEL machines with 64 GB of RAM and 4 CPUs each.
A-node is the node that was originally alone. The configuration of the three nodes is now identical except for the names, but A-node keeps crashing, with no error logs and no sign of resource exhaustion on the machine. I've tried reinstalling Logstash and Elasticsearch, but nothing changed.
When I stop Logstash on A-node, the node seems to work fine. No extra jobs are scheduled on the machine at the time the node shuts down.
What could be the cause? Which machine or ELK configuration should I check?
Can you share the last few logged messages from the node that's shutting down, and tell us what time it shut down (so we can see how much time elapsed between the logged messages and the shutdown)?
Also, can you share the output of dmesg so we can see whether the operating system is stopping the process for some reason?
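For the Elasticsearch side, the tail of the log is usually enough. On a default RPM install something like the following should do (the path and file name are assumptions based on the standard layout, where the log is named after the cluster):

# last Elasticsearch log lines, assuming a default RPM install layout
tail -n 100 /var/log/elasticsearch/*.log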
The dmesg output is too large to copy in full. Are you interested in a specific detail?
Below are the last ES log lines from the machine:
[2019-04-02T17:06:13,464][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][5498] overhead, spent [260ms] collecting in the last [1s]
[2019-04-02T17:06:25,569][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][5510] overhead, spent [258ms] collecting in the last [1s]
[2019-04-02T17:14:41,798][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][6004] overhead, spent [260ms] collecting in the last [1s]
[2019-04-02T17:53:34,328][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8327] overhead, spent [298ms] collecting in the last [1s]
[2019-04-02T17:54:00,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8353] overhead, spent [308ms] collecting in the last [1s]
[2019-04-02T17:54:01,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8354] overhead, spent [301ms] collecting in the last [1s]
[2019-04-02T17:54:02,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8355] overhead, spent [375ms] collecting in the last [1s]
[2019-04-02T17:54:03,521][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8356] overhead, spent [380ms] collecting in the last [1s]
[2019-04-02T17:59:25,642][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8677] overhead, spent [289ms] collecting in the last [1s]
[2019-04-02T18:19:14,331][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][9862] overhead, spent [298ms] collecting in the last [1s]
[2019-04-02T18:19:15,331][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][9863] overhead, spent [281ms] collecting in the last [1s]
[2019-04-02T21:13:51,410][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][20312] overhead, spent [356ms] collecting in the last [1s]
[2019-04-03T06:41:43,290][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][54333] overhead, spent [380ms] collecting in the last [1s]
[2019-04-03T15:40:34,961][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][86613] overhead, spent [295ms] collecting in the last [1.1s]
[2019-04-03T23:55:56,859][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][116288] overhead, spent [349ms] collecting in the last [1s]
I'm specifically looking for an indication of whether the OS shut down the Elasticsearch process, perhaps via the OOM killer. The only two ways I know of for the process to shut down without logging anything are the OOM killer (which logs to dmesg) or a kill -9 from elsewhere.
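If the full dmesg output is too large to paste, filtering it for OOM-killer activity should be enough. A rough sketch (the exact wording of the kernel messages varies by kernel version, so treat the pattern as an approximation):

# look for OOM-killer activity; drop -T if your dmesg doesn't support it
dmesg -T | grep -iE 'out of memory|oom|killed process'

If that turns up a "Killed process" line naming the Elasticsearch PID, then the OOM killer is the culprit.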
Okay. But the biggest problem is that after ES shuts down, the server is no longer reachable via SSH or any other shell. I want to rule out every other possibility before assuming it's a VM corruption problem.