[ELK] node shutdown

Good morning,

I'm in a strange situation.
I started with a single-node cluster running Elasticsearch, Kibana and Logstash. After analysing our data volumes we decided to expand the cluster, so we added 2 nodes.
Now the configuration is:
A-node -> Elasticsearch, Kibana, Logstash
B-node -> Elasticsearch, Logstash
C-node -> Elasticsearch, Kibana
where Kibana is balanced via a virtual IP.

The nodes are virtual RHEL machines with 64 GB of RAM and 4 CPUs.

A-node is the node that was originally alone. The configuration of the three nodes is now identical except for the names, but A-node keeps crashing, with no errors in the logs and no sign that the machine is short of resources. I've tried reinstalling Logstash and Elasticsearch, but nothing changed.
When I stop Logstash on A-node the node seems to work fine. No extra jobs are scheduled on the machine when the node shuts down.
What could be the cause? Which machine or ELK settings should I check?

Thanks a lot.

Can you share the last few logged messages from the node that's shutting down, and tell us what time it shut down (so we can see how long elapsed between the logged messages and the shutdown)?

Also can you share the output of dmesg to see if the operating system is stopping the process for some reason?

Thank you for the answer.

The dmesg output is too large to copy in full. Are you interested in a specific detail?

Here are the last ES log entries from the machine.

[2019-04-02T17:06:13,464][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][5498] overhead, spent [260ms] collecting in the last [1s]
[2019-04-02T17:06:25,569][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][5510] overhead, spent [258ms] collecting in the last [1s]
[2019-04-02T17:14:41,798][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][6004] overhead, spent [260ms] collecting in the last [1s]
[2019-04-02T17:53:34,328][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8327] overhead, spent [298ms] collecting in the last [1s]
[2019-04-02T17:54:00,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8353] overhead, spent [308ms] collecting in the last [1s]
[2019-04-02T17:54:01,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8354] overhead, spent [301ms] collecting in the last [1s]
[2019-04-02T17:54:02,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8355] overhead, spent [375ms] collecting in the last [1s]
[2019-04-02T17:54:03,521][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8356] overhead, spent [380ms] collecting in the last [1s]
[2019-04-02T17:59:25,642][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8677] overhead, spent [289ms] collecting in the last [1s]
[2019-04-02T18:19:14,331][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][9862] overhead, spent [298ms] collecting in the last [1s]
[2019-04-02T18:19:15,331][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][9863] overhead, spent [281ms] collecting in the last [1s]
[2019-04-02T21:13:51,410][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][20312] overhead, spent [356ms] collecting in the last [1s]
[2019-04-03T06:41:43,290][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][54333] overhead, spent [380ms] collecting in the last [1s]
[2019-04-03T15:40:34,961][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][86613] overhead, spent [295ms] collecting in the last [1.1s]
[2019-04-03T23:55:56,859][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][116288] overhead, spent [349ms] collecting in the last [1s]

The node crashed at 08:00 AM on 04-04-2019.
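
To see how long the node was silent before it stopped, the gap between the last logged message and the crash time can be computed with a few lines of Python. This is only a sketch: the function names are made up for illustration, it assumes the log lines are in chronological order, and the crash time is the approximate 08:00 figure above.

```python
import re
from datetime import datetime

# Leading timestamp of an Elasticsearch log line, e.g. [2019-04-02T17:06:13,464]
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}),(\d{3})\]")

def last_log_time(log_text):
    """Return the timestamp of the last parseable log line, or None."""
    last = None
    for line in log_text.splitlines():
        match = TS_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
            # The three digits after the comma are milliseconds.
            last = stamp.replace(microsecond=int(match.group(2)) * 1000)
    return last

def silence_before(log_text, stop_time):
    """Return the time between the last log line and stop_time, or None."""
    last = last_log_time(log_text)
    return None if last is None else stop_time - last

# Last line of the log excerpt above, and the reported crash time.
last_line = (
    "[2019-04-03T23:55:56,859][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] "
    "[gc][116288] overhead, spent [349ms] collecting in the last [1s]"
)
print(silence_before(last_line, datetime(2019, 4, 4, 8, 0)))  # prints 8:04:03.141000
```

So roughly eight hours passed between the last logged message and the crash.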

I'm specifically looking for an indication whether the OS shut down the Elasticsearch process, perhaps due to the OOM killer. The only two ways I know for the process to shut down without logging anything are either the OOM killer (which logs to dmesg) or else due to a kill -9 from elsewhere.
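
Since the full dmesg output is too large to paste, it can be filtered down to the lines that matter. Here is a rough sketch in Python; the function name is hypothetical, the phrases are ones the Linux kernel commonly prints when the OOM killer acts (exact wording varies between kernel versions), and the sample lines are invented for illustration rather than taken from the machine in question.

```python
import re

# Phrases commonly seen in kernel messages when the OOM killer acts.
# Wording differs between kernel versions, so treat this as a starting point.
OOM_RE = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

def find_oom_lines(dmesg_text):
    """Return the lines of dmesg output that look like OOM-killer activity."""
    return [line for line in dmesg_text.splitlines() if OOM_RE.search(line)]

# Invented sample input, only to show the shape of the result.
sample = "\n".join([
    "[  100.000000] eth0: link up",
    "[  200.000000] java invoked oom-killer: gfp_mask=0x0, order=0",
    "[  200.000001] Out of memory: Kill process 1234 (java) score 900",
    "[  200.000002] Killed process 1234 (java)",
])
for line in find_oom_lines(sample):
    print(line)
```

To use it on a real machine, save the output of dmesg to a file, read the file into a string, and pass that string to `find_oom_lines`. An empty result does not rule out other causes, it only means no line matched these phrases.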

Okay. But the biggest problem is that after ES shuts down, the server is no longer reachable over SSH or any other shell. I want to rule out every other possibility before assuming it is a VM corruption problem.

Ah, ok, I thought by "node shuts down" you meant the Elasticsearch process. You mean the whole VM stops responding?

Can you expand on this? How did you determine that the machine has enough resources?

Also, is swap enabled on this machine?

Yes, the whole server stops responding.

After the first server failure I installed Metricbeat to see how many resources the processes were using; everything looks normal just before the crash.

Yes, swap is enabled.

Thank you very much.

Running Elasticsearch with swap enabled is not recommended; it can lead to thrashing. I suggest disabling swap, as the Elasticsearch documentation advises.
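
On Linux, swap is usually turned off with `swapoff -a` (plus removing the swap entries from /etc/fstab so it stays off after a reboot), and Elasticsearch also has the `bootstrap.memory_lock: true` setting to keep its memory from being swapped out. To confirm whether swap is still active, the `SwapTotal` line of /proc/meminfo can be checked. The following is a small sketch with hypothetical function names, and the sample values are invented.

```python
def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line format: "SwapTotal:       8388604 kB"
            return int(line.split()[1])
    return None

def swap_enabled(meminfo_text):
    """True if the text reports a non-zero amount of swap."""
    total = swap_total_kb(meminfo_text)
    return total is not None and total > 0

# Invented sample values, only to show both outcomes.
with_swap = "MemTotal:       65807036 kB\nSwapTotal:       8388604 kB\n"
without_swap = "MemTotal:       65807036 kB\nSwapTotal:             0 kB\n"
print(swap_enabled(with_swap))     # prints True
print(swap_enabled(without_swap))  # prints False
```

On the node itself, pass the contents of /proc/meminfo to `swap_enabled` after changing the settings.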

I'll update the machine settings. I hope it will help solve my problem. Thanks a lot.
