I'm in a strange situation.
I started with a single-node cluster running Elasticsearch, Kibana and Logstash. After analysing our data volumes we decided to expand the cluster, so we added 2 nodes.
Now the configuration is:
A-node -> Elasticsearch, Kibana, Logstash
B-node -> Elasticsearch, Logstash
C-node -> Elasticsearch, Kibana
where Kibana is load-balanced via a virtual IP.
The nodes are virtual RHEL machines with 64 GB of RAM and 4 CPUs each.
A-node is the node that was originally alone. The configuration of the three nodes is now identical except for the names, but A-node keeps crashing, with no error logs and no sign of resource exhaustion on the machine. I've tried reinstalling Logstash and Elasticsearch, but nothing changed.
When I stop Logstash on A-node, the node seems to work fine. No extra jobs are scheduled on the machine at the time the node shuts down.
What could be the cause? Which machine or ELK configuration should I check?
Can you share the last few logged messages from the node that's shutting down, and tell us what time it shut down (so we can see how much time elapsed between the logged messages and the shutdown)?
Also, can you share the output of dmesg so we can see whether the operating system is stopping the process for some reason?
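For the Elasticsearch side, the tail of the log is usually enough. On a default RPM install something like the following should do (the path and file name are assumptions based on the standard layout, where the log is named after the cluster):

# last Elasticsearch log lines, assuming a default RPM install layout
tail -n 100 /var/log/elasticsearch/*.log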
The dmesg output is too large to copy in full. Are you interested in a specific detail?
Below are the last ES log lines from the machine:
[2019-04-02T17:06:13,464][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][5498] overhead, spent [260ms] collecting in the last [1s]
[2019-04-02T17:06:25,569][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][5510] overhead, spent [258ms] collecting in the last [1s]
[2019-04-02T17:14:41,798][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][6004] overhead, spent [260ms] collecting in the last [1s]
[2019-04-02T17:53:34,328][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8327] overhead, spent [298ms] collecting in the last [1s]
[2019-04-02T17:54:00,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8353] overhead, spent [308ms] collecting in the last [1s]
[2019-04-02T17:54:01,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8354] overhead, spent [301ms] collecting in the last [1s]
[2019-04-02T17:54:02,479][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8355] overhead, spent [375ms] collecting in the last [1s]
[2019-04-02T17:54:03,521][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8356] overhead, spent [380ms] collecting in the last [1s]
[2019-04-02T17:59:25,642][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][8677] overhead, spent [289ms] collecting in the last [1s]
[2019-04-02T18:19:14,331][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][9862] overhead, spent [298ms] collecting in the last [1s]
[2019-04-02T18:19:15,331][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][9863] overhead, spent [281ms] collecting in the last [1s]
[2019-04-02T21:13:51,410][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][20312] overhead, spent [356ms] collecting in the last [1s]
[2019-04-03T06:41:43,290][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][54333] overhead, spent [380ms] collecting in the last [1s]
[2019-04-03T15:40:34,961][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][86613] overhead, spent [295ms] collecting in the last [1.1s]
[2019-04-03T23:55:56,859][INFO ][o.e.m.j.JvmGcMonitorService] [nodo-d] [gc][116288] overhead, spent [349ms] collecting in the last [1s]
I'm specifically looking for an indication of whether the OS shut down the Elasticsearch process, perhaps via the OOM killer. The only two ways I know of for the process to shut down without logging anything are the OOM killer (which logs to dmesg) or a kill -9 from elsewhere.
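If the full dmesg output is too large to paste, filtering it for OOM-killer activity should be enough. A rough sketch (the exact wording of the kernel messages varies by kernel version, so treat the pattern as an approximation):

# look for OOM-killer activity; drop -T if your dmesg doesn't support it
dmesg -T | grep -iE 'out of memory|oom|killed process'

If that turns up a "Killed process" line naming the Elasticsearch PID, then the OOM killer is the culprit.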
Okay. But the biggest problem is that after ES shuts down, the server is no longer reachable via SSH or any other shell. I want to rule out every other possibility before assuming it's a VM corruption problem.