ES node killed with 6/ABRT - how to find out why and by whom?

igor_k · June 22, 2015, 3:33pm

Hi Elasticsearch Community,

One of the nodes in our cluster got restarted recently.

There is no info in the logs, just this. These are the first three lines for that day:

[2015-06-21 09:21:27,753][INFO ][node                     ] [sjc-elasticsearch-data03-si] version[1.3.7], pid[31096], build[3042293/2014-12-16T13:59:32Z]
[2015-06-21 09:21:27,754][INFO ][node                     ] [sjc-elasticsearch-data03-si] initializing ...
[2015-06-21 09:21:28,101][INFO ][plugins                  ] [sjc-elasticsearch-data03-si] loaded [action-updatebyquery, analysis-icu], sites [HQ, bigdesk, kopf]

It looks like the node was started at 09:21:27, but there is no info on why it was stopped in the first place.

Other nodes report this:

[2015-06-21 09:02:53,599][DEBUG][action.admin.indices.stats] [sjc-elasticsearch-client01-si] [mdb-pod101-7][9], node[72pxkCzKQ324np0VIXUAkQ], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@6c2238d3]
org.elasticsearch.transport.NodeDisconnectedException: [sjc-elasticsearch-data03-si][inet[/10.255.1.213:9300]][indices/stats/s] disconnected

In the messages log I see this:

Jun 21 09:02:53 sjc-elasticsearch-data03 systemd: elasticsearch-sjc-elasticsearch-data03.service: main process exited, code=killed, status=6/ABRT
Jun 21 09:02:53 sjc-elasticsearch-data03 systemd: Unit elasticsearch-sjc-elasticsearch-data03.service entered failed state.

Looks like the node was killed with SIGABRT.
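
For anyone reading along, here is a minimal Python sketch of how the `status=6/ABRT` fragment maps to a signal. The regular expression and the `parse_exit_status` helper are my own illustration (not part of systemd or Elasticsearch), and they assume the log line has the same shape as the one quoted above.

```python
import re
import signal

# Matches the "code=killed, status=6/ABRT" fragment seen in the systemd line above.
# The pattern is an assumption based on that one line, not a systemd specification.
STATUS_RE = re.compile(r"code=(?P<code>\w+), status=(?P<num>\d+)/(?P<label>\w+)")


def parse_exit_status(line):
    """Return (code, number, signal_name) or None if the line has no status part."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = match.group("code")
    num = int(match.group("num"))
    # Assumption: only "killed" and "dumped" carry a signal number;
    # for other codes the number is treated as a plain exit status.
    name = signal.Signals(num).name if code in ("killed", "dumped") else None
    return code, num, name


line = (
    "Jun 21 09:02:53 sjc-elasticsearch-data03 systemd: "
    "elasticsearch-sjc-elasticsearch-data03.service: main process exited, "
    "code=killed, status=6/ABRT"
)
print(parse_exit_status(line))
```

Signal 6 is SIGABRT on Linux, which is what a process receives when `abort()` is called, so this alone does not say whether the signal came from inside the JVM or from another process.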

We use puppet for automation, but there was no puppet run at that time.

There was no GC-related info in the logs, so I do not suspect heap issues (the node had been running for many days before).

The load was standard at that time.

The OS was up the whole time.

Java version:

[ikupczynski@sjc-elasticsearch-data03 ~]$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (rhel-2.5.5.1.el7_1-x86_64 u79-b14)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

ES version:

{
  "status" : 200,
  "name" : "sjc-elasticsearch-client01-si",
  "version" : {
    "number" : "1.3.7",
    "build_hash" : "3042293e4b219dfb855a4e6c64241c530d1abeb0",
    "build_timestamp" : "2014-12-16T13:59:32Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

This is a VM in Google Compute Engine.

Since the restart it has been working fine.

Can you advise me on what the reason might be, or how I can debug it further?

Thanks, Igor

warkolm · June 23, 2015, 5:59am

Check your OS log to see whether anyone sudo'd a restart and/or whether puppet did something.

Otherwise you're probably not going to have much luck here unless you have something (like Shield) recording any API requests.
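
A rough sketch of that kind of log check, assuming classic syslog-style lines ("Jun 21 09:02:53 host program: message"). The keyword list, the five-minute window, and the `suspicious_lines` name are all assumptions for illustration; adjust them to whatever your distribution actually logs.

```python
from datetime import datetime, timedelta

# Hypothetical keywords worth a second look around the time of the kill.
KEYWORDS = ("sudo", "systemctl", "kill", "puppet")


def suspicious_lines(lines, event_time, window_minutes=5, year=2015):
    """Return log lines near event_time that mention one of KEYWORDS."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        try:
            # Classic syslog timestamps omit the year, so the caller supplies it.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a syslog-style line, skip it
        if abs(stamp - event_time) <= window and any(k in line for k in KEYWORDS):
            hits.append(line.rstrip("\n"))
    return hits


# Example use (the path is an assumption, typical for RHEL-style systems):
# with open("/var/log/secure") as fh:
#     for hit in suspicious_lines(fh, datetime(2015, 6, 21, 9, 2, 53)):
#         print(hit)
```

An empty result does not prove nobody intervened; it only means nothing matching these keywords was logged in that window.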

igor_k · June 24, 2015, 9:26am

Hi Mark,

There is nothing in the os/puppet logs which can indicate an
intervention. The server (or rather the VM) was up all the time.

We have a haproxy in front of the cluster, but in the access log again
there is nothing suspicious.

Probably this will stay unsolved; since it was a one-time event, I do not
want to overdo it...

Thanks, Igor

Simon_Thorley · December 15, 2015, 11:15am

Hi,

Did this ever get resolved? I have almost the exact same issue:

Dec 14 15:57:13 elk-esnode03 systemd[1]: elasticsearch-es01.service: main process exited, code=killed, status=6/ABRT
Dec 14 15:57:13 elk-esnode03 systemd[1]: Unit elasticsearch-es01.service entered failed state.

This too is a puppet-managed box, but puppet does not control the Elasticsearch setup.

I am also getting the same NotSerializableExceptionWrapper error in my logs:

[2015-12-15 10:24:55,852][DEBUG][action.admin.cluster.node.info] [elk-esnode03-b] failed to execute on node [hSQKljERQdWucsEMCL-p8g]
RemoteTransportException[[elk-esnode01-a][10.218.38.120:9300][cluster:monitor/nodes/info[n]]]; nested: NotSerializableExceptionWrapper;
Caused by: NotSerializableExceptionWrapper[null]

This is Elasticsearch 2.1.0 with the same versions of Shield and marvel-agent.

In my elasticsearch.yml I have:

shield.transport.filter.allow: "10.218.38.0/24"

to allow transport communication between the cluster nodes, but it seems strange that all the errors in my logs are transport connection errors, and only for the action.admin.cluster.node.info action.
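
As a quick sanity check on the filter value itself, here is a small Python sketch confirming that the remote node address from the log above falls inside the allowed range. It only does the CIDR arithmetic with the standard `ipaddress` module; it says nothing about how Shield evaluates the rule, and `is_allowed` is a made-up helper name.

```python
import ipaddress


def is_allowed(address, cidr):
    """Return True if address lies inside the given CIDR range."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


# 10.218.38.120 is the elk-esnode01-a address from the RemoteTransportException above.
print(is_allowed("10.218.38.120", "10.218.38.0/24"))
```

If the address is inside the range, as it is here, the subnet value is at least not the obvious culprit for the transport errors.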

Thanks.

igor_k · December 15, 2015, 11:35am

Hi Simon,

Unfortunately, we were not able to find a root cause. This happened once or twice some time ago and we've never seen it again.

Thanks,
Igor

Simon_Thorley · March 18, 2016, 2:11pm

Just in case anyone else gets to this page...

My issue was a faulty CPU. The BIOS was updated to try to fix it, but it ended up needing a replacement CPU.

This has been fine since.

warkolm · March 18, 2016, 10:06pm

That's pretty obscure, glad you found it.
