ES node killed with 6/ABRT - how to find out why and by whom?


(Igor Kupczyński) #1

Hi Elasticsearch Community,

One of the nodes in our cluster got restarted recently.

There is no info in the logs, just this. These are the first three lines for that day:

[2015-06-21 09:21:27,753][INFO ][node                     ] [sjc-elasticsearch-data03-si] version[1.3.7], pid[31096], build[3042293/2014-12-16T13:59:32Z]
[2015-06-21 09:21:27,754][INFO ][node                     ] [sjc-elasticsearch-data03-si] initializing ...
[2015-06-21 09:21:28,101][INFO ][plugins                  ] [sjc-elasticsearch-data03-si] loaded [action-updatebyquery, analysis-icu], sites [HQ, bigdesk, kopf]

It looks like the node was started at 09:21:27, but there is no info on why it was stopped in the first place.

Other nodes report this:

[2015-06-21 09:02:53,599][DEBUG][action.admin.indices.stats] [sjc-elasticsearch-client01-si] [mdb-pod101-7][9], node[72pxkCzKQ324np0VIXUAkQ], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@6c2238d3]
org.elasticsearch.transport.NodeDisconnectedException: [sjc-elasticsearch-data03-si][inet[/10.255.1.213:9300]][indices/stats/s] disconnected

In the messages log I see this:

Jun 21 09:02:53 sjc-elasticsearch-data03 systemd: elasticsearch-sjc-elasticsearch-data03.service: main process exited, code=killed, status=6/ABRT
Jun 21 09:02:53 sjc-elasticsearch-data03 systemd: Unit elasticsearch-sjc-elasticsearch-data03.service entered failed state.

Looks like the node was killed with SIGABRT.
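
For anyone decoding a similar line, here is a minimal sketch in Python that pulls the exit code and signal number out of the systemd message and maps the number to a signal name. The helper name `decode_exit` and the regular expression are illustrative assumptions, not part of systemd or Elasticsearch, and the number-to-name mapping is the one for the platform the script runs on (Linux here).

```python
import re
import signal

# Matches the "main process exited" line that systemd writes to the messages log.
EXIT_RE = re.compile(
    r"main process exited, code=(?P<code>\w+), status=(?P<num>\d+)/(?P<label>\w+)"
)

def decode_exit(line):
    """Hypothetical helper: return the exit details from a systemd log line, or None."""
    match = EXIT_RE.search(line)
    if match is None:
        return None
    num = int(match.group("num"))
    info = {
        "code": match.group("code"),
        "status": num,
        "label": match.group("label"),
    }
    # For code=killed (or dumped), the status number is a signal number.
    if info["code"] in ("killed", "dumped"):
        info["signal"] = signal.Signals(num).name
    return info

line = ("Jun 21 09:02:53 sjc-elasticsearch-data03 systemd: "
        "elasticsearch-sjc-elasticsearch-data03.service: "
        "main process exited, code=killed, status=6/ABRT")
print(decode_exit(line))
```

Note that SIGABRT is often raised by the process itself through `abort()` rather than sent by another user, so "by whom" may turn out to be the JVM.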

We use puppet for automation, but there was no puppet run at that time.

There was no GC-related info in the logs, so I do not suspect heap issues (the node had been running for many days before this).

The load was standard at that time.

The OS was up the whole time.

Java version:

[ikupczynski@sjc-elasticsearch-data03 ~]$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (rhel-2.5.5.1.el7_1-x86_64 u79-b14)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

ES version:

{
  "status" : 200,
  "name" : "sjc-elasticsearch-client01-si",
  "version" : {
    "number" : "1.3.7",
    "build_hash" : "3042293e4b219dfb855a4e6c64241c530d1abeb0",
    "build_timestamp" : "2014-12-16T13:59:32Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

This is a VM in Google Compute Engine.

Since the restart it has been working fine.

Can you advise what the cause might be, or how I can debug it further?

Thanks, Igor


(Mark Walkom) #2

Check your OS log to see if anyone sudo'd a restart and/or whether puppet did something.

Otherwise you're probably not going to have much luck here unless you have something (like Shield) recording any API requests.
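
As a rough illustration of the OS log check, the sketch below filters syslog-style lines for `sudo` entries within a few minutes of the kill. The function name `lines_near`, the five-minute window, and the assumption that each line starts with a classic `Jun 21 09:02:53` timestamp (English month names, no year) are illustrative choices, not something taken from this thread.

```python
from datetime import datetime, timedelta

def lines_near(lines, when, window_seconds=300, needle="sudo", year=2015):
    """Hypothetical helper: return lines containing `needle` logged close to `when`."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        # Classic syslog timestamps are 15 characters long and omit the year,
        # so the year has to be supplied by the caller.
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, skip it
        if abs(stamp - when) <= window and needle in line:
            hits.append(line.rstrip("\n"))
    return hits
```

It would be called with the lines of whichever file your distribution writes sudo entries to, and with `when` set to the time of the `status=6/ABRT` message, for example `datetime(2015, 6, 21, 9, 2, 53)`.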


(Igor Kupczyński) #3

Hi Mark,

There is nothing in the OS or puppet logs that would indicate an
intervention. The server (or rather the VM) was up the whole time.

We have a haproxy in front of the cluster, but in the access log again
there is nothing suspicious.

This will probably stay unsolved. Since it was a one-time event, I do not
want to overdo it...

Thanks, Igor


(Simon Thorley) #4

Hi,

Did this ever get resolved? I have almost exactly the same issue:

Dec 14 15:57:13 elk-esnode03 systemd[1]: elasticsearch-es01.service: main process exited, code=killed, status=6/ABRT
Dec 14 15:57:13 elk-esnode03 systemd[1]: Unit elasticsearch-es01.service entered failed state.

This box is also set up with puppet, but puppet does not control the Elasticsearch setup.

I am also getting the same NotSerializableExceptionWrapper error in my logs:

[2015-12-15 10:24:55,852][DEBUG][action.admin.cluster.node.info] [elk-esnode03-b] failed to execute on node [hSQKljERQdWucsEMCL-p8g]
RemoteTransportException[[elk-esnode01-a][10.218.38.120:9300][cluster:monitor/nodes/info[n]]]; nested: NotSerializableExceptionWrapper;
Caused by: NotSerializableExceptionWrapper[null]

This is Elasticsearch 2.1.0 with the same versions of Shield and marvel-agent.

In my elasticsearch.yaml I have:

shield.transport.filter.allow: "10.218.38.0/24"

to allow intercluster transport communication, but it seems strange that all the errors in my logs are transport connection errors, and only for the action.admin.cluster.node.info action.

Thanks.


(Igor Kupczyński) #5

Hi Simon,

Unfortunately, we were not able to find a root cause. This happened once or twice some time ago and we have never seen it again.

Thanks,
Igor


(Simon Thorley) #6

Just in case anyone else gets to this page...

My issue was a faulty CPU. The BIOS was updated to try to fix it, but in the end the CPU needed replacing.

This has been fine since.


(Mark Walkom) #7

That's pretty obscure, glad you found it :)

