Hi.
I have an ES topology that consists of the following:
2 x Master node
2 x Data node
2 x Ingest node
Version 5.2.2 on Ubuntu 16.04, recently upgraded from 2.3.4 over the weekend. The cluster had no problems before the upgrade under the same amount of load.
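For reference, the node roles are split in elasticsearch.yml along these lines. This is only a sketch of how I have it set up (node.master / node.data / node.ingest and node.attr.rack are the standard 5.x settings), not a copy of the actual files:

# master-eligible nodes (e.g. els01)
node.master: true
node.data: false
node.ingest: false

# data nodes (els03 / els04 in the logs below)
node.master: false
node.data: true
node.ingest: false

# ingest nodes
node.master: false
node.data: false
node.ingest: true

# the data nodes also carry a rack attribute (side-a / side-b), e.g.
node.attr.rack: side-a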
Every few hours one of my data nodes drops out of the cluster and then recovers on its own. Otherwise everything looks fine, and pings between the servers are also OK.
What kind of logging/debugging do you recommend I enable on my nodes to see why it's happening?
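I'm assuming I can turn up individual loggers dynamically through the cluster settings API rather than editing log4j2.properties and restarting, something along these lines (the loggers here are just my guess based on the warnings below), but I'm not sure which loggers/levels are actually the useful ones:

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.discovery": "DEBUG",
    "logger.org.elasticsearch.cluster.service": "DEBUG"
  }
}'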
With the default log settings, /var/log/elasticsearch/elasticsearch.log shows the following:
Master node 1:
[2017-03-13T18:58:37,879][WARN ][o.e.d.z.PublishClusterStateAction] [els01] timed out waiting for all nodes to process published state [3643] (timeout [30s], pending nodes: [{els04}{qKAxnTFkQv-EMXujFgLEWg}{geExl20MRA6st2MUqNKJEA}{192.168.10.24}{192.168.10.24:9300}{rack=side-b}])
[2017-03-13T18:58:37,897][WARN ][o.e.c.s.ClusterService ] [els01] cluster state update task [shard-started[shard id [[heartbeat-2017.03.13][0]], allocation id [UHKyd2DtRNmdpq4n1Yrkfg], primary term [0], message [after peer recovery]]] took [30s] above the warn threshold of 30s
[2017-03-13T19:43:44,359][WARN ][o.e.m.j.JvmGcMonitorService] [els01] [gc][young][151784][4813] duration [1.4s], collections [1]/[2.1s], total [1.4s]/[43.8s], memory [1.3gb]->[1.1gb]/[1.9gb], all_pools {[young] [265.9mb]->[3.3mb]/[266.2mb]}{[survivor] [10.4mb]->[7.8mb]/[33.2mb]}{[old] [1.1gb]->[1.1gb]/[1.6gb]}
[2017-03-13T19:43:44,359][WARN ][o.e.m.j.JvmGcMonitorService] [els01] [gc][151784] overhead, spent [1.4s] collecting in the last [2.1s]
[2017-03-13T19:59:47,654][WARN ][o.e.d.z.PublishClusterStateAction] [els01] timed out waiting for all nodes to process published state [3724] (timeout [30s], pending nodes: [{els04}{qKAxnTFkQv-EMXujFgLEWg}{geExl20MRA6st2MUqNKJEA}{192.168.10.24}{192.168.10.24:9300}{rack=side-b}])
[2017-03-13T19:59:48,006][WARN ][o.e.c.s.ClusterService ] [els01] cluster state update task [shard-started[shard id [[.marvel-es-1-2017.03.05][0]], allocation id [nnPyZrTITy-mmN2UNmhd3w], primary term [0], message [after peer recovery]]] took [30.3s] above the warn threshold of 30s
[2017-03-13T20:29:23,783][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [els01] failed to execute on node [HUAJtLt4Q2WRekWenXFrtQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [els03][192.168.10.23:9300][cluster:monitor/nodes/stats[n]] request_id [1034488] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-03-13T20:29:23,783][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [els01] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [HUAJtLt4Q2WRekWenXFrtQ]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:247) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$300(TransportNodesAction.java:160) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:219) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1024) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:907) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [els03][192.168.10.23:9300][cluster:monitor/nodes/stats[n]] request_id [1034488] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) ~[elasticsearch-5.2.2.jar:5.2.2]
... 4 more
[2017-03-13T20:29:31,510][WARN ][o.e.t.TransportService ] [els01] Received response for a request that has timed out, sent [43887ms] ago, timed out [13887ms] ago, action [internal:discovery/zen/fd/ping], node [{els03}{HUAJtLt4Q2WRekWenXFrtQ}{oJyZQ0VWTXy_VahANT1QHw}{192.168.10.23}{192.168.10.23:9300}{rack=side-a}], id [1034403]
[2017-03-13T20:29:31,534][WARN ][o.e.t.TransportService ] [els01] Received response for a request that has timed out, sent [22750ms] ago, timed out [7749ms] ago, action [cluster:monitor/nodes/stats[n]], node [{els03}{HUAJtLt4Q2WRekWenXFrtQ}{oJyZQ0VWTXy_VahANT1QHw}{192.168.10.23}{192.168.10.23:9300}{rack=side-a}], id [1034488]
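For completeness, these are the standard endpoints I can hit by hand the next time a node drops, if capturing any of their output would help:

curl -s 'http://localhost:9200/_cat/nodes?v'
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
curl -s 'http://localhost:9200/_nodes/hot_threads'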