Cluster drops several days after enabling TLS/monitoring

This weekend we enabled TLS on our cluster on port 9300 for transport (cluster) communications, enabled X-Pack monitoring, and added a node to the cluster that runs Kibana and Elasticsearch as an ingest node.
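
For reference, this is roughly how we have been sanity-checking the new setup from a shell; the hostname below is a placeholder for one of our nodes, and we are assuming the HTTP layer is still plain HTTP on 9200 with no auth (adjust as needed):

# Confirm the transport port (9300) now presents a certificate (hostname is a placeholder)
openssl s_client -connect prod-es-data-hot-09fc:9300 </dev/null 2>/dev/null | openssl x509 -noout -subject -dates

# Confirm monitoring collection is actually enabled
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep monitoring.collection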

Every few days, though, the entire cluster drops. When we ran curl against the cluster today, immediately after it happened, it returned "master_not_discovered_exception". On the dedicated master we can see it failing to connect to several nodes and then crashing with an OOM error:

[2019-10-17T22:13:49,257][WARN ][o.e.c.InternalClusterInfoService] [prod-es-master-1] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[2019-10-17T22:13:49,268][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [prod-es-master-1] failed to execute on node [AsNd5ftKQHSgfWQkcyxCfw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-es-data-hot-09fc][10.4.0.201:9300][cluster:monitor/nodes/stats[n]] request_id [27054592] timed out after [11919ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.0.jar:6.8.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.0.jar:6.8.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
[2019-10-17T22:13:49,268][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [prod-es-master-1] failed to execute on node [dI92tkzrRQiFygyTlMV8WQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-es-data-hot-07d8][10.4.2.9:9300][cluster:monitor/nodes/stats[n]] request_id [27054594] timed out after [11919ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.0.jar:6.8.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.0.jar:6.8.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
... truncated ...
[2019-10-17T22:14:08,557][WARN ][o.e.c.InternalClusterInfoService] [prod-es-master-1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2019-10-17T22:14:08,558][WARN ][o.e.d.z.PublishClusterStateAction] [prod-es-master-1] timed out waiting for all nodes to process published state [67192] (timeout [30s], pending nodes: [{prod-es-data-warm-0bb5}{hq8OTUjmQ7aUmnN_k4VPig}{drBWHrveQqSnJL1d-bWhjg}{10.4.0.41}{10.4.0.41:9300}{aws_availability_zone=us-west-2a, data_type=warm, ml.machine_memory=64388997120, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-hot-094a}{J9mN6h08SPGWiVVOOqvR2A}{NDK5QpN_Sf2N5LmbXNNRzg}{10.4.2.98}{10.4.2.98:9300}{aws_availability_zone=us-west-2c, data_type=hot, ml.machine_memory=32663171072, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-hot-0a98}{vtcq5ZooSn2Ng2RqEC4AGw}{_A8c5pjjRUCBiYZIiQbZZQ}{10.4.2.162}{10.4.2.162:9300}{aws_availability_zone=us-west-2c, data_type=hot, ml.machine_memory=32663105536, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-percolate-02d6}{k30ah9ZSTEy1wNtxA6QFXg}{invdJc20S96uJxq4kg2pSA}{10.4.2.82}{10.4.2.82:9300}{aws_availability_zone=us-west-2c, data_type=percolate, ml.machine_memory=32663097344, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-warm-0f9d}{FxUO0zsHTe2enzY3MsYx8w}{kNq6aSCLQG-heN5meZFrdQ}{10.4.0.197}{10.4.0.197:9300}{aws_availability_zone=us-west-2a, data_type=warm, ml.machine_memory=64389132288, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-percolate-08c9}{rROyZ4AATk-EPvKpCR-Nbg}{SiYGwb3iTICJ_a0uZDyvIw}{10.4.2.18}{10.4.2.18:9300}{aws_availability_zone=us-west-2c, data_type=percolate, ml.machine_memory=32663097344, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-percolate-0559}{JTIQxU4FR5qzdhvMaySTng}{iiylnx7PRoC4JbuRcmNMXQ}{10.4.2.140}{10.4.2.140:9300}{aws_availability_zone=us-west-2c, data_type=percolate, ml.machine_memory=32663101440, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-hot-0eee}{i3QqkoOOS4KuIvwDOlYIDg}{JeCloM32RB2lwtQcfs-afg}{10.4.2.10}{10.4.2.10:9300}{aws_availability_zone=us-west-2c, data_type=hot, ml.machine_memory=32663113728, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-warm-06f9}{RlZ8wsCkTBa782nj3J0Rjw}{sTfgAENhTSaF468uUIdMdA}{10.4.2.94}{10.4.2.94:9300}{aws_availability_zone=us-west-2c, data_type=warm, ml.machine_memory=64389132288, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}])
[2019-10-17T22:14:16,137][INFO ][o.e.m.j.JvmGcMonitorService] [prod-es-master-1] [gc][old][248712][384] duration [7.5s], collections [1]/[7.5s], total [7.5s]/[1.6m], memory [3.8gb]->[3.8gb]/[3.8gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [15.5mb]->[16.1mb]/[16.6mb]}{[old] [3.7gb]->[3.7gb]/[3.7gb]}
[2019-10-17T22:14:16,137][WARN ][o.e.m.j.JvmGcMonitorService] [prod-es-master-1] [gc][248712] overhead, spent [7.5s] collecting in the last [7.5s]
... truncated ...
[2019-10-17T22:15:05,481][WARN ][o.e.t.TransportService   ] [prod-es-master-1] Received response for a request that has timed out, sent [57869ms] ago, timed out [26161ms] ago, action [internal:discovery/zen/fd/ping], node [{prod-es-master-2}{tZgkR_ptSw2nilh78mTdiQ}{NP-nRhLPRVKmij4pRIJyVA}{10.4.2.217}{10.4.2.217:9300}{aws_availability_zone=us-west-2c, data_type=none, ml.machine_memory=8369979392, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [27055408]
[2019-10-17T22:16:55,854][ERROR][o.e.ExceptionsHelper     ] [prod-es-master-1] fatal error
... truncated ...
java.lang.OutOfMemoryError: Java heap space
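
For anyone following along, the kind of curl checks we run when this happens look roughly like the following (localhost and no HTTP auth are assumptions about our own setup):

# Which node, if any, is currently elected master
curl -s 'http://localhost:9200/_cat/master?v'

# Overall cluster health (this is where we saw master_not_discovered_exception)
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Per-node heap pressure; the master's old gen filling up matches the GC log above
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,ram.percent'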

We weren't using Kibana at all today, but right before the cluster dropped, that node ran out of memory, and it appears to have been the Elasticsearch service that caused it. In the Elasticsearch logs on that node we can see the message "unexpected error while indexing monitoring document", and this is the only ingest node in the cluster. Is it possible that we need to upgrade this Kibana node, and that it is what's causing the cluster to drop? We are running ES 6.8.
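
If it helps with suggestions: one way we can think of to rule monitoring in or out is to temporarily disable collection, which is a dynamic cluster setting in 6.x (no restart needed), and see whether the drops stop. Roughly:

# Temporarily disable X-Pack monitoring collection (host/auth are placeholders)
curl -s -H 'Content-Type: application/json' -X PUT 'http://localhost:9200/_cluster/settings' -d '{"persistent": {"xpack.monitoring.collection.enabled": false}}'

# Put it back afterwards (null resets the setting to its default)
curl -s -H 'Content-Type: application/json' -X PUT 'http://localhost:9200/_cluster/settings' -d '{"persistent": {"xpack.monitoring.collection.enabled": null}}'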
