Cluster drops several days after enabling TLS/monitoring

This weekend we enabled TLS on our cluster on port 9300 for transport (cluster) communications, enabled X-Pack monitoring, and added a node to the cluster that runs Kibana and Elasticsearch as an ingest node.
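
For reference, this is roughly how we have been sanity-checking the new setup from a shell; the hostname below is a placeholder for one of our nodes, and we are assuming the HTTP layer is still plain HTTP on 9200 with no auth (adjust as needed):

# Confirm the transport port (9300) now presents a certificate (hostname is a placeholder)
openssl s_client -connect prod-es-data-hot-09fc:9300 </dev/null 2>/dev/null | openssl x509 -noout -subject -dates

# Confirm monitoring collection is actually enabled
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep monitoring.collection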

Every few days, though, the entire cluster drops. When we ran curl against the cluster today, immediately after it happened, it returned "master_not_discovered_exception". On the dedicated master we can see it failing to connect to several nodes and then crashing with an OOM error:

[2019-10-17T22:13:49,257][WARN ][o.e.c.InternalClusterInfoService] [prod-es-master-1] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[2019-10-17T22:13:49,268][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [prod-es-master-1] failed to execute on node [AsNd5ftKQHSgfWQkcyxCfw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-es-data-hot-09fc][10.4.0.201:9300][cluster:monitor/nodes/stats[n]] request_id [27054592] timed out after [11919ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.0.jar:6.8.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.0.jar:6.8.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
[2019-10-17T22:13:49,268][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [prod-es-master-1] failed to execute on node [dI92tkzrRQiFygyTlMV8WQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-es-data-hot-07d8][10.4.2.9:9300][cluster:monitor/nodes/stats[n]] request_id [27054594] timed out after [11919ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.0.jar:6.8.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.0.jar:6.8.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
... truncated ...
[2019-10-17T22:14:08,557][WARN ][o.e.c.InternalClusterInfoService] [prod-es-master-1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2019-10-17T22:14:08,558][WARN ][o.e.d.z.PublishClusterStateAction] [prod-es-master-1] timed out waiting for all nodes to process published state [67192] (timeout [30s], pending nodes: [{prod-es-data-warm-0bb5}{hq8OTUjmQ7aUmnN_k4VPig}{drBWHrveQqSnJL1d-bWhjg}{10.4.0.41}{10.4.0.41:9300}{aws_availability_zone=us-west-2a, data_type=warm, ml.machine_memory=64388997120, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-hot-094a}{J9mN6h08SPGWiVVOOqvR2A}{NDK5QpN_Sf2N5LmbXNNRzg}{10.4.2.98}{10.4.2.98:9300}{aws_availability_zone=us-west-2c, data_type=hot, ml.machine_memory=32663171072, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-hot-0a98}{vtcq5ZooSn2Ng2RqEC4AGw}{_A8c5pjjRUCBiYZIiQbZZQ}{10.4.2.162}{10.4.2.162:9300}{aws_availability_zone=us-west-2c, data_type=hot, ml.machine_memory=32663105536, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-percolate-02d6}{k30ah9ZSTEy1wNtxA6QFXg}{invdJc20S96uJxq4kg2pSA}{10.4.2.82}{10.4.2.82:9300}{aws_availability_zone=us-west-2c, data_type=percolate, ml.machine_memory=32663097344, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-warm-0f9d}{FxUO0zsHTe2enzY3MsYx8w}{kNq6aSCLQG-heN5meZFrdQ}{10.4.0.197}{10.4.0.197:9300}{aws_availability_zone=us-west-2a, data_type=warm, ml.machine_memory=64389132288, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-percolate-08c9}{rROyZ4AATk-EPvKpCR-Nbg}{SiYGwb3iTICJ_a0uZDyvIw}{10.4.2.18}{10.4.2.18:9300}{aws_availability_zone=us-west-2c, data_type=percolate, ml.machine_memory=32663097344, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-percolate-0559}{JTIQxU4FR5qzdhvMaySTng}{iiylnx7PRoC4JbuRcmNMXQ}{10.4.2.140}{10.4.2.140:9300}{aws_availability_zone=us-west-2c, data_type=percolate, ml.machine_memory=32663101440, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-hot-0eee}{i3QqkoOOS4KuIvwDOlYIDg}{JeCloM32RB2lwtQcfs-afg}{10.4.2.10}{10.4.2.10:9300}{aws_availability_zone=us-west-2c, data_type=hot, ml.machine_memory=32663113728, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {prod-es-data-warm-06f9}{RlZ8wsCkTBa782nj3J0Rjw}{sTfgAENhTSaF468uUIdMdA}{10.4.2.94}{10.4.2.94:9300}{aws_availability_zone=us-west-2c, data_type=warm, ml.machine_memory=64389132288, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}])
[2019-10-17T22:14:16,137][INFO ][o.e.m.j.JvmGcMonitorService] [prod-es-master-1] [gc][old][248712][384] duration [7.5s], collections [1]/[7.5s], total [7.5s]/[1.6m], memory [3.8gb]->[3.8gb]/[3.8gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [15.5mb]->[16.1mb]/[16.6mb]}{[old] [3.7gb]->[3.7gb]/[3.7gb]}
[2019-10-17T22:14:16,137][WARN ][o.e.m.j.JvmGcMonitorService] [prod-es-master-1] [gc][248712] overhead, spent [7.5s] collecting in the last [7.5s]
... truncated ...
[2019-10-17T22:15:05,481][WARN ][o.e.t.TransportService   ] [prod-es-master-1] Received response for a request that has timed out, sent [57869ms] ago, timed out [26161ms] ago, action [internal:discovery/zen/fd/ping], node [{prod-es-master-2}{tZgkR_ptSw2nilh78mTdiQ}{NP-nRhLPRVKmij4pRIJyVA}{10.4.2.217}{10.4.2.217:9300}{aws_availability_zone=us-west-2c, data_type=none, ml.machine_memory=8369979392, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [27055408]
[2019-10-17T22:16:55,854][ERROR][o.e.ExceptionsHelper     ] [prod-es-master-1] fatal error
... truncated ...
java.lang.OutOfMemoryError: Java heap space
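
For anyone following along, the kind of curl checks we run when this happens look roughly like the following (localhost and no HTTP auth are assumptions about our own setup):

# Which node, if any, is currently elected master
curl -s 'http://localhost:9200/_cat/master?v'

# Overall cluster health (this is where we saw master_not_discovered_exception)
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Per-node heap pressure; the master's old gen filling up matches the GC log above
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,ram.percent'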

We weren't using Kibana at all today, but right before the cluster dropped, that node ran out of memory, and it appears to have been the Elasticsearch service that caused it. In the Elasticsearch logs on that node we can see the message "unexpected error while indexing monitoring document", and this is the only ingest node in the cluster. Is it possible that we need to upgrade this Kibana node, and that it is what's causing the cluster to drop? We are running ES 6.8.
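
If it helps with suggestions: one way we can think of to rule monitoring in or out is to temporarily disable collection, which is a dynamic cluster setting in 6.x (no restart needed), and see whether the drops stop. Roughly:

# Temporarily disable X-Pack monitoring collection (host/auth are placeholders)
curl -s -H 'Content-Type: application/json' -X PUT 'http://localhost:9200/_cluster/settings' -d '{"persistent": {"xpack.monitoring.collection.enabled": false}}'

# Put it back afterwards (null resets the setting to its default)
curl -s -H 'Content-Type: application/json' -X PUT 'http://localhost:9200/_cluster/settings' -d '{"persistent": {"xpack.monitoring.collection.enabled": null}}'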
