Kibana state changes to red


(sahere rahimi) #1

Hi all,
I have a cluster with three nodes (235, 236, 237) on Windows. Kibana is installed on all three servers, and each Kibana instance connects to the Elasticsearch node on its own server.
Sometimes the Kibana status changes from green to red. The times below are taken from the .monitoring-kibana index, which records when Kibana turned red.
The Elasticsearch logs for some of these cases are as follows:
1- Case 1:
Kibana red: node-236 on 4-14-2019 at 10:43, 10:44

 [2019-04-14T10:44:51,329][INFO ][o.e.d.z.ZenDiscovery     ] [node-236] master_left [{node-237}{ynuGMeSVRYKsa_-95s0YmA}{o1aAN6vDT1akB5HKXyhZrQ}{0.0.0.237}{0.0.0.237:9300}{ml.machine_memory=8589328384, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
 [2019-04-14T10:44:51,344][WARN ][o.e.d.z.ZenDiscovery     ] [node-236] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: nodes: {node-237}{ynuGMeSVRYKsa_-95s0YmA}{o1aAN6vDT1akB5HKXyhZrQ}{0.0.0.237}{0.0.0.237:9300}{ml.machine_memory=8589328384, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master{node-236}{UWi2vw4-QfqJ_5rDe-j80A}{Ww4jfFFiROu4h-yW-UZLsQ}{0.0.0.236}{0.0.0.236:9300}{ml.machine_memory=8589328384, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local
 [2019-04-14T10:44:54,375][WARN ][o.e.d.z.ZenDiscovery     ] [node-236] not enough master nodes discovered during pinging (found [[Candidate{node={node-236}{UWi2vw4-QfqJ_5rDe-j80A}{Ww4jfFFiROu4h-yW-UZLsQ}{0.0.0.236}{0.0.0.236:9300}{ml.machine_memory=8589328384, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=67330}]], but needed [2]), pinging again
 [2019-04-14T10:44:55,125][WARN ][o.e.d.z.UnicastZenPing   ] [node-236] failed to send ping to [{node-237}{ynuGMeSVRYKsa_-95s0YmA}{o1aAN6vDT1akB5HKXyhZrQ}{0.0.0.237}{0.0.0.237:9300}{ml.machine_memory=8589328384, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-237][0.0.0.237:9300][internal:discovery/zen/unicast] request_id [19137729] timed out after [3860ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1038) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_152]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_152]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_152]

2- Case 2:

Kibana red: node-236 on 4-14-2019 at 10:46

 [2019-04-14T10:45:09,316][WARN ][r.suppressed             ] [node-236] path: /_xpack/monitoring/_bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
 [2019-04-14T10:45:10,113][INFO ][o.e.c.s.ClusterApplierService] [node-236] detected_master {node-237}{ynuGMeSVRYKsa_-95s0YmA}{o1aAN6vDT1akB5HKXyhZrQ}{0.0.0.237}{0.0.0.237:9300}{ml.machine_memory=8589328384, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {node-237}{ynuGMeSVRYKsa_-95s0YmA}{o1aAN6vDT1akB5HKXyhZrQ}{0.0.0.237}{0.0.0.237:9300}{ml.machine_memory=8589328384, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [67331]])
 [2019-04-14T10:46:23,710][WARN ][o.e.m.j.JvmGcMonitorService] [node-236] [gc][old][86350][23] duration [22.9s], collections [1]/[23.7s], total [22.9s]/[36.7s], memory [2.7gb]->[1.5gb]/[3.9gb], all_pools {[young] [90.2mb]->[12.4mb]/[266.2mb]}{[survivor] [25.2mb]->[0b]/[33.2mb]}{[old] [2.6gb]->[1.5gb]/[3.6gb]}
 [2019-04-14T10:46:23,710][WARN ][o.e.m.j.JvmGcMonitorService] [node-236] [gc][86350] overhead, spent [23s] collecting in the last [23.7s]

3- Case 3:

Kibana red: node-237 on 4-14-2019 at 01:47, 01:48, 01:49

 [2019-04-14T01:47:35,887][INFO ][o.e.m.j.JvmGcMonitorService] [node-237] [gc][5306110] overhead, spent [388ms] collecting in the last [1.2s]
 [2019-04-14T01:47:54,451][INFO ][o.e.m.j.JvmGcMonitorService] [node-237] [gc][5306128] overhead, spent [315ms] collecting in the last [1.1s]
 [2019-04-14T01:48:27,181][WARN ][o.e.t.TransportService   ] [node-237] Received response for a request that has timed out, sent [30842ms] ago, timed out [705ms] ago, action [internal:discovery/zen/fd/ping], node [{node-235}{s2_LzCq1TxCWple_fIx9yg}{-q2P7Z8oQcW6d3uiOJaCtA}{0.0.0.235}{0.0.0.235:9300}{ml.machine_memory=8589328384, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [889561226]
 [2019-04-14T01:48:28,394][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-237] failed to execute on node [s2_LzCq1TxCWple_fIx9yg]org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-235][0.0.0.235:9300][cluster:monitor/nodes/stats[n]] request_id [889579252] timed out after [15166ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1038) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_152]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_152]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_152]
 [2019-04-14T01:48:29,257][WARN ][o.e.t.TransportService   ] [node-237] Received response for a request that has timed out, sent [15981ms] ago, timed out [815ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node-235}{s2_LzCq1TxCWple_fIx9yg}{-q2P7Z8oQcW6d3uiOJaCtA}{0.0.0.235}{0.0.0.235:9300}{ml.machine_memory=8589328384, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [889579252]
 [2019-04-14T01:48:31,084][WARN ][o.e.m.j.JvmGcMonitorService] [node-237] [gc][young][5306164][2494235] duration [1.2s], collections [1]/[1.3s], total [1.2s]/[19.8h], memory [3gb]->[2.8gb]/[3.9gb], all_pools {[young] [255.1mb]->[66.5mb]/[266.2mb]}{[survivor] [29.9mb]->[27.8mb]/[33.2mb]}{[old] [2.7gb]->[2.7gb]/[3.6gb]}
 [2019-04-14T01:48:36,126][WARN ][o.e.m.j.JvmGcMonitorService] [node-237] [gc][5306169] overhead, spent [519ms] collecting in the last [1s]
 [2019-04-14T01:49:59,288][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-237] collector [cluster_stats] timed out when collecting data

Any advice would be much appreciated.


(Mark Walkom) #2

Looks like you have nodes dropping out due to excessive GC.

What version are you on?
How many shards and indices?
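
One way to check the GC theory against the logs is to scan the `JvmGcMonitorService` lines for long pauses. The following is only a minimal sketch: the regular expression is written against the line format shown in the post above, the function name `long_gc_pauses` and the 5 second threshold are hypothetical choices, and the two sample lines are abridged copies of the ones already posted.

```python
import re

# Pattern for JvmGcMonitorService lines that carry a "duration [..]" field,
# based on the log format shown earlier in this thread (Elasticsearch 6.5.4).
GC_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\[\w+\s*\]\[o\.e\.m\.j\.JvmGcMonitorService\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*\[gc\]\[(?P<gen>young|old)\]"
    r".*?duration \[(?P<value>[\d.]+)(?P<unit>ms|s|m)\]"
)

UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def long_gc_pauses(lines, threshold_s=5.0):
    """Return (timestamp, node, generation, seconds) for each GC pause
    at or above threshold_s. The threshold is an arbitrary assumption."""
    pauses = []
    for line in lines:
        match = GC_LINE.search(line)
        if match is None:
            continue  # not a GC duration line
        seconds = float(match["value"]) * UNIT_SECONDS[match["unit"]]
        if seconds >= threshold_s:
            pauses.append((match["ts"], match["node"], match["gen"], seconds))
    return pauses


# Abridged copies of two lines from the original post.
sample = [
    " [2019-04-14T10:46:23,710][WARN ][o.e.m.j.JvmGcMonitorService] [node-236] "
    "[gc][old][86350][23] duration [22.9s], collections [1]/[23.7s]",
    " [2019-04-14T01:48:31,084][WARN ][o.e.m.j.JvmGcMonitorService] [node-237] "
    "[gc][young][5306164][2494235] duration [1.2s], collections [1]/[1.3s]",
]

for ts, node, gen, seconds in long_gc_pauses(sample):
    print(f"{ts} {node} {gen} GC pause of {seconds}s")
```

With the sample lines, only the 22.9s old-generation collection on node-236 is reported. A node that is paused for that long cannot answer pings or monitoring requests during the pause, which would fit the timeouts in the logs.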


(sahere rahimi) #3

Many thanks for your reply.

The ELK version is 6.5. In total there are 161 indices, 729 primary shards and 729 replica shards, holding 163306395 documents; the size of the data is 191.6 GB. The heap memory assigned to each server (there are three servers) is 8 GB. The number of primary shards is configured as 5 and the number of replica shards as 1.
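
For reference, these figures can be turned into per-node and per-shard numbers. This is a rough sketch only: the helper `shard_summary` is hypothetical, it assumes shards are spread evenly across the three nodes, and since the post does not say whether 191.6 GB includes replicas, both readings are computed. The heap value is passed in as a parameter because the GC log lines earlier in the thread report a heap of about 3.9gb rather than 8 GB.

```python
def shard_summary(nodes, primary_shards, replica_shards, documents,
                  data_gb, heap_gb_per_node):
    """Derive per-node and per-shard figures from cluster totals.
    Assumes an even distribution of shards across nodes."""
    total_shards = primary_shards + replica_shards
    shards_per_node = total_shards / nodes
    return {
        "total_shards": total_shards,
        "shards_per_node": shards_per_node,
        "shards_per_gb_heap": shards_per_node / heap_gb_per_node,
        "docs_per_primary_shard": documents / primary_shards,
        # The post does not state which of these two readings applies.
        "avg_shard_mb_if_size_includes_replicas": data_gb * 1024 / total_shards,
        "avg_shard_mb_if_size_is_primaries_only": data_gb * 1024 / primary_shards,
    }


summary = shard_summary(
    nodes=3,
    primary_shards=729,
    replica_shards=729,
    documents=163_306_395,
    data_gb=191.6,
    heap_gb_per_node=8,  # as stated in the post; the GC logs suggest ~3.9
)

for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

That gives 1458 shards in total, or 486 per node, with an average shard size in the low hundreds of MB under either reading. Elastic's rule of thumb at the time was roughly 20 shards or fewer per GB of heap, so this is a large number of small shards for nodes of this size.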