Hi,
We have a custom ELK stack deployed with Helm charts on IKS (IBM Cloud Kubernetes Service), Elasticsearch version 7.1.1. The setup consists of 3x elasticsearch-master nodes, 2x elasticsearch-data nodes, 1x logstash pod for gathering logs, and 1x kibana pod for viewing them. The big issue is that whenever we redeploy the whole Helm chart, it takes 30+ minutes for the master election process to finish.
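For context, the relevant parts of our Helm values look roughly like this (a simplified sketch, not our exact overrides; key names follow the public elastic/elasticsearch chart, and the 2Gi limit matches the ml.machine_memory=2147483648 visible in the logs below):

clusterName: "elasticsearch"
nodeGroup: "master"          # 3x elasticsearch-master
replicas: 3
roles:
  master: "true"
  ingest: "false"
  data: "false"
resources:
  limits:
    memory: "2Gi"            # matches ml.machine_memory=2147483648 in the logs
---
clusterName: "elasticsearch"
nodeGroup: "data"            # 2x elasticsearch-data
replicas: 2
roles:
  master: "false"
  ingest: "true"
  data: "true"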
We activated TRACE logging on the master nodes:
logger.org.elasticsearch.transport: trace
logger.org.elasticsearch.gateway.MetaStateService: trace
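(Applied via the chart's esConfig value, roughly like the sketch below; the key name is taken from the public elastic/elasticsearch chart and is an assumption about how our override is wired.)

esConfig:
  elasticsearch.yml: |
    logger.org.elasticsearch.transport: trace
    logger.org.elasticsearch.gateway.MetaStateService: trace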
We see a lot of different log entries; below are some that we thought were relevant.
// {"type": "server", "timestamp": "2019-10-18T13:02:13,674+0000", "level": "TRACE", "component": "o.e.t.T.tracer", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "message": "[51][internal:cluster/coordination/join] received response from [{elasticsearch-master-1}{mOXnX6QwQDOeRDqsf4b3dg}{-k6TYqt-Sz2Vancs3JucbA}{172.30.188.87}{172.30.188.87:9300}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20}]" }
{"type": "server", "timestamp": "2019-10-18T13:02:13,675+0000", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "message": "failed to join {elasticsearch-master-1}{mOXnX6QwQDOeRDqsf4b3dg}{-k6TYqt-Sz2Vancs3JucbA}{172.30.188.87}{172.30.188.87:9300}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20} with JoinRequest{sourceNode={elasticsearch-master-1}{mOXnX6QwQDOeRDqsf4b3dg}{-k6TYqt-Sz2Vancs3JucbA}{172.30.188.87}{172.30.188.87:9300}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=814, lastAcceptedTerm=809, lastAcceptedVersion=513638, sourceNode={elasticsearch-master-1}{mOXnX6QwQDOeRDqsf4b3dg}{-k6TYqt-Sz2Vancs3JucbA}{172.30.188.87}{172.30.188.87:9300}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20}, targetNode={elasticsearch-master-1}{mOXnX6QwQDOeRDqsf4b3dg}{-k6TYqt-Sz2Vancs3JucbA}{172.30.188.87}{172.30.188.87:9300}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20}}]}" ,
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [elasticsearch-master-1][172.30.188.87:9300][internal:cluster/coordination/join]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: received a newer join from {elasticsearch-master-1}{mOXnX6QwQDOeRDqsf4b3dg}{-k6TYqt-Sz2Vancs3JucbA}{172.30.188.87}{172.30.188.87:9300}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20}",
"at org.elasticsearch.cluster.coordination.JoinHelper$CandidateJoinAccumulator.handleJoinRequest(JoinHelper.java:451) [elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:512) [elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:478) [elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$0(JoinHelper.java:124) [elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) [x-pack-security-7.1.1.jar:7.1.1]",
//
//"stacktrace": ["org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 817 while handling publication",
"at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1013) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]"] }
{"type": "server", "timestamp": "2019-10-18T13:02:17,235+0000", "level": "TRACE", "component": "o.e.t.TransportLogger", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "message": "Netty4TcpChannel{localAddress=/172.30.188.87:48396, remoteAddress=elasticsearch-master-headless/172.30.66.112:9300} [length: 58, request id: 94, type: request, version: 6.8.0, action: internal:tcp/handshake] WRITE: 58B" }
{"type": "server", "timestamp": "2019-10-18T13:02:17,235+0000", "level": "TRACE", "component": "o.e.t.n.ESLoggingHandler", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "message": "[id: 0xe6aed23d, L:/172.30.188.87:48396 - R:elasticsearch-master-headless/172.30.66.112:9300] WRITE: 58B\n +-------------------------------------------------+\n | 0 1 2 3 4 5 6 7 8 9 a b c d e f |\n+--------+-------------------------------------------------+----------------+\n|00000000| 45 53 00 00 00 34 00 00 00 00 00 00 00 5e 08 00 |ES...4.......^..|\n|00000010| 5c c6 63 00 00 01 06 78 2d 70 61 63 6b 16 69 6e |\.c....x-pack.in|\n|00000020| 74 65 72 6e 61 6c 3a 74 63 70 2f 68 61 6e 64 73 |ternal:tcp/hands|\n|00000030| 68 61 6b 65 00 04 97 ef ab 03 |hake...... |\n+--------+-------------------------------------------------+----------------+" }
{"type": "server", "timestamp": "2019-10-18T13:02:17,236+0000", "level": "TRACE", "component": "o.e.t.n.ESLoggingHandler", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "message": "[id: 0xe6aed23d, L:/172.30.188.87:48396 - R:elasticsearch-master-headless/172.30.66.112:9300] FLUSH" }
{"type": "server", "timestamp": "2019-10-18T13:02:17,244+0000", "level": "TRACE", "component": "o.e.t.n.ESLoggingHandler", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "message": "[id: 0xe6aed23d, L:/172.30.188.87:48396 - R:elasticsearch-master-headless/172.30.66.112:9300] ACTIVE" }
{"type": "server", "timestamp": "2019-10-18T13:02:17,211+0000", "level": "TRACE", "component": "o.e.t.T.tracer", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "message": "[73][internal:cluster/coordination/join] sent error response" ,
"stacktrace": ["org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 817 does not match current term 818",
//
We can upload the full logs; please point us to a secure way to do that, preferably not public.
Thanks in advance,
Gergely Zoltan