Master node in ECK with different IP between pod and Elasticsearch

Hi,

Today we faced a strange situation and would like to share it with you to get more information about what might have happened.

Context:

We have an Elasticsearch cluster and we need to send slow logs to Datadog, so we need to inject an annotation in the Elasticsearch Operator.

When we do this, the operator restarts all nodes (coord, data and master) one by one. When it came to the masters (we have 3), the node restarted successfully, but we ran into problems that can be seen in the logs:

{"@timestamp":"2023-10-03T18:42:30.163Z", "log.level": "WARN", "message":"master not discovered or elected yet, an election requires at least 2 nodes with ids from [oErFvjDqRCO_59MFI_Dqyg, rZ9wOUTcROaUQq_by1Ve0w, yS4jous1TLWwqMmSRnxKXg], have discovered possible quorum [{es-cm-entrylevel-usc1-prd-01-es-masters-2}{rZ9wOUTcROaUQq_by1Ve0w}{CDy3aDLpSAaJ29swfjJoZQ}{es-cm-entrylevel-usc1-prd-01-es-masters-2}{10.245.141.11}{10.245.141.11:9300}{m}, {es-cm-entrylevel-usc1-prd-01-es-masters-0}{oErFvjDqRCO_59MFI_Dqyg}{jlgRd5XZTYqlvFNw7ChtRg}{es-cm-entrylevel-usc1-prd-01-es-masters-0}{10.245.129.22}{10.245.129.22:9300}{m}, {es-cm-entrylevel-usc1-prd-01-es-masters-1}{yS4jous1TLWwqMmSRnxKXg}{4REJp-4YROGE-U8ogprkeA}{es-cm-entrylevel-usc1-prd-01-es-masters-1}{10.245.130.16}{10.245.130.16:9300}{m}]; discovery will continue using [10.245.129.22:9300, 10.245.130.16:9300] from hosts providers and [{es-cm-entrylevel-usc1-prd-01-es-masters-2}{rZ9wOUTcROaUQq_by1Ve0w}{CDy3aDLpSAaJ29swfjJoZQ}{es-cm-entrylevel-usc1-prd-01-es-masters-2}{10.245.141.11}{10.245.141.11:9300}{m}] from last-known cluster state; node term 9, last-accepted version 355345 in term 7; joining [{es-cm-entrylevel-usc1-prd-01-es-masters-1}{yS4jous1TLWwqMmSRnxKXg}{4REJp-4YROGE-U8ogprkeA}{es-cm-entrylevel-usc1-prd-01-es-masters-1}{10.245.130.16}{10.245.130.16:9300}{m}] in term [9] has status [waiting for response] after [8s/8004ms]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es-cm-entrylevel-usc1-prd-01-es-masters-2][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"es-cm-entrylevel-usc1-prd-01-es-masters-2","elasticsearch.cluster.name":"es-cm-entrylevel-usc1-prd-01"}
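
For anyone reading the warning above, a small sketch like the following can pull the node names and IP addresses out of the "message" field so they are easier to compare with the pod IPs. The field labels in the comment are my reading of the log line, not an official format description, and the function name discovered_nodes is just a name chosen for this example.

```python
import re

# Node descriptor layout as it appears in the warning above (assumed reading):
# {node name}{node id}{ephemeral id}{host name}{host address}{transport address}{roles}
NODE_RE = re.compile(
    r"\{(?P<name>[^{}]+)\}\{(?P<node_id>[^{}]+)\}\{[^{}]+\}\{[^{}]+\}"
    r"\{(?P<ip>[^{}]+)\}\{(?P<transport>[^{}]+)\}\{(?P<roles>[^{}]*)\}"
)


def discovered_nodes(message: str) -> dict[str, str]:
    """Map node name -> IP for every node descriptor found in the message."""
    return {m["name"]: m["ip"] for m in NODE_RE.finditer(message)}


if __name__ == "__main__":
    # Shortened excerpt of the "message" field from the log line above.
    excerpt = (
        "have discovered possible quorum ["
        "{es-cm-entrylevel-usc1-prd-01-es-masters-2}{rZ9wOUTcROaUQq_by1Ve0w}"
        "{CDy3aDLpSAaJ29swfjJoZQ}{es-cm-entrylevel-usc1-prd-01-es-masters-2}"
        "{10.245.141.11}{10.245.141.11:9300}{m}]"
    )
    for name, ip in discovered_nodes(excerpt).items():
        print(name, ip)
```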

What we observed when this error occurred:

  1. When the first master pod was restarted, it was assigned a new IP address.

  2. In the Elasticsearch endpoint (_cat/nodes), this node was still listed with its old IP address.
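
The comparison described in the two points above can be sketched roughly as below. It assumes the pod IPs were collected separately (for example from each pod's status.podIP) into a dict, and that the Elasticsearch side comes from _cat/nodes requested with format=json and the name and ip columns. The function name find_stale_ips and the sample names and addresses are made up for illustration; they are not values from our cluster.

```python
def find_stale_ips(
    pod_ips: dict[str, str],
    cat_nodes: list[dict[str, str]],
) -> dict[str, tuple[str, str]]:
    """Return {node name: (pod IP, IP reported by Elasticsearch)} for each mismatch.

    pod_ips:   pod name -> current pod IP, as known to Kubernetes.
    cat_nodes: rows of _cat/nodes in JSON form, each with "name" and "ip" keys.
    """
    stale = {}
    for row in cat_nodes:
        name, es_ip = row["name"], row["ip"]
        pod_ip = pod_ips.get(name)
        # Nodes without a matching pod entry are skipped rather than flagged.
        if pod_ip is not None and pod_ip != es_ip:
            stale[name] = (pod_ip, es_ip)
    return stale


if __name__ == "__main__":
    # Hypothetical sample data, only to show the shape of the inputs.
    pods = {"masters-0": "10.0.0.10", "masters-1": "10.0.0.11"}
    nodes = [
        {"name": "masters-0", "ip": "10.0.0.10"},
        {"name": "masters-1", "ip": "10.0.0.99"},  # Elasticsearch still shows an old IP
    ]
    print(find_stale_ips(pods, nodes))
```

Any node returned by this check is one where Elasticsearch and Kubernetes disagree about the address, which is the symptom we saw.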

When this happened, the reconciliation did not finish, and we had to work around it by deleting the master nodes manually, one by one.

Has anyone already faced a situation like this?

PS: One additional piece of information: the master nodes were more than 80 days old. Could this be a problem?

Operator and CRD version: 2.6.0
Elasticsearch Version: 8.6.0
