Hi,
I tried to upgrade ECK from version 2.7.0. Unfortunately, ECK versions 2.8.0 and up are unable to manage my Elasticsearch cluster. I have upgraded Elasticsearch a couple of times with ECK 2.7.0, and it currently runs version 8.9.2.
If I upgrade ECK to a newer version, one node is removed and a new one is provisioned. Then the process stops completely, and I can see the following message in the ECK logs:
{"log.level":"info","@timestamp":"2023-11-22T15:05:38.659Z","log.logger":"elasticsearch-controller","message":"Elasticsearch cannot be reached yet, re-queuing","service.version":"2.10.0+59c1e727","service.type":"eck","ecs.version":"1.4.0","iteration":"1242","namespace":"elasticsearch","es_name":"es-logging","namespace":"elasticsearch","es_name":"es-logging"}
On the Elasticsearch side, I see a different error:
{"@timestamp":"2023-11-22T15:15:20.320Z", "log.level": "WARN", "message":"caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/10.42.87.108:9200, remoteAddress=/10.42.124.19:50414}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es-logging-es-logging-2][transport_worker][T#1]","log.logger":"org.elasticsearch.http.AbstractHttpServerTransport","elasticsearch.cluster.uuid":"REDACTED","elasticsearch.node.id":"REDACTED","elasticsearch.node.name":"es-logging-es-logging-2","elasticsearch.cluster.name":"es-logging","error.type":"io.netty.handler.codec.DecoderException","error.message":"javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate","error.stack_trace":"io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499)\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)\n\tat 
io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tat io.netty.common@4.1.94.Final/io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.common@4.1.94.Final/io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:1623)\nCaused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate\n\tat java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130)\n\tat java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)\n\tat java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:365)\n\tat java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:287)\n\tat java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:204)\n\tat java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)\n\tat java.base/sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:736)\n\tat java.base/sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:691)\n\tat java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:506)\n\tat 
java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:482)\n\tat java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:679)\n\tat io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:297)\n\tat io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1353)\n\tat io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1246)\n\tat io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1295)\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)\n\t... 16 more\n"}
My cluster definition is the following:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es-logging
  namespace: elasticsearch
spec:
  http:
    service:
      spec:
        type: LoadBalancer
    tls:
      certificate:
        secretName: elastic-cert
  nodeSets:
    - config:
        node.roles:
          - master
          - data
          - ingest
          - ml
          - transform
          - remote_cluster_client
      count: 3
      name: logging
      podTemplate:
        spec:
          initContainers:
            - command:
                - sh
                - '-c'
                - sysctl -w vm.max_map_count=262144
              name: sysctl
              securityContext:
                privileged: true
                runAsUser: 0
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 3Gi
  updateStrategy:
    changeBudget:
      maxSurge: 3
      maxUnavailable: 1
  version: 8.9.2
  volumeClaimDeletePolicy: DeleteOnScaledownOnly
I also created another cluster without a custom certificate (the spec.http.tls.certificate.secretName property removed), and ECK manages it without any trouble.
The certificate in use is a wildcard certificate. I also tried a dedicated certificate, which does not work either, with SANs for the following names:
- elastic.domain.ext
- es-logging-es-http.elasticsearch.es.local
- es-logging-es-http
- es-logging-es-http.elasticsearch
- es-logging-es-http.elasticsearch.svc
- es-logging-es-internal-http.elasticsearch
- es-logging-es-internal-http.elasticsearch.svc
- *.es-logging-es-logging.elasticsearch.svc
- lb.exposed.ip.address
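To rule out a simple SAN mismatch, a small check along these lines may help. It is only a sketch: the `candidates` list and the `*.domain.ext` wildcard are assumptions for illustration (I do not know which hostname ECK actually connects with), and it only handles a wildcard that replaces the whole leftmost label. The IP entry is left out because IP SANs are matched differently from DNS names.

```python
def san_matches(hostname: str, san: str) -> bool:
    """Return True if a DNS SAN entry covers the hostname.

    A leading "*" label stands for exactly one label, so "*.example.org"
    covers "a.example.org" but not "a.b.example.org".
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if len(host_labels) != len(san_labels):
        return False
    if san_labels[0] == "*":
        # The wildcard replaces the leftmost label only; the rest must be equal.
        return host_labels[1:] == san_labels[1:]
    return host_labels == san_labels


def uncovered(hostnames, sans):
    """List the hostnames that no SAN entry covers."""
    return [h for h in hostnames if not any(san_matches(h, s) for s in sans)]


# DNS SAN entries copied from the dedicated certificate listed above.
dedicated_sans = [
    "elastic.domain.ext",
    "es-logging-es-http.elasticsearch.es.local",
    "es-logging-es-http",
    "es-logging-es-http.elasticsearch",
    "es-logging-es-http.elasticsearch.svc",
    "es-logging-es-internal-http.elasticsearch",
    "es-logging-es-internal-http.elasticsearch.svc",
    "*.es-logging-es-logging.elasticsearch.svc",
]

# Hypothetical names an in-cluster client might use (assumption, not confirmed).
candidates = [
    "es-logging-es-internal-http.elasticsearch.svc",
    "es-logging-es-logging-2.es-logging-es-logging.elasticsearch.svc",
    "es-logging-es-http.elasticsearch.svc.cluster.local",
]

# Names not covered by the dedicated certificate.
print(uncovered(candidates, dedicated_sans))
# Names not covered by a wildcard certificate (hypothetical wildcard name).
print(uncovered(candidates, ["*.domain.ext"]))
```

If the name the operator uses turns out to be covered, the SAN list is probably not the problem, and the certificate chain or the CA in the secret would be the next thing to look at.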
Am I missing something? How can I use a custom cert with a newer version of ECK?