We are trying desperately to stop Elasticsearch from seeing our Citrix ADC health monitoring of the https port 9200 backend services as plain-text traffic, and thus avoid log error events like:
received plaintext http traffic on an https channel, closing connection Netty4HttpChannel{localAddress=/10.141.33.64:9200, remoteAddress=/10.94.139.12:42849}
The remote address ending in .12 is our Citrix ADC health monitoring IP.
Have tried various health monitor types: TCP-ECV + Secure:YES, HTTP-ECV + Secure:YES, and https-ecv in various versions. Testing from the ADC shell with curl -kv https://nodeIP:9200/_cluster/health?pretty returns the cluster status JSON fine and does a proper TLS 1.3 handshake, but the monitors cause the plain-text complaints all the time, spamming the cluster logs.
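For reference, the probes were configured roughly like this from the ADC CLI (a sketch; the monitor and service names are hypothetical, and exact options vary by firmware):

```
# Citrix ADC (NetScaler) CLI sketch -- names are hypothetical.
# -secure YES should make the probe complete a TLS handshake before sending.
add lb monitor mon_es_health HTTP-ECV -send "GET /_cluster/health" -recv "cluster_name" -secure YES
bind service svc_es_node1 -monitorName mon_es_health
```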
Despite the Elastic-side log complaints, the monitors all deem the backend services successfully UP; it's only the log being spammed.
I am confused about what is generating the monitoring health checks.
My experience is that many integrations with Elasticsearch default to http, not https.
As I am sure you know, that is exactly what that message is: whatever is doing the monitoring is sending http to an https endpoint,
OR
your LB is doing SSL termination on the health check URL/endpoint... and thus the LB is receiving https, terminating SSL, then forwarding http to the health check endpoint.
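For what it's worth, that mismatch is easy to reproduce by hand from any host that can reach a node; the first request below should trigger the exact "received plaintext http traffic on an https channel" log line, while the second should not:

```
# Plain HTTP against the TLS-enabled port -> ES logs the plaintext complaint.
curl -v http://nodeIP:9200/_cluster/health

# TLS handshake first -> no complaint (-k skips certificate verification).
curl -kv https://nodeIP:9200/_cluster/health
```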
This probably did not help... but perhaps I am missing something.
Technically, it just means that what ES received on the channel was not a valid TLS record; we don't actually check that it's actually an HTTP request.
In any case this still seems like invalid traffic coming from the monitoring system, which nonetheless still finds ES to be healthy. There are no more diagnostics available within ES itself AFAIK; you will need to look at a packet capture to understand exactly what is being sent.
ES is terminating the SSL, so our LB is only doing SSL_BRIDGE for the incoming requests it forwards to the ES endpoints. But to know which ES backends are healthy to forward requests to, our LB needs to do backend monitoring, so we configure a monitor to probe the health state of ES. Yet despite every attempt to get the Citrix ADC to use a proper https probing request, e.g. GET /_cluster/health looking for return code 200 or cluster_name in the response, ES keeps claiming that the probing health requests are sending http data before finishing the TLS handshake, or not attempting a TLS handshake at all, thus seeing http on an https-listening socket.

The health monitors still claim success, e.g. that they do see cluster_name in the responses, so you would believe that ES is responding as expected and the requests must be using proper https/TLS handshakes. So why is ES still claiming to see data sent before the TLS handshake has finished, we wonder.
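To be concrete about the topology, the bridging side looks roughly like this on the ADC (a sketch; names and the vserver IP are hypothetical): the service type is SSL_BRIDGE, so client TLS passes through untouched, and only the bound monitor ever originates traffic from the ADC itself.

```
# Citrix ADC CLI sketch -- SSL_BRIDGE passes TLS through to ES untouched.
add service svc_es_node1 10.141.33.64 SSL_BRIDGE 9200
add lb vserver vs_es SSL_BRIDGE 10.94.139.10 9200
bind lb vserver vs_es svc_es_node1
```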
So it seems that ES is closing the socket upon receiving data before the TLS handshake has finished, so the ADC must be doing more than our configured https-ecv request, since the ADC finds those answered successfully. Maybe the ADC is also doing some simple hidden tcp syn-ack monitoring; or would that result in what we saw on the Fleet services:
http: TLS handshake error from 10.94.139.12:64091: EOF
This stopped on the Fleet services when we altered the monitor probe from tcp syn-ack to https-ecv using "GET /" requests.
How are you determining this? ES is closing the socket because the client doesn't seem to be doing TLS correctly, resulting in a NotSslRecordException being thrown. I wouldn't have expected this to be triggered by sending properly-framed data just at the wrong time; instead I think it indicates the data is not properly framed somehow. Could be wrong, I'm not digging through the TLS implementation to work out the details, but still everything here looks to be a misbehaving client.
Yes, but that does not mean it's receiving data prior to completing the TLS handshake (the claim in your earlier message), it just means that it's seeing non-TLS data on this channel at some point. It might be before the TLS handshake completes, but it might also be after completing a request.
Agreed, it was a statement I most probably picked up from a conversation with ChatGPT.
Appreciate your help. It is just frustrating to debug such issues. We may have to try and dig into low-level utils like nstcpdump/nstrace on the ADC nodes; that'll be a first time for me.
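If we do go down that route, a first capture could look roughly like this (a sketch; the script location and flags vary by firmware version, and the IPs are taken from the log line above):

```
# From the ADC shell: nstcpdump.sh wraps tcpdump and accepts BPF-style filters.
/netscaler/nstcpdump.sh -X host 10.141.33.64 and port 9200
```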
Believe I've been misled: seeing the ADCs' NSIP as the source IP for the plain text made me focus on the ADC health monitoring. It might instead be an issue with Elastic Agents failing to pull local node stats through our configured Fleet policies for Elasticsearch node self-monitoring, as the Elasticsearch Integration (Metrics/Stack Monitoring) -> Settings -> Host is https://localhost:9200 with SSL Configuration: verification_mode: certificate, so such polling ought to come from localhost and not the ADC NSIPs seen via tcpdump on the complaining target, f.ex.:
These aren't ADC health monitoring probes, but look very much like Elastic Agents' elasticsearch integration polling node stats metrics.
ChatGPT says:
Ah, now I see exactly what you're puzzling over. Yes, you are correct: the ADC is just bridging TLS, it does not convert anything to plain HTTP. So anything showing up as plain HTTP hitting ES cannot come from the ADC itself.
What is actually happening is likely this chain:
1. Fleet / Elastic Agent configured to poll ES metrics
• Host: https://localhost:9200
• SSL verification: certificate
2. Elastic Agent cannot verify the SSL certificate for localhost
• Because the ES node certificate is issued for its FQDN, not localhost.
• Verification fails.
3. Fallback behavior
• Fleet/Agent tries the next configured endpoint, often the Fleet global setting pointing to the ADC NSIP (HTTPS).
• But if that somehow misconfigures the URL scheme or the SSL handshake fails, the agent may retry over plain HTTP to avoid complete failure.
4. Result in the ES logs
• You see the ADC NSIP as the source.
• Requests are plain HTTP.
• These requests are not originating from the ADC, but from the agent itself using the ADC as "host" in the URL.
So, the ADC is just an innocent bystander: it never decrypts or converts the request. It's the agent's fallback behavior that triggers plain HTTP requests to ES via the ADC NSIP.
The correct fix is indeed:
• Set verification_mode: none for local node polling so the agent talks HTTPS to localhost without failing cert verification.
• Make sure the agent never needs to "fall back" to the ADC for node stats.
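For illustration only, the quoted suggestion would amount to something like the following in the integration's host/SSL settings. This is a hypothetical sketch of ChatGPT's proposal, not a verified fix; note also that, as far as I know, verification_mode: certificate already skips hostname checks in Beats/Agent, which casts some doubt on step 2 above.

```yaml
# Hypothetical sketch of the suggested integration settings -- not a verified fix.
hosts: ["https://localhost:9200"]
ssl.verification_mode: none   # skips both CA and hostname verification
```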
Might agents make some sort of fallback from the Integration -> Metrics -> Host setting?
Mm, possibly, tho it seems like a pretty serious bug to fall back to http://localhost if we decided https://localhost couldn't be trusted...
Could you share the whole HTTP request? Particularly interested in the User-Agent header; this should help confirm whether it's Elastic Agent or something else.
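One way to grab that on the complaining ES node, assuming tcpdump is available there; since the offending request is plain HTTP, the headers are readable in an ASCII dump:

```
# Print the offending request, headers included (source IP from the log line above).
tcpdump -nn -A -s0 'tcp port 9200 and host 10.94.139.12'
```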
I would like not to believe everything ChatGPT is speculating about.
Our ADCs were just appearing as the source IPs, but that's due to the fact that they just do NAT/forwarding for the real caller; expanding the tcpdump on the active ADC node revealed the real callers' source IPs.
So I believe I have found the culprits now through this.
We have recently migrated a production ES cluster to all-new nodes and retired the previous nodes, which had been running self-monitoring both internally and via Metricbeat. The Metricbeats were still running to monitor the nodes' general system metrics, among other application metrics; only I forgot to disable the elasticsearch + elasticsearch-xpack Metricbeat modules on these previous ES nodes. So it appears that they ended up polling /_nodes/ stats through the Metricbeat output FQDN via our ADC vIP, but apparently not using the default output protocol https, but http. Note that the previous production nodes were also complaining over plain text on their SSL port 9200, but back then it was attributed to the ADC health probes not doing proper https when probing, according to a certain chat service on the net; since then we have upgraded the ADC software from v13.1 to the latest v14.1, which is claimed to be able to do proper https-ecv probes.
Disabled self-monitoring in the production cluster, as we are now using a dedicated monitoring cluster with Elastic Agents and a Fleet policy to poll the ES nodes. This didn't stop the plain-text complaints.
But disabling the previous ES production nodes' Metricbeat elasticsearch modules has stopped the complaints over http plain text from the new ES nodes behind our ADC LB.
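For anyone hitting the same thing: the leftover module config would only have needed an explicit https:// scheme (or to be disabled, as we did). A hypothetical sketch, since the actual file isn't shown here; a bare host:port entry in hosts is polled over plain http:

```yaml
# modules.d/elasticsearch-xpack.yml -- hypothetical sketch, not our actual file.
- module: elasticsearch
  xpack.enabled: true
  period: 10s
  # Explicit https:// scheme; without it the module defaults to plain http.
  hosts: ["https://es-vip.example.com:9200"]
  ssl.verification_mode: certificate
```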
Cool stuff... glad you found it ... those darn leftover bits!
I will change my wording from "exactly" to "commonly indicates http traffic to an https endpoint", especially here on Discuss where folks do this all the time... like all the time.