ECK clusters and client sniffing

It seems difficult to set up an ECK Elasticsearch cluster where client sniffing works. Does anybody have suggestions on the best approach here? Our clusters currently run on AWS EKS across 3 availability zones behind a single load balancer with cross-zone load balancing enabled. We seem to suffer from slow detection when nodes fail, and I don't think there's any retry on connection failure (I can't recall whether any of the clients provide that anyway; we use the Python client, FWIW).
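For what it's worth, the Python client (elasticsearch-py 7.x) does expose retry and sniffing options. A minimal sketch of the relevant settings follows; the endpoint URL is a placeholder, and note that sniffing against an ECK cluster typically returns pod IPs that are only reachable from inside Kubernetes:

```python
# Sketch of elasticsearch-py (7.x) options covering retries and sniffing.
# The hosts entry is a placeholder for your load balancer endpoint.
client_kwargs = {
    "hosts": ["https://my-eck-lb.example.com:9200"],  # placeholder
    "retry_on_timeout": True,          # retry requests that time out
    "max_retries": 3,                  # retry failed requests on other nodes
    "sniff_on_start": True,            # discover nodes at client start-up
    "sniff_on_connection_fail": True,  # re-sniff when a connection fails
    "sniffer_timeout": 60,             # background re-sniff interval (seconds)
}

# Usage (requires the elasticsearch package and a reachable cluster):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(**client_kwargs)
```

The caveat is the same one the ECK issue describes: the sniffed publish addresses are pod IPs, so this only helps for clients running in the same Kubernetes network.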

Hi,

There is an ongoing issue tracking full support for client sniffing with ECK: https://github.com/elastic/cloud-on-k8s/issues/3182

Could you share your Elasticsearch manifest and your client configuration (load balancer used, sniffer_timeout, TLS settings, ...)?

My question is similar to the original poster's: what is the best practice for connecting to Elasticsearch on Kubernetes with REST clients?

  • We are running Elasticsearch version 7.4.2 on Kubernetes (AWS EKS), deployed via the elasticsearch
    Helm chart, version: 8.0.0-SNAPSHOT, sources: https://github.com/elastic/elasticsearch
  • We're using the Java RestHighLevelClient to query ES.
  • Our ES cluster has 3 data nodes and 1 dedicated master in dev, and 3 dedicated masters in prod.

Our current approach is:

  • A dedicated REST client for sniffing; it always sniffs through the load balancer. An example node from the sniff response:
```json
"jyLP4JCyQnuq4BvvnSysSA": {
  "name": "elasticsearch-data-1",
  "transport_address": "10.20.3.140:9300",
  "host": "10.20.3.140",
  "ip": "10.20.3.140",
  "version": "7.4.2",
  "build_flavor": "default",
  "build_type": "docker",
  "build_hash": "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
  "roles": [
    "ingest",
    "data"
  ],
  "attributes": {
    "xpack.installed": "true"
  },
  "http": {
    "bound_address": [
      "0.0.0.0:9200"
    ],
    "publish_address": "10.20.3.140:9200",
    "max_content_length_in_bytes": 104857600
  }
},
```
  • The load balancer is an AWS Classic Load Balancer, and it points only to the data nodes.
  • Use the Sniffer with our search and indexing REST clients
  • Set the search and indexing REST clients' NodeSelector to SKIP_DEDICATED_MASTERS
  • Sniff interval: 4 min
  • Sniff on fail delay: 1 min
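The SKIP_DEDICATED_MASTERS selection can be sketched in Python by filtering a `GET _nodes/http` response (we actually do this in the Java client; the function name and the second, dedicated-master node below are hypothetical illustrations):

```python
# Sketch: keep only nodes that are not dedicated masters, mirroring the
# Java client's NodeSelector.SKIP_DEDICATED_MASTERS behaviour.
def sniffed_http_addresses(nodes_response):
    """Return HTTP publish addresses of nodes whose roles are not master-only."""
    addresses = []
    for node in nodes_response.get("nodes", {}).values():
        if set(node.get("roles", [])) == {"master"}:
            continue  # dedicated master: skip it for search/index traffic
        publish_address = node.get("http", {}).get("publish_address")
        if publish_address:
            addresses.append(publish_address)
    return addresses

# Trimmed sample: the data node from the sniff response above, plus a
# hypothetical dedicated master for illustration.
sample = {
    "nodes": {
        "jyLP4JCyQnuq4BvvnSysSA": {
            "name": "elasticsearch-data-1",
            "roles": ["ingest", "data"],
            "http": {"publish_address": "10.20.3.140:9200"},
        },
        "hypothetical-master-id": {
            "name": "elasticsearch-master-0",
            "roles": ["master"],
            "http": {"publish_address": "10.20.1.10:9200"},
        },
    }
}
# The dedicated master is skipped; only the data node's address remains.
```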

Thus, every 4 minutes the sniffer "sniffs" through the load balancer and sets the search or index client's nodes to the es-data pod IP addresses. If a connection to one of those pod IP addresses fails, it sniffs for new nodes 1 minute later. If the search or index clients have a failing request, they retry it on the other nodes.
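The resulting schedule can be sketched as a tiny pure function (a hypothetical illustration of the interval vs. fail-delay behaviour, not the client's actual internals):

```python
# Sketch: at what elapsed times do sniffs fire, given the outcome of each
# sniffed node list? A successful sniff schedules the next one after the
# normal interval; a connection failure schedules it after the fail delay.
def sniff_schedule(outcomes, interval=240, fail_delay=60):
    """Return elapsed seconds at which each sniff fires.

    outcomes: per-sniff success (True) / connection failure (False).
    The first sniff happens at t=0.
    """
    times, t = [], 0
    for ok in outcomes:
        times.append(t)
        t += interval if ok else fail_delay
    return times

# success, failure, success -> sniffs at 0s, 240s, 300s
```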

This ~should~ keep us covered during rolling deploys or a full cluster outage. However, is there a different way that is considered best practice? Any comments or concerns about the approach we have now?

Any experience the community can share would be greatly appreciated.