Elasticsearch cluster with AWS ELB randomly disconnects from applications

We have an Elasticsearch 5.5.3 cluster behind an ELB that load balances traffic to the cluster. Every so often, our search microservices, which connect to the cluster via the ELB FQDN (fully qualified domain name), throw this exception:

org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [{#transport#-1}{10.6.6.122}{10.6.6.122:9300}]
at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:326)

The issue is even worse for a batch application that connects to the ELB to run bulkIndex. The job runs for 5 hours or so and then fails with:

2018-05-03 21:32:49.726 INFO 5529 --- [elasticsearch[client][generic][T#1]] o.e.c.t.TransportClientNodesService : failed to get node info for {#transport#-1}{urmSiRXiSY6e3ahRzbgjuw}{internal-staging-es55-elb-1936370144.us-east-1.elb.amazonaws.com}{10.4.6.235:9300}, disconnecting...

org.elasticsearch.transport.ReceiveTimeoutTransportException: [][10.4.6.235:9300][cluster:monitor/nodes/liveness] request_id [17756] timed out after [8248ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:951)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

We put a monitor on the machine that runs nslookup against the ELB every minute and logs the result, but we see the issue even when the ELB IP has not changed.
We opened a support ticket with AWS, and they recommended switching to an NLB, which has a static IP that never changes. That did not help either, and we continue to see this exception.

Restarting the application resolves the issue. If we connect our batch application directly to one of the nodes by giving it the node's IP address, we do not see any issues.

We tried adding a coordinating node to the cluster and pointing our batch application at it. Indexing worked but was much slower than when we pointed at the ELB or at a cluster node directly. Another reason we are hesitant to add a coordinating node is that it gives us a single point of failure: if the coordinating node goes down, all our search microservices and our batch application go down with it.

What is the recommended way to connect to an Elasticsearch cluster from various applications?

Please let me know if you need additional details on the issue.

We don't recommend putting a load balancer in front of the transport port like that. It causes many issues, including the one you are seeing.

Either let the transport client connect directly to the cluster nodes (the transport client is itself a native load balancer), or put the AWS ELB in front of the HTTP port and connect to the cluster using the Java High Level REST Client. The latter is preferred since the transport client is deprecated and will be removed in a future release.
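
To illustrate what "native load balancer" means here: the transport client spreads requests across the nodes it is connected to in round-robin fashion, so an extra load balancer in front of port 9300 adds nothing. The sketch below is a simplified illustration of that idea in plain Java, not the client's actual implementation. The class name `RoundRobinNodes` and the node addresses are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of round-robin node selection. */
final class RoundRobinNodes {
    private final List<String> nodes;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinNodes(List<String> nodes) {
        if (nodes == null || nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        // Defensive copy so later changes to the caller's list have no effect.
        this.nodes = new ArrayList<>(nodes);
    }

    /** Returns the next node in rotation; safe to call from several threads. */
    String next() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), nodes.size());
        return nodes.get(index);
    }
}
```

With three placeholder nodes such as `10.0.0.1:9300`, `10.0.0.2:9300`, and `10.0.0.3:9300`, successive calls to `next()` cycle through them in order and then wrap around to the first.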

Thanks for your response.

As I understand it, since we are still on ES 5, we cannot use the Java High Level REST Client, but we will look into when we can upgrade.

We can definitely let the transport client connect directly to the cluster nodes. I have a few questions about that, though:

  • Should we set client.transport.sniff to true? Are there any scenarios where setting this to true can have performance impacts?
  • Should we list all our nodes in the properties file, or is listing the IP address of one node enough? We have only 3 nodes in our staging/production environments, but how should we handle nodes being added or accidentally going down without manually updating the properties file every time? What are Elastic's recommendations here?
  • If the transport client acts like a load balancer, why does Elasticsearch have a dedicated coordinating node that load balances traffic? What are good use cases for a coordinating node?

Thanks,
Mahrukh.

There is no known performance impact from client.transport.sniff. You can set it to true as long as all node addresses are reachable by the client.

The client needs a known set of IPs to connect to the cluster. Generally speaking, the recommendation is to list the addresses of the dedicated master nodes, if you have them.

The Transport Client is essentially a coordinating-only node embedded in an application. Nowadays, it's preferred to have isolated coordinating-only nodes, and the Transport Client is being deprecated. The advantage of a coordinating-only node is that it offloads from the data nodes the reduce phase of aggregation queries, which can use a lot of memory and CPU. Of course it adds network latency, since it's one more hop in the network, but that is generally compensated for by the extra CPU and memory the node adds to the cluster. It's not necessary, though, to index through a coordinating-only node (unless it's an Ingest Node and you are using Ingest features); you can just send data directly to the data nodes.

Since our cluster is small, we do not have dedicated master nodes. Should we provide the complete list of nodes in our properties file? What is the recommendation when you don't have dedicated master nodes?

There is no recommendation in this case. The client just needs a known set of IPs to connect to. It can be either all nodes or just a single IP (if you are using sniffing, the client will figure out the rest).
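
As a rough sketch of the "known set of IPs" approach, the helper below turns a comma-separated node list, as it might be read from a properties file, into host/port pairs. It uses only the Java standard library. The class name `TransportNodeList`, the input format, and the example addresses are assumptions made for illustration; 9300 is the transport port seen in the logs earlier in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses "10.0.0.1:9300,10.0.0.2" into host/port pairs. */
final class TransportNodeList {
    static final int DEFAULT_TRANSPORT_PORT = 9300;

    static final class Node {
        final String host;
        final int port;

        Node(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    /**
     * Parses a comma-separated list of "host" or "host:port" entries.
     * Handles hostnames and IPv4 addresses only (no IPv6 literals).
     */
    static List<Node> parse(String csv) {
        List<Node> nodes = new ArrayList<>();
        if (csv != null) {
            for (String entry : csv.split(",")) {
                String trimmed = entry.trim();
                if (trimmed.isEmpty()) {
                    continue; // tolerate stray commas and blanks
                }
                int colon = trimmed.lastIndexOf(':');
                if (colon < 0) {
                    // No port given: fall back to the transport port.
                    nodes.add(new Node(trimmed, DEFAULT_TRANSPORT_PORT));
                } else {
                    int port = Integer.parseInt(trimmed.substring(colon + 1));
                    nodes.add(new Node(trimmed.substring(0, colon), port));
                }
            }
        }
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("node list must contain at least one entry");
        }
        return nodes;
    }
}
```

Each parsed entry would then be handed to the transport client, for example through its `addTransportAddress` method. With sniffing enabled, a single reachable entry is enough for the client to discover the remaining nodes.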

Hi Thiago,

I made all the updates we discussed above, but today we saw this error while indexing data:

2018-06-12 14:20:19.205 INFO 2014 --- [elasticsearch[client][generic][T#2]] o.e.c.t.TransportClientNodesService : failed to get local cluster state for {#transport#-1}{TDb9sx-2Su-Ei-f60ZX8Bw}{10.x.x.xxx}{10.x.x.xxx:9300}, disconnecting...

org.elasticsearch.transport.ReceiveTimeoutTransportException: [][10.x.x.xxx:9300][cluster:monitor/state] request_id [3590] timed out after [6451ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:951)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The batch job continued without any issues, but I am wondering why we are seeing this error. This is a 3-node cluster, and all three nodes are listed in the properties file as a comma-separated list.

Thanks,
Mahrukh.

That is a connection-level timeout. It means the nodes are under stress and are failing to reply to certain operations within the configured timeout.

You could try increasing the ping timeout, client.transport.ping_timeout, to a value that better suits your use case.
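
A minimal sketch of how such an override could be collected from a properties file, using only the Java standard library. The class name `TransportClientSettings` is hypothetical. The three keys are the transport client settings, and the defaults shown (sniffing off, 5s ping timeout, 5s sampler interval) are the documented ones as far as I know, so verify them against the docs for your version.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper: gathers client.transport.* settings with defaults. */
final class TransportClientSettings {
    private static final String PREFIX = "client.transport.";

    static Map<String, String> fromProperties(Properties props) {
        Map<String, String> settings = new TreeMap<>();
        // Assumed defaults; check the reference docs for your release.
        settings.put("client.transport.sniff", "false");
        settings.put("client.transport.ping_timeout", "5s");
        settings.put("client.transport.nodes_sampler_interval", "5s");

        // Values from the properties file override the defaults above.
        // Keys outside the client.transport. namespace are ignored.
        for (String name : props.stringPropertyNames()) {
            if (name.startsWith(PREFIX)) {
                settings.put(name, props.getProperty(name).trim());
            }
        }
        return settings;
    }
}
```

For example, a properties file containing `client.transport.ping_timeout=20s` (an illustrative value, not a recommendation) would yield a map with that override and the other two defaults. Each entry could then be copied into the `Settings.builder()` used to construct the transport client.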
