We have deployed Elasticsearch 5.6.3 on an AWS Auto Scaling Group spread across Availability Zones. The group sits behind a load balancer, and the load balancer's URL is what we use as the ES host.
We are using the new high-level Elastic client, set up as follows:
import org.apache.http.HttpHost
import org.elasticsearch.client.RestClient
import org.elasticsearch.client.sniff.{SniffOnFailureListener, Sniffer}

// Failure listener that triggers a sniff round whenever a request fails
val sniffOnFailureListener = new SniffOnFailureListener
// Low-level REST client pointed at the load balancer on port 9200
val lowLevelRestClient = RestClient
  .builder(new HttpHost(host, esPort, "http")) // host is the load balancer URL, esPort is 9200
  .setFailureListener(sniffOnFailureListener)
  .build()
// Sniffer that rediscovers the cluster's nodes, retrying 10 s after a failure
val sniffer = Sniffer.builder(lowLevelRestClient)
  .setSniffAfterFailureDelayMillis(10000)
  .build()
sniffOnFailureListener.setSniffer(sniffer)
// Our application's wrapper around the low-level client
val elasticClient = ElasticClient.apply(lowLevelRestClient, sniffer, index, indexType)
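For reference, we leave the request-level timeouts at their defaults; the 30000 ms in the stack trace further down appears to correspond to the client's default max retry timeout. If it helps, this is roughly how we could make those timeouts explicit on the low-level client (the restClientWithTimeouts name and the concrete values are only an illustration, not what we run today):

import org.apache.http.HttpHost
import org.apache.http.client.config.RequestConfig
import org.elasticsearch.client.{RestClient, RestClientBuilder}

// Sketch only: placeholder timeout values, not our production settings
val restClientWithTimeouts = RestClient
  .builder(new HttpHost(host, esPort, "http"))
  .setRequestConfigCallback(new RestClientBuilder.RequestConfigCallback {
    override def customizeRequestConfig(builder: RequestConfig.Builder): RequestConfig.Builder =
      builder
        .setConnectTimeout(5000)           // TCP connect timeout (ms)
        .setSocketTimeout(60000)           // per-request socket timeout (ms)
        .setConnectionRequestTimeout(1000) // time allowed to lease a connection from the pool (ms)
  })
  .setMaxRetryTimeoutMillis(60000)         // raises the 30 s "listener timeout" seen in the trace
  .setFailureListener(sniffOnFailureListener)
  .build()

We have not actually applied this; we include it only to show which knobs we believe are involved.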
The ES cluster is queried by a fleet of services deployed as Kubernetes pods.
Everything had been working perfectly. Recently, however, we started seeing intermittent errors: a few of the pods (service instances) hit sniffing errors, and as a result some queries show very high latency, on the order of seconds.
Here is the stack trace of the error from a service pod. There are no error logs on the Elasticsearch nodes.
Nov 13, 2017 11:16:37 PM org.elasticsearch.client.sniff.Sniffer sniff
SEVERE: error while sniffing nodes
java.io.IOException: listener timeout after waiting for [30000] ms
at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:660)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:219)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:191)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:170)
at org.elasticsearch.client.sniff.ElasticsearchHostsSniffer.sniffHosts(ElasticsearchHostsSniffer.java:93)
at org.elasticsearch.client.sniff.Sniffer$Task.sniff(Sniffer.java:113)
at org.elasticsearch.client.sniff.Sniffer$Task.sniffOnFailure(Sniffer.java:107)
at org.elasticsearch.client.sniff.Sniffer.sniffOnFailure(Sniffer.java:59)
at org.elasticsearch.client.sniff.SniffOnFailureListener.onFailure(SniffOnFailureListener.java:62)
at org.elasticsearch.client.RestClient.onFailure(RestClient.java:491)
at org.elasticsearch.client.RestClient.access$400(RestClient.java:89)
at org.elasticsearch.client.RestClient$1.failed(RestClient.java:374)
at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:134)
at org.apache.http.impl.nio.client.AbstractClientExchangeHandler.failed(AbstractClientExchangeHandler.java:419)
at org.apache.http.impl.nio.client.AbstractClientExchangeHandler.connectionRequestFailed(AbstractClientExchangeHandler.java:335)
at org.apache.http.impl.nio.client.AbstractClientExchangeHandler.access$100(AbstractClientExchangeHandler.java:62)
at org.apache.http.impl.nio.client.AbstractClientExchangeHandler$1.failed(AbstractClientExchangeHandler.java:378)
at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:134)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager$InternalPoolEntryCallback.failed(PoolingNHttpClientConnectionManager.java:504)
at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:134)
at org.apache.http.nio.pool.RouteSpecificPool.timeout(RouteSpecificPool.java:168)
at org.apache.http.nio.pool.AbstractNIOConnPool.requestTimeout(AbstractNIOConnPool.java:561)
at org.apache.http.nio.pool.AbstractNIOConnPool$InternalSessionRequestCallback.timeout(AbstractNIOConnPool.java:822)
at org.apache.http.impl.nio.reactor.SessionRequestImpl.timeout(SessionRequestImpl.java:183)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processTimeouts(DefaultConnectingIOReactor.java:210)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:155)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
at java.lang.Thread.run(Thread.java:745)
We have not been able to identify the root cause, and we are surprised that only a few service instances out of many hit the issue, and only for a period of about 5 minutes; it then resolves on its own after a short time.
Any help in this direction is highly appreciated. Our aim is to provide near-consistent latency across all queries.