ES5 RestClient retry behavior with multiple nodes: Fix suggestion

Hi,

I have been testing the behavior of the RestClient (ES 5) when a node that is part of a two node cluster is not running. I have configured the RestClient with two nodes:

localhost:9200
localhost:9321

When I construct the RestClient the node 9200 is not running:

restClient = RestClient.builder(hosts)
.setFailureListener(failureListener)
.build();

I then run:

restClient.performRequest("DELETE", "/my_index", new Hashtable<>());

The following exception is thrown even though the index is deleted:

java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at org.elasticsearch.client.RestClient.addSuppressedException(RestClient.java:470)
at org.elasticsearch.client.RestClient.access$700(RestClient.java:84)
at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.trackFailure(RestClient.java:567)
at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:559)
at org.elasticsearch.client.RestClient$1.retryIfPossible(RestClient.java:352)
at org.elasticsearch.client.RestClient$1.failed(RestClient.java:331)
at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:134)
at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:179)
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
at java.lang.Thread.run(Thread.java:745)
Suppressed: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:171)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:145)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
... 1 more
[CIRCULAR REFERENCE:java.net.ConnectException: Connection refused]

I traced through the code to see at what point the index was deleted and as far as I can tell the sequence is as follows:

  1. Try deleting the index on 9200.
  2. A call to public void failed(Exception failure) indicates that 9200 failed.
  3. The request is retried on 9321.
  4. A call to public void failed(Exception failure) indicates that 9321 failed.

Before the retry the index still exists, but after the retry it does not.

curl -I 'http://localhost:9321/my_index'
HTTP/1.1 200 OK
content-type: text/plain; charset=UTF-8
content-length: 0

curl -I 'http://localhost:9321/my_index'
HTTP/1.1 404 Not Found
content-type: text/plain; charset=UTF-8
content-length: 0

Even though the index was successfully deleted the method is still returning an exception.

Am I using the rest client incorrectly?

Cheers,

Stuart

Hi,

Not sure if it makes a difference but my failure difference is defined as follows:

/**
 * HttpHost failure listener for the ElasticSearch client API
 */
private class MyRestClientFailureListener extends RestClient.FailureListener
{
    /**
     * Notifies that the host provided as argument has just failed
     */
    public void onFailure(HttpHost host)
    {
        //log failure
    }
}

Has anyone tried the ES5 RestClient against multiple nodes with the 'first' one shutdown? Does it work for you?

OK, so I have hacked around with the RestClient code and by adding one line it seems to behave as I would expect in an failover situation:

        private void retryIfPossible(Exception exception, Iterator<HttpHost> hosts, HttpRequestBase request) {
            if (hosts.hasNext()) {
                //in case we are retrying, check whether maxRetryTimeout has been reached
                long timeElapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
                long timeout = maxRetryTimeoutMillis - timeElapsedMillis;
                if (timeout <= 0) {
                    IOException retryTimeoutException = new IOException(
                            "request retries exceeded max retry timeout [" + maxRetryTimeoutMillis + "]");
                    listener.onDefinitiveFailure(retryTimeoutException);
                } else {
                    listener.trackFailure(exception);
                    request.reset();
                    HttpAsyncResponseConsumer<HttpResponse> newResponseConsumer = new HeapBufferedAsyncResponseConsumer();
                    performRequestAsync(startTime, hosts, request, newResponseConsumer, listener);
                }
            } else {
                listener.onDefinitiveFailure(exception);
            }
        }

I added the line creating a newResponseConsumer just before calling performRequestAsync. Is this a valid fix? My theory was that passing the old response consumer a second time did not work because it 'remembered' that an exception had already been thrown.

If this is a valid fix what is the best way to get this released?

Cheers,

Stuart

Hi Stuart,
I can confirm this is a bug.

Your potential fix showed where the problem is, but unfortunately it is not enough as we allow users to provide their own response consumer, which as you found out cannot be reused across multiple retries.

Working on a fix, also will need to find out why our tests didn't catch this... that is quite disappointing. Will fix all of this asap.

Hi Luca,

Thanks very much for confirming that this is a bug and that you are working on a fix (My backup plan was to write my own HA client and I was not looking forward to doing this). I will keep an eye out for when you release this.

Again thanks for letting me know as I can now concentrate on other things.

Cheers,

Stuart

A fix for this bug has just been merged, unfortunately it is too late for 5.0.1 but it will come with 5.0.2 and 5.1 .

Cheers
Luca

Fantastic response time, thank you.

Hi,
I have good news, the fix was released today with 5.0.1 :wink:

:thumbsup: :slight_smile: