Threads stuck waiting for response

Hi,

We are running a ES cluster 7.7 and have a service accessing it through the Java rest client (7.7.0).
The service is a high traffic springboot application deployed in our organization's kubernetes clusters. From time to time we get into ioreactor shutdown states and we are adding healthchecks to recover from that state by reinitializing the client.

The problem we have is that if the ioreactor shutdown happens during high traffic and some of our server threads are in the process of calling ElasticSearch, the ES client gets stuck forever (have had threads stuck overnight for 11 hours once), blocking the server threads. Looking at similar looking issues on your repository I tried adding all kinda of timeouts but didn't see any changes in behavior.

For the record, this is how we initialize/reinitialize the client:

        public void resetClient() {
        if (!connectLock.tryLock()){
            return;
        }
        try {
            if (restClient != null) {
                try {
                    this.close();
                } catch (IOException e) {
                    errorLogger.logResponse("Could not close client properly", 422, e);
                }
            }

            RestClientBuilder lowLevelClientBuilder = RestClient.builder(httpHost)
                    .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                            .setConnectionRequestTimeout(settings.getElasticSearchConnectionRequestTimeout()) //10000
                            .setConnectTimeout(settings.getElasticSearchConnectTimeout()) //10000
                            .setSocketTimeout(settings.getElasticSearchSocketTimeout())) // 30000
                    .setHttpClientConfigCallback(httpClientBuilder -> {
                        PoolingNHttpClientConnectionManager cManager = null;
                        try {
                            cManager = createConnectionManager(settings.getElasticSearchIoThreadCount()); //8
                        } catch (final IOReactorException e) {
                            errorLogger.logResponse("Error initializing ES client connectionManager. Shutting down", HttpStatus.SC_UNPROCESSABLE_ENTITY, e);
                            throw new IllegalStateException(e);
                        }
                        httpClientBuilder
                                .setConnectionManager(cManager)
                                .setKeepAliveStrategy((response, context) -> 60000/* 1minute */)
                                .setMaxConnTotal(settings.getElasticSearchMaxConnTotal())
                                .setMaxConnPerRoute(settings.getElasticSearchMaxConnPerRoute());

                        return httpClientBuilder;
                    });
            this.restClient = new RestHighLevelClient(lowLevelClientBuilder);
        } finally {
            connectLock.unlock();
        }
    }

    private PoolingNHttpClientConnectionManager createConnectionManager(int threadCount) throws IOReactorException {
        // Setup with everything just as the builder would do it
        SSLContext sslcontext = SSLContexts.createSystemDefault();
        PublicSuffixMatcher publicSuffixMatcher = PublicSuffixMatcherLoader.getDefault();
        HostnameVerifier hostnameVerifier = new DefaultHostnameVerifier(publicSuffixMatcher);
        SchemeIOSessionStrategy sslStrategy = new SSLIOSessionStrategy(sslcontext, null, null, hostnameVerifier);

        // Create the custom reactor
        IOReactorConfig.Builder configBuilder = IOReactorConfig.custom();
        configBuilder.setIoThreadCount(threadCount).
            setConnectTimeout(settings.getElasticSearchConnectionRequestTimeout()).
            setSoTimeout(settings.getElasticSearchSocketTimeout()).
            setSoKeepAlive(true);

        DefaultConnectingIOReactor ioreactor = new DefaultConnectingIOReactor(configBuilder.build());

        // Setup a generic exception handler that just logs everything so we know this happened
        ioreactor.setExceptionHandler(new IOReactorExceptionHandler() {
            @Override
            public boolean handle(IOException e) {
                errorLogger.logResponse("IOReactor exception", HttpStatus.SC_UNPROCESSABLE_ENTITY, e);
                return false;
            }

            @Override
            public boolean handle(RuntimeException e) {
                errorLogger.logResponse("IOReactor exception", HttpStatus.SC_UNPROCESSABLE_ENTITY, e);
                return false;
            }
        });

        return new PoolingNHttpClientConnectionManager(
                ioreactor,
                RegistryBuilder.<SchemeIOSessionStrategy>create()
                        .register("http", NoopIOSessionStrategy.INSTANCE)
                        .register("https", sslStrategy)
                        .build());
    }

Health check code runs the following every 3 seconds:

    ServiceClient.getRestClient().getLowLevelClient().performRequestAsync(....) {
        on both success and failure if there is an exception {
            if (e.getMessage().contains("I/O reactor")) {
                ServiceClient.resetClient();
            }
        }
    }

I have so far not been able to reproduce this locally and the only way I could reproduce it consistently in our test env was by putting an envoy egress in front of the container, have ES connection go through that, and then run the following:

    for i in $(seq 150); do curl -X POST localhost:<serviceport>/route-that-has-server-hit-es >  /dev/null & done && curl -X POST localhost:<envoyport>/quitquitquit

I got thread dumps from the stuck threads and they all were stuck here:

    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.apache.http.concurrent.BasicFuture.get(BasicFuture.java:82)
    - locked <0x00000000a9521180> (a org.apache.http.concurrent.BasicFuture)
    at org.apache.http.impl.nio.client.FutureWrapper.get(FutureWrapper.java:70)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:244)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)

I have now ended up forking the client java code and passing a timeout to elasticsearch/RestClient.java at 7.7 · elastic/elasticsearch · GitHub

Thanks and sorry for the long post. Before creating an issue on your github repo I just want to make sure there is nothing fundamentally wrong with my set up...

1 Like

I think this question came up recently in a (mostly-unrelated) Github Issue. AFAIK closing the client should cancel any outstanding requests, but if that isn't the case, and you can show us how to reproduce it, then please open an issue. Note that TCP keepalives are definitely required to avoid infinite waits, so make sure you have configured them (see the OP of that same issue for how to do that).

Why not just use performRequestAsync()? It already returns a Cancellable which you can use to implement a timeout if you need such a thing.


EDIT: I haven't tried this, but I think from looking at the code that timing out the get() you linked doesn't actually cancel the underlying request, whereas cancelling the Cancellable returned from performRequestAsync() does. Not cancelling the request means you leak the connection so it doesn't return to the pool, and the pool has bounded size, so I think you can eventually run out of connections if they really are hanging forever.

Hmm that's suspicious too, I would say that we'd need you to reproduce this without any intermediaries before considering it an Elasticsearch issue. It's definitely possible for a buggy or misconfigured proxy to break the liveness guarantees that TCP keepalives give, and many such proxies seem to have trouble with the kinds of long-running HTTP requests that Elasticsearch uses.

Oh thanks David. TBH I had avoided the low level client's performRequestAsync just to not have to refactor our own internal code, and hadn't thought about leaking connections themselves.

And yeah I think that issue I had posted on in github is not directly related. I had just seen someone run into the same symptoms and commented without checking the context closely. Sorry about that!

It definitely happens without envoy egress too (once every couple of weeks in our prod env), but I have only been able to find a way to reproduce through envoy unfortunately. And in case it helps, our server doesn't have any long running queries against elasticsearch (our 99th percentile is around 400ms).

I'll try to see if I can dig more on my own or find a way to reproduce without envoy. Thanks!

It should be a fairly localised change to implement a timeout using performRequestAsync, you can just pass it a future and then use one of the block-with-a-timeout methods on the future.

Ahah, I hadn't noticed that was you too :slight_smile:

Can you reproduce it without encryption? If so, it might help to take simultaneous packet captures on the client and on Elasticsearch using tcpdump, so you can see what's actually happening on the wire. You could do that with or without Envoy, at least this way if the problem was in the middle we'd see it straight away.