Hi,
We are running an Elasticsearch 7.7 cluster and have a service accessing it through the Java REST client (7.7.0).
The service is a high-traffic Spring Boot application deployed in our organization's Kubernetes clusters. From time to time the client gets into an "I/O reactor has been shut down" state, and we are adding health checks that recover from that state by reinitializing the client.
The problem we have is that if the I/O reactor shutdown happens during high traffic, while some of our server threads are mid-call to Elasticsearch, the ES client gets stuck forever (we once had threads stuck overnight for 11 hours), blocking those server threads. Going by similar-looking issues on your repository I tried adding all kinds of timeouts, but saw no change in behavior.
For the record, this is how we initialize/reinitialize the client:
public void resetClient() {
    if (!connectLock.tryLock()) {
        return;
    }
    try {
        if (restClient != null) {
            try {
                this.close();
            } catch (IOException e) {
                errorLogger.logResponse("Could not close client properly", 422, e);
            }
        }
        RestClientBuilder lowLevelClientBuilder = RestClient.builder(httpHost)
            .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                .setConnectionRequestTimeout(settings.getElasticSearchConnectionRequestTimeout()) // 10000
                .setConnectTimeout(settings.getElasticSearchConnectTimeout()) // 10000
                .setSocketTimeout(settings.getElasticSearchSocketTimeout())) // 30000
            .setHttpClientConfigCallback(httpClientBuilder -> {
                PoolingNHttpClientConnectionManager cManager = null;
                try {
                    cManager = createConnectionManager(settings.getElasticSearchIoThreadCount()); // 8
                } catch (final IOReactorException e) {
                    errorLogger.logResponse("Error initializing ES client connectionManager. Shutting down", HttpStatus.SC_UNPROCESSABLE_ENTITY, e);
                    throw new IllegalStateException(e);
                }
                httpClientBuilder
                    .setConnectionManager(cManager)
                    .setKeepAliveStrategy((response, context) -> 60000 /* 1 minute */)
                    .setMaxConnTotal(settings.getElasticSearchMaxConnTotal())
                    .setMaxConnPerRoute(settings.getElasticSearchMaxConnPerRoute());
                return httpClientBuilder;
            });
        this.restClient = new RestHighLevelClient(lowLevelClientBuilder);
    } finally {
        connectLock.unlock();
    }
}
private PoolingNHttpClientConnectionManager createConnectionManager(int threadCount) throws IOReactorException {
    // Set up everything just as the builder would do it
    SSLContext sslcontext = SSLContexts.createSystemDefault();
    PublicSuffixMatcher publicSuffixMatcher = PublicSuffixMatcherLoader.getDefault();
    HostnameVerifier hostnameVerifier = new DefaultHostnameVerifier(publicSuffixMatcher);
    SchemeIOSessionStrategy sslStrategy = new SSLIOSessionStrategy(sslcontext, null, null, hostnameVerifier);

    // Create the custom reactor
    IOReactorConfig.Builder configBuilder = IOReactorConfig.custom()
        .setIoThreadCount(threadCount)
        .setConnectTimeout(settings.getElasticSearchConnectionRequestTimeout()) // reuses the connection-request value (both are 10000)
        .setSoTimeout(settings.getElasticSearchSocketTimeout())
        .setSoKeepAlive(true);
    DefaultConnectingIOReactor ioreactor = new DefaultConnectingIOReactor(configBuilder.build());

    // Set up a generic exception handler that just logs everything so we know this happened
    ioreactor.setExceptionHandler(new IOReactorExceptionHandler() {
        @Override
        public boolean handle(IOException e) {
            errorLogger.logResponse("IOReactor exception", HttpStatus.SC_UNPROCESSABLE_ENTITY, e);
            return false;
        }

        @Override
        public boolean handle(RuntimeException e) {
            errorLogger.logResponse("IOReactor exception", HttpStatus.SC_UNPROCESSABLE_ENTITY, e);
            return false;
        }
    });

    return new PoolingNHttpClientConnectionManager(
        ioreactor,
        RegistryBuilder.<SchemeIOSessionStrategy>create()
            .register("http", NoopIOSessionStrategy.INSTANCE)
            .register("https", sslStrategy)
            .build());
}
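One thing I'm second-guessing in the handler above: as far as I understand IOReactorExceptionHandler, returning false tells the reactor to shut itself down after the callback, so our handler logs the exception and then stops the reactor. If the intent were to log and keep the reactor alive, it would have to return true instead, along these lines (a sketch of that variant, same logger call as above):

ioreactor.setExceptionHandler(new IOReactorExceptionHandler() {
    @Override
    public boolean handle(IOException e) {
        errorLogger.logResponse("IOReactor exception", HttpStatus.SC_UNPROCESSABLE_ENTITY, e);
        return true; // handled: the reactor keeps running
    }

    @Override
    public boolean handle(RuntimeException e) {
        errorLogger.logResponse("IOReactor exception", HttpStatus.SC_UNPROCESSABLE_ENTITY, e);
        return true; // handled: the reactor keeps running
    }
});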
The health check code runs the following every 3 seconds (the actual request is elided):
ServiceClient.getRestClient().getLowLevelClient().performRequestAsync(request, new ResponseListener() {
    @Override
    public void onSuccess(Response response) { /* the same check applies to any exception surfaced on success */ }

    @Override
    public void onFailure(Exception e) {
        if (e.getMessage() != null && e.getMessage().contains("I/O reactor")) {
            ServiceClient.resetClient();
        }
    }
});
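The scheduling itself is nothing special; a minimal sketch of the 3-second loop, assuming a plain ScheduledExecutorService (pingElasticsearch is a made-up name standing in for the performRequestAsync call above):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: fire the async health-check ping every 3 seconds.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(healthCheck::pingElasticsearch, 0, 3, TimeUnit.SECONDS);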
I have so far not been able to reproduce this locally. The only way I could reproduce it consistently in our test environment was to put an Envoy egress proxy in front of the container, route the ES connection through it, and then run the following:
for i in $(seq 150); do curl -X POST localhost:<serviceport>/route-that-has-server-hit-es > /dev/null & done && curl -X POST localhost:<envoyport>/quitquitquit
I took thread dumps, and the stuck threads were all waiting here:
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.http.concurrent.BasicFuture.get(BasicFuture.java:82)
- locked <0x00000000a9521180> (a org.apache.http.concurrent.BasicFuture)
at org.apache.http.impl.nio.client.FutureWrapper.get(FutureWrapper.java:70)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:244)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
I have now ended up forking the client code and passing a timeout to the blocking get() in RestClient.java (7.7 branch of elastic/elasticsearch on GitHub).
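The non-fork workaround I'm also experimenting with is a wrapper that goes through the async API and enforces the deadline on the caller side. A minimal sketch against the low-level client; BoundedRequest and performWithDeadline are names I made up:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.elasticsearch.client.Cancellable;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseListener;
import org.elasticsearch.client.RestClient;

public final class BoundedRequest {

    // Run a request through the async API and bound the wait ourselves, so a
    // dead I/O reactor cannot park the calling thread forever the way the
    // synchronous performRequest() can.
    public static Response performWithDeadline(RestClient client, Request request, long timeoutSeconds)
            throws Exception {
        CompletableFuture<Response> future = new CompletableFuture<>();
        Cancellable cancellable = client.performRequestAsync(request, new ResponseListener() {
            @Override
            public void onSuccess(Response response) {
                future.complete(response);
            }

            @Override
            public void onFailure(Exception exception) {
                future.completeExceptionally(exception);
            }
        });
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            cancellable.cancel(); // best effort; a dead reactor may never invoke the listener
            throw e;
        }
    }
}

The cancel() on timeout is best effort: if the reactor is already dead the listener may never fire, but at least the server thread gets released.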
Thanks, and sorry for the long post. Before creating an issue on your GitHub repo I just want to make sure there is nothing fundamentally wrong with my setup...