Java API ping operation hangs the application on Elasticsearch 7.11.0

I have 3 Java applications (using the RestHighLevelClient 7.11.0 client API) connected to the same Elasticsearch server, version 7.11.0, and I periodically check the connection to Elasticsearch with a ping operation, using client.ping(RequestOptions.DEFAULT).
Sometimes one of these applications hangs when executing this ping. From that moment on, the application doesn't work anymore, because this check hangs it. It never gets any response from Elastic, and no exception is thrown.
But the other applications still work without any problem, because the same ping operation works for them. It's very strange...

How long is "never"? Did you wait minutes? Hours?

I have this in my code:

private boolean isConnected() {
    try {
        return client.ping(RequestOptions.DEFAULT);
    } catch (IOException e) {
        logger.warn("Elastic Search not connected: " + e.getMessage());
        return false;
    }
}

And it never enters the catch block. According to the Javadoc, IOException is the only exception this method can throw...
I waited some minutes watching the screen, without any result...

Thanks. What operating system are you running this on?

Also can you tell us a bit more about the environment in which your cluster runs? Is it in containers or VMs or bare metal, and what kind of network infra (proxies/service mesh etc) do you have?

David, the Elasticsearch server is installed on a virtual machine. I don't have the details of the physical machine, but the virtual one has the following:

  • 1 core Intel Xeon X7560 @ 2.27GHz
  • 4 GB RAM

This is the configuration of a PREPRODUCTION environment.

Elastic is installed in a cluster with a single node.

We upgraded last week from 7.9.3 to 7.11.0. Before this upgrade, we never got this error in the applications.
This is the only reference to something related to the ping that I found: Breaking changes in 7.0 | Elasticsearch Reference [7.11] | Elastic, but I don't know if this is the cause of the error I'm getting.

Thanks, but you forgot to mention the OS you're using, and anything about the network.

The item you linked in the release notes is for the 7.0 release, and has nothing to do with client pings, so I don't think it is relevant here.

The operating system is CentOS. I don't know the version at this moment, but it's recent.
About the network, I don't know. But as I told you, it didn't happen in version 7.9.3...
Thanks

Ok. What is this thread stuck doing? Can you share a stack trace?

Do you mean the thread that is executing the above code?

This code is executed each time a list request arrives at our application on the server (Tomcat or JBoss). We check whether the connection with the Elastic server is alive, and in that case our server executes the list request against Elastic. Otherwise, it is executed against the database (SQL Server).

The problem is that this client.ping() operation hangs the thread, without throwing any exception.

As I told you, the Javadoc says that this ping operation throws an IOException. Maybe catching an Exception instead of an IOException is better, I don't know...

Sure, but where is it hanging? We need to see a stack trace of that thread while it is stuck so we can understand more deeply what is causing this.

But David, I cannot send it to you because the application is deployed on a server, not on my machine. The hang occurs when calling the client.ping() operation. The stack trace of what is called before it is not relevant, I think; the stack trace of what is called after it, I can't get by debugging the application, because I lose control when it is called. How can I send it to you?

Nevertheless, we are going back to a previous version of Elastic; our customers can't wait because the application is not working, you know...

Yes it's the stack trace after this call that we care about. Are you saying you can't even run jstack <PID> to retrieve that? If so, sorry, we're out of luck, I don't think we can help without that information.
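
Where running jstack against the server process is inconvenient, a similar dump can be produced from inside the JVM. The following is a minimal sketch, assuming it can be called from within the application (for example from an admin endpoint or a scheduled task). The class name ThreadDumper is hypothetical; only standard java.lang.management calls are used.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {

    /** Returns a text dump of all live threads, similar in spirit to jstack output. */
    public static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip monitor and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" state=")
              .append(info.getThreadState()).append('\n');
            // Print every frame ourselves; ThreadInfo.toString() truncates the stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

A thread stuck in the ping should appear in this output with RestHighLevelClient.ping among its frames, which is the part of the trace being asked for here.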

Of course, we can execute it on the server, yes. Nevertheless, as I told you, we are going back to a previous version, 7.10.2 in this case. If we get the same error, I will execute this command and post the result here, ok?

But I have a question: if my code worked on 7.9.3 without any problem, and after upgrading to 7.11.0 it fails... I think that it's not due to any problem on our side... Didn't you get this problem in your tests?

As I told you, it's a very strange situation, because we have 3 WARs deployed in this JBoss, and only one of them hangs. The other 2 WARs still work without any problem...

Thanks David.

No, I haven't seen anything like this in any tests, nor can I reproduce it now, so your help in trying to understand what is going on in your environment (and therefore what has changed between the versions) is very much appreciated.

I have installed version 7.11.0 again in our DEV environment. If I face this problem again, I'll post the result of this command here.

Thank you very much David.

We get the same problem with 7.10.2. We have multiple threads blocked due to this problem... This is the stack trace:

Thread 44194: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
 - org.apache.http.concurrent.BasicFuture.get() @bci=8, line=82 (Interpreted frame)
 - org.apache.http.impl.nio.client.FutureWrapper.get() @bci=4, line=70 (Interpreted frame)
 - org.elasticsearch.client.RestClient.performRequest(org.elasticsearch.client.RestClient$NodeTuple, org.elasticsearch.client.RestClient$InternalRequest, java.lang.Exception) @bci=48, line=279 (Interpreted frame)
 - org.elasticsearch.client.RestClient.performRequest(org.elasticsearch.client.Request) @bci=17, line=270 (Interpreted frame)
 - org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(java.lang.Object, org.elasticsearch.common.CheckedFunction, org.elasticsearch.client.RequestOptions, org.elasticsearch.common.CheckedFunction, java.util.Set) @bci=24, line=1632 (Interpreted frame)
 - org.elasticsearch.client.RestHighLevelClient.performRequest(org.elasticsearch.client.Validatable, org.elasticsearch.common.CheckedFunction, org.elasticsearch.client.RequestOptions, org.elasticsearch.common.CheckedFunction, java.util.Set) @bci=38, line=1617 (Interpreted frame)
 - org.elasticsearch.client.RestHighLevelClient.ping(org.elasticsearch.client.RequestOptions) @bci=22, line=775 (Interpreted frame)
 - es.prodevelop.pui.elasticsearch.PuiElasticSearchManager.isConnected() @bci=16, line=137 (Interpreted frame)
...

Thanks

I don't know if it could be a problem, but what I see on my side (in my code) is that the BasicFuture class is from the httpcore library, and we're overriding its version with 4.5.13, while Elasticsearch uses 4.5.12.

Do you think that this could be the problem?

But as far as I can see, in version 7.9.3 Elastic uses version 4.5.10 and we were also using 4.5.13 in our application, and it worked fine...

Thanks

One question... One change I made in my code was to force a refresh of the index after each operation performed (inserting a new document, or updating/deleting an existing one). I manually execute the following:

...
...
getClient().bulkAsync(request, RequestOptions.DEFAULT, new BulkRequestActionListener(indices));
...
...

private class BulkRequestActionListener implements ActionListener<BulkResponse> {

	private String[] indices;

	public BulkRequestActionListener(String... indices) {
		this.indices = indices;
	}

	@Override
	public void onResponse(BulkResponse response) {
		RefreshRequest refreshReq = new RefreshRequest(indices);
		getClient().indices().refreshAsync(refreshReq, RequestOptions.DEFAULT,
				new ActionListener<RefreshResponse>() {
					@Override
					public void onResponse(RefreshResponse response) {
						// do nothing
					}

					@Override
					public void onFailure(Exception e) {
						// do nothing
					}
				});
	}

	@Override
	public void onFailure(Exception e) {
		// do nothing
	}
}

Is it a bad idea to do this? Maybe you wonder why I'm doing that. The answer is that I need the index statistics to be up to date, because if I perform a bulk insert operation on an index and then perform a count operation, sometimes I don't get the real number of indexed documents.

Could this operation be crashing the application?

This should work fine, although I do wonder how this is better than calling bulkRequest.setRefreshPolicy(RefreshPolicy.IMMEDIATE) before sending the bulk request in the first place.

I wouldn't expect a difference in the behaviour of futures between 4.5.12 and 4.5.13, although I've not checked the release notes.

The stack trace tells us that it has passed the request off to the HTTP layer and is now waiting for a response. I think if you set logger.org.apache.http: TRACE you'll get more details on whether that request has actually been sent or not. Assuming it has gone out over the wire, note that it will usually be trying to re-use a connection, and the most common problem with that is when the network has silently dropped the connection. On CentOS by default it takes ≥15 minutes to detect such a bad connection (you can configure this). You can also use TCP keepalives to clean up any dropped connections more eagerly, but note that by default the keepalive interval is 2 hours.
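
Since the blocked thread is simply waiting on a future with no deadline, one way to stop a health check from hanging its caller is to put an upper bound on the wait. The following is a minimal sketch using only the JDK. The class name BoundedPing, the "es-ping" thread name and the timeout values are hypothetical, and the Callable stands in for the real client.ping(RequestOptions.DEFAULT) call. It limits how long the caller waits; it does not repair a dropped connection.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedPing {

    // Daemon threads, so a stuck check cannot keep the JVM alive on shutdown.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "es-ping");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the given check on a worker thread and returns false if it fails
     * or does not finish within timeoutMillis.
     */
    public static boolean isConnected(Callable<Boolean> check, long timeoutMillis) {
        Future<Boolean> future = POOL.submit(check);
        try {
            return Boolean.TRUE.equals(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the caller moves on
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the check itself threw, e.g. an IOException
        }
    }

    public static void main(String[] args) {
        // Healthy check: answers immediately.
        System.out.println(isConnected(() -> true, 500));
        // Simulated hang: sleeps far longer than the timeout.
        System.out.println(isConnected(() -> { Thread.sleep(60_000); return true; }, 200));
    }
}
```

In the application this could be called as isConnected(() -> client.ping(RequestOptions.DEFAULT), 5000), falling back to the database when it returns false. If the underlying call does not react to interruption, each timed-out check leaves a worker thread behind, so this is a mitigation to use while the connection problem itself is investigated.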

Going back again to version 7.9.3, we're getting the same error now, so I think it's more an error on my side than a bug introduced in 7.10.2 or 7.11.0.

I removed the manual refresh, which is the only change I made in my applications in the last week, and set the bulkRequest.setRefreshPolicy(RefreshPolicy.IMMEDIATE) property that you mentioned.

I will keep testing the applications to see whether they fail or not, but after a whole day working with this change, it seems to work. I don't know why a simple programmatic refresh crashed the connection between the Java application and the Elastic server... I don't know where the root of this error is... it's a mystery.