Best timeout value for setWaitForYellowStatus

Hey Everyone,

When I upgraded my Elasticsearch from 2.2.0 to 2.3.0, I started seeing slowness in shard allocation. So I added the code below in my client to wait until the cluster turns yellow:

client.admin().cluster().prepareHealth().setWaitForYellowStatus().setTimeout(TimeValue.timeValueMinutes(1)).execute().actionGet();

I have two questions.

  1. What is the best way to handle this scenario? In 2.3 it takes about 5 minutes to redistribute the shards vs. 30 seconds in 2.2. By default setWaitForYellowStatus waits for 30s.

  2. How can I increase the timeout? It looks like setTimeout is not working and it is falling back to the default of 30s.

Regards,

Hi,

I suggest you don't rely on the timeout but rather poll in a loop. Consider a huge index that would take hours to recover: if you rely on the timeout, you'd block one of your program threads for hours. The polling approach gives you the option to apply different strategies (e.g. increasing the waiting period between checks, asking for user input, stopping polling after a certain timeout, etc.). As a rough sketch, I'd do something like this:

while (true) {
  ClusterHealthResponse clusterHealthResponse = client.admin().cluster()
    .prepareHealth()
    .setWaitForYellowStatus()
    .setTimeout(TimeValue.timeValueMillis(500))
    .execute()
    .actionGet();
  if (clusterHealthResponse.isTimedOut()) {
    // wait for 20 seconds, then retry. You can do lots of fancy
    // things here - just as I've described above
    Thread.sleep(20 * 1000);
  } else {
    // we've reached yellow status, go on
    break;
  }
}
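
One of the strategies mentioned above - increasing the waiting period between checks - could be sketched like this (just an illustration, assuming the same client as before; the concrete start and maximum values are up to you):

// back off between checks: start with a short pause and double it up to a maximum
// (assumes the enclosing method declares "throws InterruptedException")
long waitMillis = 1000;
final long maxWaitMillis = 60 * 1000;
while (true) {
  ClusterHealthResponse clusterHealthResponse = client.admin().cluster()
    .prepareHealth()
    .setWaitForYellowStatus()
    .setTimeout(TimeValue.timeValueMillis(500))
    .execute()
    .actionGet();
  if (!clusterHealthResponse.isTimedOut()) {
    // we've reached yellow status, go on
    break;
  }
  Thread.sleep(waitMillis);
  waitMillis = Math.min(waitMillis * 2, maxWaitMillis);
}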

To your second point: I've set the timeout locally to one minute and measured the time. The request timed out after one minute, just as expected. But as I suggested above, I'd rather set a shorter timeout than the default and use polling.
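
Roughly sketched, the measurement looked like this (simplified, not my exact test code; the elapsed-time bookkeeping via java.util.concurrent.TimeUnit is only there to illustrate the point):

long start = System.nanoTime();
ClusterHealthResponse clusterHealthResponse = client.admin().cluster()
  .prepareHealth()
  .setWaitForYellowStatus()
  .setTimeout(TimeValue.timeValueMinutes(1))
  .execute()
  .actionGet();
// convert the measured duration to seconds and print whether the request timed out
long elapsedSeconds = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
System.out.println("timed out: " + clusterHealthResponse.isTimedOut() + " after " + elapsedSeconds + "s");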

Daniel

Hi Dani,
Thank you for your response. When I try to get the cluster health using the approach above, I get the following exception:

Exception in init thread : java.lang.IllegalStateException: ClusterService was close during health call
        at org.elasticsearch.action.admin.cluster.health.TransportClusterHealthAction$3.onClusterServiceClose(TransportClusterHealthAction.java:155) [elasticsearch-2.2.0.jar:2.2.0]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:225) [elasticsearch-2.2.0.jar:2.2.0]
        at org.elasticsearch.cluster.service.InternalClusterService.doStop(InternalClusterService.java:208) [elasticsearch-2.2.0.jar:2.2.0]
        at org.elasticsearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:88) [elasticsearch-2.2.0.jar:2.2.0]
        at org.elasticsearch.node.Node.stop(Node.java:300) [elasticsearch-2.2.0.jar:2.2.0]
        at org.elasticsearch.node.Node.close(Node.java:325) [elasticsearch-2.2.0.jar:2.2.0]
        at org.elasticsearch.bootstrap.Bootstrap$4.run(Bootstrap.java:157) [elasticsearch-2.2.0.jar:2.2.0]

But when I checked the Elasticsearch log, the cluster was up and shard allocation was in progress.

Hi,

Interesting. How do you connect to Elasticsearch? Do you use the node client or the transport client?

Daniel

I am using the Transport Client. This doesn't happen every time, though.

Does this also happen with the default timeout? Because the only things that have changed are the timeout and that you now repeat the call on timeout.

Yes, it happens with the default timeout as well. What could be the reason? Even though this health check is in a loop, I don't get the status on the very first call. Rather than handling this kind of issue on the client side, I feel it's better to handle it on the service side.

Hi,

The exception trace indicates that one of your cluster nodes is about to shut down (the others can still be up). Do you have multiple nodes in your cluster? Did you check the logs on all of them?

If by "service side" you mean that it should be handled by Elasticsearch: it is perfectly fine that nodes leave and join the cluster (you may want to take down a node for maintenance, for example).
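
If you still want your polling loop to tolerate that on the client side, you could catch the failure and retry - a rough sketch along the lines of the loop above (the broad catch and the fixed 20-second pause are just placeholders; choose whatever fits your case):

// same polling loop as before, but tolerant of transient failures
// (assumes the enclosing method declares "throws InterruptedException")
while (true) {
  try {
    ClusterHealthResponse clusterHealthResponse = client.admin().cluster()
      .prepareHealth()
      .setWaitForYellowStatus()
      .setTimeout(TimeValue.timeValueMillis(500))
      .execute()
      .actionGet();
    if (!clusterHealthResponse.isTimedOut()) {
      // we've reached yellow status, go on
      break;
    }
  } catch (IllegalStateException | ElasticsearchException e) {
    // e.g. the node we talked to was shutting down - log it and fall through to retry
  }
  Thread.sleep(20 * 1000);
}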

Daniel

Yes, I have three nodes in the cluster, and it looks like all nodes were up.

Is there any exception trace in the log that indicates a problem? Can you correlate the log events on the cluster's nodes with the problem on the client side?