Time taken by a Cluster State Update Task

ananth · August 31, 2015, 6:58am

Hi,

Last week we rolled out an update to our application, which uses node clients to communicate with the ES cluster.
We have 15 node clients and update them one by one. The application update finished on all 15 machines within ten minutes, but after the restart none of them joined the cluster for up to 40 mins.

Looking at the master's log, all the cluster state update tasks took more than 3 mins.

Example:

cluster update task [zen-disco-node_failed([esClient_MachineIp] [NodeName] ... ,reason transport disconnected] took 3.7m above the warn threshold of 30s
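To see how often the master logs this warning and how long each task took, the log can be scanned for lines in the format quoted above. This is only a minimal sketch: the regular expression is derived from the single example line in this post, the function name `slow_task_seconds` is hypothetical, and other Elasticsearch versions may word the warning differently.

```python
import re

# Pattern based on the one warning line quoted in this thread (an assumption,
# not an official log format specification).
SLOW_TASK = re.compile(
    r"cluster update task \[(?P<source>.*)\] "
    r"took (?P<took>[\d.]+)(?P<unit>ms|s|m|h) above the warn threshold of"
)

# Seconds per unit suffix used in the "took" value.
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def slow_task_seconds(line):
    """Return the reported task duration in seconds, or None if the line
    is not a slow cluster-update-task warning."""
    match = SLOW_TASK.search(line)
    if match is None:
        return None
    return float(match.group("took")) * UNIT_SECONDS[match.group("unit")]


line = (
    "cluster update task [zen-disco-node_failed([esClient_MachineIp] [NodeName] ... "
    ",reason transport disconnected] took 3.7m above the warn threshold of 30s"
)
print(slow_task_seconds(line))  # roughly 222 seconds
```

Summing the returned values over the master log for the restart window gives a rough check against the 40 mins observed.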

I have the following settings in elasticsearch.yml:
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

warkolm · August 31, 2015, 7:12am

Can you explain this a bit more? What is happening here, and what do you expect? It's just not really clear to me.

ananth · August 31, 2015, 7:26am

Thanks for the reply.

We have 3 groups of machines in our application:

Console grid - end users use it to search logs (3 machines with node clients)
Indexer grid - used to index logs into ES (6 machines with node clients)
Monitor grid - used to trigger scheduled searches and alert users based on the results (number of 500 statuses in the last 15 mins, number of exceptions in the last 15 mins, etc.) (6 machines with node clients)

All 3 kinds of grid use MySQL. Our application had a schema change, so we had to restart all 3 grids (the restart happens one machine at a time).
Once the restart completed, it took nearly 40 mins for each node to join the ES cluster. As a result, end users were unable to search for 40 mins, we were unable to trigger alerts on time, and indexing was delayed by 40 mins.

warkolm · August 31, 2015, 7:27am

That seems unusual. Are they all in the same DC? Were there networking issues?
Did you check the logs of the cluster master?

ananth · August 31, 2015, 7:33am

Hope this clears it up.

The zen-disco-node_failed and node-join cluster update tasks together took 40 mins for the 15 node clients.

ananth · August 31, 2015, 7:34am

Yes, all are in the same DC. Looking at the master logs, all the pending tasks took more than 3m. There are no network issues.

Will the following properties affect node disconnection (60s * 5 = 5 mins)?

discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

warkolm · August 31, 2015, 8:17am

Yep they will.

Is there any reason why you increased those from the defaults?

ananth · August 31, 2015, 9:42am

Yes, a long time ago we faced long GC runs on a few data nodes.

If I remember correctly, a few nodes suffered from long GC runs (2+ mins), so we decided to increase the time a node can survive a GC pause from the default of 93 secs (ping_timeout 30s, interval 1s, 3 retries) to 5 mins.

If I understand correctly, the master marks a data node as dead when a GC run lasts longer than 93 secs (am I right?).
In that case unnecessary shard movement is triggered, which in turn affects regular indexing/searching activity.

That's why we increased those values. Is it possible to have different values for data nodes and client nodes?
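The arithmetic behind the 93 secs and 5 mins figures can be sketched as follows. It assumes the approximation used in this post, retries * (timeout + interval), which is not taken from the Elasticsearch source; the function name `fd_window_seconds` is hypothetical.

```python
def fd_window_seconds(ping_timeout_s, ping_retries, ping_interval_s=1.0):
    """Approximate time before an unresponsive node is declared failed.

    Assumes each retry waits one ping interval and then the full ping
    timeout, which is the estimate used in this thread.
    """
    return ping_retries * (ping_timeout_s + ping_interval_s)


default_window = fd_window_seconds(30, 3)  # defaults quoted above: 30s, 3 retries
custom_window = fd_window_seconds(60, 5)   # values from elasticsearch.yml: 60s, 5 retries
print(default_window, custom_window)  # 93.0 305.0
```

Under this estimate the custom settings give a window of about 305 secs, slightly over the 5 mins quoted.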

warkolm · August 31, 2015, 11:31am

That makes some sense then. Are you still getting long GCs?

ananth · August 31, 2015, 12:14pm

Not recently, but it is possible at any time given our JVM options: -Xmx and -Xms are 31GB and -Xmn is 14GB.

warkolm · August 31, 2015, 9:10pm

I'd suggest you drop heap to 30.5GB or even 30GB. Above 30.5 you hit the compressed pointers issue. Also don't touch any other settings for the JVM.

You really want some kind of monitoring here, what is happening to your nodes and the cluster, to give you better correlation.

ananth · September 1, 2015, 6:19am

I went through the following link and set the heap to 31G.

https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

Thanks for the suggestion, Mark. I will reduce the JVM heap to 30G. I forgot to mention the following property:

**-XX:CMSInitiatingOccupancyFraction=78** (Java 1.7.0_55). I hope this won't be a problem with regard to long GCs.
