Configuration parameters to address slow node startup


#1

Hey there,

I am working on a single-node cluster that currently has about 700 indices, each with 5 shards. I am seeing slow starts of the Elasticsearch node (up to 4m30s). I found the following parameter in the documentation:

cluster.routing.allocation.node_initial_primaries_recoveries

(see https://www.elastic.co/guide/en/elasticsearch/reference/5.x/shards-allocation.html).

I see its default value is 4, allowing 4 concurrent primary recoveries from disk per node. I bumped this up and am getting much better node restart times (about 1m30s with the setting at 128). I am completely winging it on the number here, playing with values from 16 up to 256 (although I noticed that, for my case, beyond a certain point I no longer gain any significant improvement).
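For reference, this is how I am setting it; the value shown is the one that worked for me, but treat it as illustrative rather than a recommendation:

```yaml
# elasticsearch.yml -- raise the cap on concurrent local primary
# recoveries per node (the default is 4); 128 is the value I am
# currently experimenting with on this single-node cluster.
cluster.routing.allocation.node_initial_primaries_recoveries: 128
```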

Is there a recommended maximum for this parameter (maybe the number of cores on the server)? And I wonder why 4 is the default.

Also, is there any other configuration parameter you know about that might help speed up the node restart?

Thanks,

Francisco.


(Jason Tedor) #2

Are you on SSDs or spinning disks? Can you take a stack dump (jstack) during the slow recovery and share it here?


#3

Hey Jason,

Thanks for the quick reply.

We are on spinning disks. I have a zip file containing captures of jstack.
Here is what I did:

  1. Set up ES with the param set to 4. Started ES, captured jstack every 5s.
  2. Set up ES with the param set to 128. Started ES, captured jstack every 5s.

Each directory inside the zip file contains the captured jstack outputs for that param setting. Files are named by capture number. I also included the ES startup log so it can be cross-referenced with the jstack captures.
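In case it helps anyone reproduce this, a minimal sketch of a capture loop like the one described might look as follows (the `jps`-based PID lookup and the `jstack-NNN.txt` naming are my assumptions; adapt them to your environment):

```shell
#!/bin/sh
# Sketch: dump Java thread stacks every 5 seconds while the
# Elasticsearch process is alive. PID discovery via jps and the
# jstack-NNN.txt file naming are illustrative assumptions.
PID=$(jps 2>/dev/null | awk '/Elasticsearch/ {print $1}')
i=1
while [ -n "$PID" ] && kill -0 "$PID" 2>/dev/null; do
  jstack "$PID" > "$(printf 'jstack-%03d.txt' "$i")"
  i=$((i + 1))
  sleep 5
done
```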

I tried uploading the zip file to the forum but it did not allow me to. I think having these captures every 5s may give you more information than an isolated capture that might miss the issue. Is there a preferred way for me to share this zip file?

Please let me know if there is anything else I can supply to help understand this scenario.

Thanks,

Francisco.


#4

Hey there,

Each individual jstack output is too big for the forum message limit, and with just one you might miss the whole picture. I have placed a zip file with the 5s captures here:

https://www.dropbox.com/s/hhxbizu2o8n7gaf/jstack-results.zip?dl=0

The long wait happens somewhere between this ES log line:

[2017-01-09T10:00:14,285][INFO ][o.e.g.GatewayService ] [WIN2K12R2IMAGE] recovered [665] indices into cluster_state

and this log line:

[2017-01-09T10:04:32,149][INFO ][o.e.c.r.a.AllocationService] [WIN2K12R2IMAGE] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[...]).

Thanks,

Francisco.


(Jason Tedor) #5

Your node is indeed spending a lot of time recovering the shards from disk (reading shard metadata, reading the Lucene commit point, etc.). Other than what you've done, short of reducing the number of indices or getting faster disks, there is not much else you can do here.


#6

Thanks Jason! So no other config params that might help us here?

We are setting 'cluster.routing.allocation.node_initial_primaries_recoveries' to 128 with good results. Would you have any objections to that number?

Thanks again,

Francisco.


(Jason Tedor) #7

Spinning disks do not like concurrency; my only concern would be thrashing the disk.


#8

Thanks Jason! :slight_smile:


(Jason Tedor) #9

You're very welcome.


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.