What is the best practice for stopping and starting a running cluster?
My setup:
Elasticsearch 0.90.6
2 Master only nodes - each on their own box
52 Data-only nodes - spread across 6 boxes, with 12, 12, 12, 2, 7, and 7 nodes
per box respectively
Each node runs under supervision (supervisord) so that it is restarted
automatically if it crashes.
Stop/Start routine:
turn off rivers (no more ingest)
turn off shard allocation
"cluster.routing.allocation.disable_allocation":true
"cluster.routing.allocation.disable_replica_allocation":true
stop and restart nodes on each box using supervisorctl
$>supervisorctl stop elasticsearch-1
$>supervisorctl start elasticsearch-1
wait for "initializing_shards" count to reach 0
turn on shard allocation
wait for "unassigned_shards" count to reach 0
turn on rivers
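
For reference, a rough sketch of how that routine can be scripted (host names are placeholders, the grep checks are a crude stand-in for real JSON parsing, river handling is omitted, and the settings are the 0.90.x allocation-disable flags):

# minimal sketch of the stop/start routine above
ES=http://localhost:9200

# disable shard allocation before the restarts
curl -XPUT "$ES/_cluster/settings" -d '{
  "transient": {
    "cluster.routing.allocation.disable_allocation": true,
    "cluster.routing.allocation.disable_replica_allocation": true
  }
}'

# restart the node(s) under supervisord
supervisorctl stop elasticsearch-1
supervisorctl start elasticsearch-1

# wait for initializing_shards to reach 0 (crude substring match)
until curl -s "$ES/_cluster/health" | grep -q '"initializing_shards":0'; do
  sleep 10
done

# re-enable allocation
curl -XPUT "$ES/_cluster/settings" -d '{
  "transient": {
    "cluster.routing.allocation.disable_allocation": false,
    "cluster.routing.allocation.disable_replica_allocation": false
  }
}'

# wait for unassigned_shards to reach 0
until curl -s "$ES/_cluster/health" | grep -q '"unassigned_shards":0'; do
  sleep 10
done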
Result:
We almost always end up with one or a combination of several issues:
nodes pegged on heap and unresponsive (the cluster can't communicate with
them, and they are not reachable via the API)
nodes stuck initializing shards forever
nodes stuck allocating shards forever
"ghost" nodes; a second copy of a node in the cluster state (NOT process
actually running) with that same name, different id. This actually doesnt
affect es performance much but it makes es-head and other tools break due
duplicate node/key name.
Some times, repeated opening and closing and index will get its shards to
allocate and initialize. Sometimes not.
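
That close/open workaround is just the index close/open API, e.g. (the index name and host are placeholders):

# close and reopen a stuck index
curl -XPOST "http://localhost:9200/my_index/_close"
curl -XPOST "http://localhost:9200/my_index/_open"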
In the latest versions (1.0.0.RC1), ES shutdown chooses a strategy for the
order in which nodes are closed; it is less error-prone to shut down the
current master node last.
Startup:
from a shell, run a for loop over ssh to start your favorite wrapper script
on the remote nodes in parallel: master-eligible nodes first,
non-master-eligible nodes last (see the sketch below).
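
A minimal sketch of such a startup loop (host names and the supervisorctl commands are placeholders for whatever wrapper script is actually in use):

# placeholder host lists; adjust for your environment
MASTERS="master1 master2"
DATA_NODES="data1 data2 data3 data4 data5 data6"

# start master-eligible nodes first
for h in $MASTERS; do
  ssh "$h" "supervisorctl start elasticsearch-1" &
done
wait

# then start the data (non-master-eligible) nodes in parallel
for h in $DATA_NODES; do
  ssh "$h" "supervisorctl start all" &
done
wait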
I believe the heart of this issue is JVM memory usage.
So does it make sense to delete warmers before shutdown (so they don't try
to warm during initial node recovery)?
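
For reference, deleting a warmer is just the warmer API; the index and warmer names here are placeholders:

# delete a named warmer from an index
curl -XDELETE "http://localhost:9200/my_index/_warmer/my_warmer"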
Does it make sense to lower the following settings (currently set to 8):
cluster.routing.allocation.node_initial_primaries_recoveries
cluster.routing.allocation.node_concurrent_recoveries
to limit the amount of work any one node will do at once?
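
For example, lowering them would look something like this (the value 2 is only an illustration, not a recommendation, and this assumes these settings are dynamically updatable via the cluster settings API in this version; otherwise they go in elasticsearch.yml):

# lower concurrent recovery limits (2 is an arbitrary example value)
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 2,
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}'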