Stopping and starting a big cluster: best practice?

What is the best practice for stopping and starting a running cluster?

My setup:
Elasticsearch 0.90.6

2 master-only nodes - each on its own box
52 data-only nodes - spread across 6 boxes, with 12, 12, 12, 2, 7, and 7 nodes
per box

Each node runs under supervision (supervisord), so it will be restarted
automatically if it crashes on its own.

Stop/Start routine:

  • turn off rivers (no more ingest)

  • turn off shard allocation (example settings calls are sketched after this list)
    "cluster.routing.allocation.disable_allocation": true
    "cluster.routing.allocation.disable_replica_allocation": true

  • stop and restart nodes on each box using supervisorctl
    $>supervisorctl stop elasticsearch-1
    $>supervisorctl start elasticsearch-1

  • wait for "initializing_shards" count to reach 0

  • turn on shard allocation

  • wait for "unassigned_shards" count to reach 0

  • turn on rivers
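
For reference, the allocation toggles and health checks above look roughly
like this on 0.90.x (node:9200 stands for any node's HTTP address):

    # disable allocation before the rolling restart
    curl -XPUT node:9200/_cluster/settings -d '{
      "transient" : {
        "cluster.routing.allocation.disable_allocation" : true,
        "cluster.routing.allocation.disable_replica_allocation" : true
      }
    }'

    # poll health and watch initializing_shards / unassigned_shards
    curl node:9200/_cluster/health?pretty

    # re-enable allocation once the nodes are back
    curl -XPUT node:9200/_cluster/settings -d '{
      "transient" : {
        "cluster.routing.allocation.disable_allocation" : false,
        "cluster.routing.allocation.disable_replica_allocation" : false
      }
    }'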

Result:
We almost always end up with one or a combination of several of these issues:

  • nodes pegged on heap and unresponsive (the cluster can't communicate with
    them, and they are not reachable via the API)
  • nodes stuck initializing shards forever
  • nodes stuck allocating shards forever
  • "ghost" nodes; a second copy of a node in the cluster state (NOT process
    actually running) with that same name, different id. This actually doesnt
    affect es performance much but it makes es-head and other tools break due
    duplicate node/key name.

Sometimes repeatedly closing and opening an index will get its shards to
allocate and initialize. Sometimes not.
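
(The close/open calls are just the index close/open API; myindex is a
placeholder for the affected index:)

    curl -XPOST node:9200/myindex/_close
    curl -XPOST node:9200/myindex/_open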

Thanks,
Mark


Shutdown:

curl -XPOST node:9200/_shutdown

In the latest versions (1.0.0.RC1), ES shutdown chooses the order in which
nodes are closed; it makes things less error-prone to shut down the current
master node last.
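
If you want to control the order yourself, the same API can also target
single nodes (a sketch, assuming the 0.90/1.x nodes-shutdown endpoints):

    # shut down only the node that receives the request
    curl -XPOST node:9200/_cluster/nodes/_local/_shutdown

    # shut down every node in the cluster with one call
    curl -XPOST node:9200/_shutdown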

Startup:

From a shell, execute a for loop over ssh and start your favorite wrapper
script on the remote nodes in parallel: master-eligible nodes first,
non-master-eligible nodes last.
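
A minimal sketch of such a loop (host names and supervisord program names are
placeholders for your own setup):

    # master-eligible nodes first
    for host in master1 master2; do
      ssh "$host" 'supervisorctl start elasticsearch-1' &
    done
    wait

    # then the data-only nodes, in parallel per box
    for host in data1 data2 data3 data4 data5 data6; do
      ssh "$host" 'supervisorctl start all' &
    done
    wait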

Jörg

On Fri, Jan 31, 2014 at 11:25 PM, Mark Conlin mark.conlin@gmail.com wrote:

What is the best practice for stopping and starting a running cluster?


I would add: flush the transaction log after you have indexed all your
content.
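
For example, as a full-cluster flush (you can also flush individual indices):

    curl -XPOST node:9200/_flush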

--
Ivan


I believe the heart of this issue is JVM memory usage.

So does it make sense to delete warmers before shutdown (so they don't try
to warm during initial node recovery)?
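
(If so, I assume the delete-warmer call would be along these lines; the index
and warmer names are placeholders:)

    curl -XDELETE node:9200/myindex/_warmer/my_warmer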

Does it make sense to lower these (each currently set at 8):
cluster.routing.allocation.node_initial_primaries_recoveries
cluster.routing.allocation.node_concurrent_recoveries

to limit the amount of work any one node does at once?
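
If I do lower them, I assume the update would be a transient cluster settings
call along these lines (the values are just examples):

    curl -XPUT node:9200/_cluster/settings -d '{
      "transient" : {
        "cluster.routing.allocation.node_initial_primaries_recoveries" : 2,
        "cluster.routing.allocation.node_concurrent_recoveries" : 2
      }
    }'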

Mark
