Trying to optimize configuration for better cluster restart/recovery

At first, I noticed what some have called "shard thrashing," i.e., during
startup shards are re-allocated as nodes come online.

I have implemented the following by either adding new settings or modifying
existing ones in elasticsearch.yml:

  1. Disable allocation altogether

cluster.routing.allocation.disable_allocation: true

  2. Avoid split-brain in the current 5-node cluster

discovery.zen.minimum_master_nodes: 3

  3. Increase the discovery timeout

discovery.zen.ping.timeout: 100s
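
(For reference, the allocation setting can also be applied dynamically
through the cluster settings API instead of elasticsearch.yml; a rough
sketch, assuming a node reachable on localhost:9200. Note that "transient"
settings do not survive a full cluster restart, while "persistent" ones do.)

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disable_allocation": true
  }
}'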

Specific Objective:
When the cluster restarts, force it to re-use the shard allocation that
existed before shutdown.

Attempt:

  • Increased discovery.zen.minimum_master_nodes to 5 in the 5-node cluster,
    with the idea that each node would refuse to become operational until all
    5 nodes in the cluster were recognized.

Result:
Unfortunately, despite setting this equal to the total number of nodes in
the cluster, I observed shard re-allocation once 4 of the 5 nodes were up,
without waiting for the fifth node to come online. And this is with
allocation disabled.

I would like an opinion on whether what I'm trying to accomplish is even
possible:

  • As much as possible, force a restarted cluster to use existing shards as
    they were already allocated
  • Start all nodes at once rather than doing rolling node starts, which
    contributes to shard re-allocation

TIA,
Tony


Shard allocation should never happen if disable_allocation is enabled.
Which version are you using? Are you doing a rolling restart or a full
cluster restart?

Two things that might help. First, execute a flush before restarting; I
believe mismatched transaction log state can cause a shard to be flagged as
out of sync during a restart. Also, play around with the recovery settings
[1]. Try setting gateway.recover_after_nodes (not set by default).

[1]
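
A rough sketch of what I mean, assuming a node reachable on localhost:9200
(the exact threshold is up to you). Flush before shutting the cluster down:

curl -XPOST 'http://localhost:9200/_flush'

and in elasticsearch.yml on each node, hold off recovery until most of the
cluster has joined:

gateway.recover_after_nodes: 4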

Cheers,

Ivan


Hi Ivan,
Thx.

Yes, I have been doing a flush before every cluster shutdown now.
Running ES 1.0 RC1

I have been doing rolling restarts because I have been unable to start all
nodes at nearly the same time and get them all to join, even after extending
the timeout as I described. But I'm speculating that the rolling restart is
contributing to the shards being re-allocated, because nodes that hold shards
for an index may not appear soon enough.

Maybe the entry I made in elasticsearch.yml, exactly as I described it, isn't
correct? I derived it from an ES source that described sending the command
using curl, but I thought it better to enter it directly in elasticsearch.yml.
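
(In case it helps anyone reproducing this, the settings a node actually
picked up can be inspected over the API; a rough sketch against
localhost:9200:)

# settings each node read from elasticsearch.yml
curl 'http://localhost:9200/_nodes/settings?pretty'

# transient/persistent settings applied through the cluster settings API
curl 'http://localhost:9200/_cluster/settings?pretty'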

I'll take a look at your link, thx.

Tony


I've verified that shards are re-allocating after a cluster restart (again,
I'm using 1.0 RC1).
To test this specifically, I loaded a small dataset (verifying results on a
large dataset can take a very long time).

Easy to verify:

  1. In a 5-node cluster, load some apache data. (I loaded only a couple
     dozen days.)
  2. Let the cluster run until all shards are allocated; es-head is good for
     this.
  3. Flush and shut down the cluster.
  4. Bring up only one node and point es-head at it; it should display all 5
     shards for each index residing on the lone active node.
  5. Bring up one additional node, then a third, refreshing es-head every 15
     seconds or so. Shards are first observed replicating to the second node;
     then, when the third node is active, the shards are re-allocated again
     for balancing.
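
(The same observations can be made without es-head; roughly, against any node
on localhost:9200, using the 1.0 _cat and shutdown APIs:)

# flush, then shut the whole cluster down
curl -XPOST 'http://localhost:9200/_flush'
curl -XPOST 'http://localhost:9200/_shutdown'

# after bringing nodes back one at a time, watch where the shards land
curl 'http://localhost:9200/_cat/shards?v'
curl 'http://localhost:9200/_cluster/health?pretty'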

So, either the entry I made in elasticsearch.yml to disable shard allocation
is incorrect, or there is likely a bug. (Or I might fundamentally
misunderstand what disabling shard allocation is supposed to do.)

Maybe I'll re-test on a 0.90 cluster to see if it behaves differently...

Tony


Tony,

Not sure what the cause of your problem is, but you might also want to check
out this setting in the YML file:

gateway.recover_after_nodes

More details about this particular setting are in this video:

http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/
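
A rough sketch of how that family of gateway settings might look in
elasticsearch.yml for a 5-node cluster (the numbers are only illustrative):

gateway.recover_after_nodes: 4   # don't start recovery until at least 4 nodes have joined
gateway.expected_nodes: 5        # start immediately once all 5 expected nodes are present
gateway.recover_after_time: 5m   # otherwise wait this long once recover_after_nodes is met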


Update:
Whereas my previous attempts to optimize for recovery failed miserably, the
"gateway.recover_after_nodes" setting in elasticsearch.yml worked... to a
point.

I noticed:

  • No ES node was responsive at all after the nodes were brought online
    until the quorum was met.
  • It can take a long time for the ES cluster to agree on a quorum; on my
    tiny 5-node cluster it took approx 10 minutes after the nodes were
    brought online before one started responding to es-head. I had poked all
    the nodes by that point, so it does seem like the cluster starts up all
    at once.
  • But, at least in this early case, shard re-allocation and thrashing is
    not avoided. Before shutting down I didn't carefully record the shard
    mapping across nodes, but I did notice that once indexing settled down,
    most indexes had the expected 10 shards evenly distributed across the
    nodes (2 per node, because every primary shard has a replica). On
    restart, I observed high concentrations of shards on certain nodes and
    fewer on others, not an even distribution.
  • For approx 9 GB of indexed data (800 MB raw data), it has taken a little
    over 40 minutes for the cluster to recover to the "green" state.

So, mixed and somewhat disappointing results. Since shard re-allocation still
seems to happen, although perhaps less with gateway.recover_after_nodes
enabled and configured, I'm still hoping for something that decreases
recovery time further.
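
(For what it's worth, recovery progress can also be watched from the API; a
rough sketch, and I believe the _cat endpoints are new in 1.0:)

# block until the cluster reports green, or the timeout expires
curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty'

# per-shard recovery progress
curl 'http://localhost:9200/_cat/recovery?v'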

Perhaps recovery isn't being done as efficiently as it might.

  1. My impression is that shard content is being evaluated in its full form.
     If so, I imagine shard content and its integrity could be evaluated far
     faster and better by hash.
  2. If hashes are used, I would suggest they be saved as part of the "flush"
     command, or as part of a separate "flush, snapshot and shutdown ES"
     command. When a cluster restarts, the hash table could then be used to
     quickly "snapshot" the existing node and local on-disk data layout
     before commencing recovery and moving shards around.
  3. Speaking of which, maybe at some point it would be useful to document
     what ES does during startup and/or recovery so that we can tinker more
     intelligently.

Thx,
Tony


Tony,

What you are seeing with the shard recovery is normal - but that doesn't mean
it couldn't be improved in the future. For now you can throttle the recovery
using a combination of settings (but you cannot 100% avoid it).
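
A sketch of the kind of settings I mean (values here are only illustrative;
they can go in elasticsearch.yml or be applied through the cluster settings
API):

cluster.routing.allocation.node_concurrent_recoveries: 2    # parallel recoveries per node
cluster.routing.allocation.cluster_concurrent_rebalance: 1  # shards allowed to rebalance at once
indices.recovery.max_bytes_per_sec: 40mb                    # throttle recovery bandwidth
indices.recovery.concurrent_streams: 3                      # streams used per recovery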

Just FYI, there is a reason hashing cannot be done (for now), and it is
discussed in this thread (look for where Zachary describes the segment
divergence scenario to understand more):

https://groups.google.com/forum/#!topic/elasticsearch/9uF-a5vqfkQ


Cool.
Thx all.

Tony