Restarting an active node without needing to recover all data remotely

Hi all,

My understanding was that when recovering shards after a node restart, only
those documents that have changed in the meantime should need to be synced
from other nodes. While restarting a single node with about 600 GB of data,
however, I am seeing all of the shards getting pulled in full from other
nodes. For instance, in my log I see:

[global-0-10m-15m][3] recovery completed from
[es3.xyz][XLPPfw8yTE2GRs-TWAbkkw][inet[/76.74.248.158:9300]]{dc=sat, parity=1, master=false}, took [12.6m]
   phase1: recovered_files [399] with total_size of [42.2gb], took [12.4m], throttling_wait [0s]
         : reusing_files [0] with total_size of [0b]
   phase2: start took [642ms]
         : recovered [3459] transaction log operations, took [10.2s]
   phase3: recovered [401] transaction log operations, took [226ms]

Config:

  • ES 0.90.2
  • fairly beefy machines: SSD, 96GB RAM, 1 Gbit links.
  • 100s of index ops per second. I'm not disabling indexing, but the clients
    stop sending to the restarting node once it can no longer be accessed.
  • index templates for creating the indices (a sketch of one follows this list)
  • refresh interval: 60s
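
A minimal sketch of such a template, with a hypothetical template name and index pattern, and assuming the 60s refresh interval is set there (adjust the host and pattern for your setup):

curl -XPUT "http://localhost:9200/_template/global_template" -d '{
  "template" : "global-*",
  "settings" : {
    "index.refresh_interval" : "60s"
  }
}'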

Node restart procedure:
curl -XPUT "http://:9200$s/_cluster/settings" -d '{ "persistent" : {
"cluster.routing.allocation.disable_allocation" : true } }'
curl -XPUT "http://:9200$s/_cluster/settings" -d '{ "persistent" : {
"indices.recovery.max_bytes_per_sec" : "80mb" } }'
curl -XPOST "http://:9200$s/_cluster/nodes/_local/_shutdown"
while [[ curl --write-out %{http_code} --silent --output /dev/null "http://${s}:9200" != 200 ]]
do
sleep 1s
done
curl -XPUT "http://:9200$s/_cluster/settings" -d '{ "persistent" : {
"cluster.routing.allocation.disable_allocation" : false } }'
while [[ curl --silent -XGET "http://${s}:9200/_cluster/health" !=
"green" ]]
do
sleep 1s
done

Is there an operation I need to perform on the node to ensure that it shuts
down in a state where the shards will only need to sync the transaction
logs from the other nodes, rather than the whole shard?

Thanks
-Greg

Recovery takes place on index level, not on shard level.

One method to minimize index recovery would be to close all indexes a node
holds a shard for, and reopen them after the nodes you want to restart are
up again.

Note also that if clients keep sending data to the other nodes while a node
is shut down, the restarting node gains nothing, since that data is
distributed across the shards on the other nodes and therefore modifies the
index. Stopping all clients (disabling indexing) and flushing the indexes
before shutdown would help more.
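
For example, a flush of all indexes just before stopping a node could look like this (a sketch; adjust the host for your setup):

curl -XPOST "http://localhost:9200/_flush"

The flush performs a Lucene commit and clears the translogs, so there is less to replay when the node comes back.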

Jörg

Thanks for the answers Jörg.

Recovery takes place on index level, not on shard level.

Hmmm, but each node recovers node_concurrent_recoveries shards at a time, right?
Do you mean that the transaction log is per index rather than per shard?

One method to minimize index recovery would be to close all indexes a node
holds a shard for, and reopen them after the nodes you want to restart are
up again.

Note also that if clients keep sending data to the other nodes while a node
is shut down, the restarting node gains nothing, since that data is
distributed across the shards on the other nodes and therefore modifies the
index. Stopping all clients (disabling indexing) and flushing the indexes
before shutdown would help more.

Yeah, I'd like to avoid closing the indices if possible, since there are a lot
of index ops going on, and it seems like having 1 of 16 servers down
shouldn't require stopping all index ops.

Are you saying that if an index is changed at all while a node is down,
then all the shards for that index on the down node must be completely
synced from another node? The local data doesn't get used at all?

Could disabling merges help?

In the log line above, 0 bytes of local files were reused in recovering the
shard. The ongoing index ops should have affected far, far less than 1% of
the docs in that shard. It feels like I must be doing something wrong for
the node not to reuse any of the local data.

-Greg

Each node can hold many shards, so when a node comes up, the additional
recovery load can reduce the node's regular performance; that is why the
recovery speed is throttled by node_concurrent_recoveries.
The translog is per shard and records the Lucene operations as long as they
have not been flushed to the gateway. To regain consistency of an index,
the translogs are replayed on all shards of that index.
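
That throttle can be adjusted dynamically if the default doesn't suit your hardware, for example (a sketch; adjust the host and value):

curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient" : {
    "cluster.routing.allocation.node_concurrent_recoveries" : 4
  }
}'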

To decommission a node, you could exclude it from shard allocation with a
cluster setting

http://www.elasticsearch.org/guide/reference/index-modules/allocation/

for example, by IP address, with the setting
"cluster.routing.allocation.exclude._ip"

After the cluster setting update command is executed, the ES cluster will
start to move all the shards away from the excluded node. Wait for
completion and then you can shut down the node.

Jörg

Greg, we are facing the same issue: recovery is taking a long time for a large
index even though very little data has changed. Were you able to find any
improvements for this?

One way to speed up the recovery is the indices.recovery.max_bytes_per_sec
setting, but that doesn't solve the root issue.

The issue with slow restarts boils down to segment creation. The segment
creation and merging process is not deterministic between nodes. Even
though two shards have identical data, their segments may have diverged
dramatically. Let's look at a quick example:

You have a primary and replica shard, both empty. You start indexing data
and their segments look like this:

Primary: [doc1, doc2, doc3] [doc4, doc5]
Replica: [doc1, doc2, doc3] [doc4, doc5]

Right now, everything is identical. But indexing continues and the merging
process kicks off in the primary before the replica (for whatever reason).

Primary: [doc1, doc2, doc3, doc4, doc5, doc6]
Replica: [doc1, doc2, doc3] [doc4, doc5, doc6]

The primary now has a single segment, while the Replica has two. The documents
are identical, but the underlying segments are not. If you shut the
Replica down right now, its only option is to do a full recovery from the
Primary:

Primary: [doc1, doc2, doc3, doc4, doc5, doc6]

Replica: [doc1, doc2, doc3, doc4, doc5, doc6]

We have to do a full recovery because that is the only way to guarantee
consistency (for example, checksums won't work because they were different
to begin with). However, if you were to restart the Replica machine again
without having indexed any documents, the replica would come back online
immediately because the checksums are identical, so no recovery process is
needed. As soon as you start indexing again, though, the segments will
likely diverge and the next restart will require some amount of recovery.

Segments are checksummed, so only new segments are recovered. However, the
merging process tends to change a lot of segments at once (ten segments
merged into a single new one) so this trashes a lot of checksums in the
process and can lead to longer recoveries than you might expect.
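
If you want to see this divergence on a live cluster, the segments API lists the segments each shard copy currently holds (a sketch; adjust the host and index name):

curl -XGET "http://localhost:9200/global-0-10m-15m/_segments?pretty"

Comparing the primary and replica copies of the same shard after some indexing will usually show different segment lists, even though the documents are the same.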

Does that help?
-Zach

Thanks Zach, that helps explain a lot.

Is there any optimization users can do to reduce the amount of data that is
recovered, e.g. optimize before restart, change the number of shards/segments,
or even something as drastic as disabling merging during the restart so
only the new segments are re-synced?
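
For reference, the optimize-before-restart idea would be something like this (a sketch; adjust the host and index name, and see the reply below on why it is likely more expensive than the recovery itself):

curl -XPOST "http://localhost:9200/global-0-10m-15m/_optimize?max_num_segments=1"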

On Thu, Sep 19, 2013 at 12:17 PM, Nakul ankush.jhalani@gmail.com wrote:

disable merging during index restart so only the new segments are re-synced

I've been thinking lately about some kind of protocol for orderly shutdown
of a node when it plans to come back up. It looks like turning off
allocation is pretty common. Turning off merges might be useful too.

I think it'd be cool if we could use the node shutdown api to put the
cluster in a state where it expects the node to come back soon. Maybe:

curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown?coming_back_in=120s'

So the cluster can disable allocation, maybe turn off merging, or whatever
else is a good idea, but turn them all back on when it sees the node again
or the timeout has passed. This would be pretty safe.

Any ideas?

Nik

Disabling allocation will definitely help in cases where shards are relocated;
however, I don't believe that is the cause of the delay discussed in this
post. Here the delay is caused by re-syncing shards that were never
relocated during the ES restart.

From what I understand, recovery time is roughly proportional to the index
size. Frankly, this would make ES not truly scalable.

Is there any optimization users can do to reduce the amount of data that is
recovered, e.g. optimize before restart, change the number of shards/segments,
or even something as drastic as disabling merging during the restart so
only the new segments are re-synced?

Unfortunately, there really isn't much that can be done right now.
Theoretically a forced merge down to a single segment would cause the
primary/replica to be identical, but I can almost guarantee that it will be
more expensive than simply recovering the shard. Merges are very expensive
in both CPU and disk I/O, while recovery is basically an rsync across your
network (e.g. would you rather merge-sort a terabyte of data off the disk,
or just read it sequentially and stream it over ethernet?).

Disabling merging or flushing won't help since segments can diverge fairly
quickly. You would effectively need to disable merging from the moment the
index was created, which is obviously not a good solution. =)

I've been thinking lately about some kind of protocol for orderly shutdown
of a node when it plans to come back up. It looks like turning off
allocation is pretty common. Turning off merges might be useful too.

Disabling allocation is good to prevent unnecessary shard swapping while
you take nodes down, but it won't prevent this segment recovery problem.

I was talking to some folks the other day and there are definitely things
we can do to make recovery more intelligent. For example, we can save
sequence IDs for the shards and their segments. As a simplistic example,
imagine incrementing a counter every time a shard performs a write
operation. When recovering shards, we can first check the shard-level
counter. If they are identical, you know the shards are identical even if
their underlying segments are physically different.

If the counters are different, then we could use more advanced checksumming
(hash trees) to determine what parts of the data need to be updated.

Because it's a distributed environment, the process would be a good deal
more complex than just a simple counter, but that's the idea. I definitely
agree that it is a potential candidate for improvement.

-Zach

Thanks for the many responses, they were very helpful.

For posterity, I wrote up a more detailed post of how we are managing
restart times for our
cluster: http://gibrown.wordpress.com/2013/12/05/managing-elasticsearch-cluster-restart-time/

-Greg

Just wanted to add a quick note: long recovery times (due to divergence of
shards between the primary and replica) are an issue that we will be addressing.
No ETA as of yet, but it is on the roadmap. :)

-Zach

That would be very nice to have, thanks for the update.
