Slow startup (replica recovery in logs)

Ryan_S · October 31, 2013, 6:50pm

We've seen extremely slow initialization and assignment of replica
shards during startup. I can shut down the cluster cleanly (from a green
state) and then start it back up a few minutes later, and it might take
16-24 hours to reach a "green" status, with the logs saying replica recovery
is happening. If the cluster was shut down cleanly and started 10 minutes
later, what recovery needs to occur? Second, is there anything we can tune
to speed this up? I have a similar concern about failover: shard
relocation seems to happen at a snail's pace. Our servers can write 4GB+/sec
to the storage, but we are writing data much slower than that. Each data
node is hosting about 8TB of data.

A little background about our cluster:

8 nodes
6 data nodes
1 master
1 query node

All are 16-core boxes with 96GB RAM and 8TB of FusionIO SSD, and everything
is connected via InfiniBand.

While this is occurring, our insert rates are degraded by at least 10-15%,
which is a big deal for us.

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ryan_S · October 31, 2013, 6:51pm

Sorry, version 0.90.3

Ivan · November 1, 2013, 6:37am

A few comments:

You should always execute a flush before shutting down any nodes. This
action will clear the transaction logs and commit all operations to
segments.

If you are doing rolling restarts, consider disabling allocation.

Elasticsearch 0.90+ will throttle shard recovery so that it does not
consume all of the IO bandwidth. The defaults are pretty low; the recovery
settings are described in the Elasticsearch reference documentation.

Elasticsearch will only recover 2 shards at a time by default. If you
have a heavily sharded environment, you might want to increase this value.

The last two changes will heavily affect IO performance. Increase the
values without overwhelming your system. Much of it will depend on your
system. SSDs, platters or virtualized environments with shared storage?
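
The flush and disable-allocation steps above can be sketched as a short list of REST requests. This is only an illustration, not a tested procedure: the setting name `cluster.routing.allocation.disable_allocation` is assumed to be the 0.90.x allocation switch (later releases use a different one), and the helper names `pre_restart_requests` and `post_restart_requests` are made up for this example.

```python
import json

# Assumed name of the allocation switch in the 0.90.x line. Verify it against
# the reference documentation for the version you actually run.
DISABLE_ALLOCATION = "cluster.routing.allocation.disable_allocation"


def _allocation_request(disabled: bool) -> tuple:
    """Build a transient cluster-settings update that toggles allocation."""
    body = {"transient": {DISABLE_ALLOCATION: disabled}}
    return ("PUT", "/_cluster/settings", json.dumps(body))


def pre_restart_requests() -> list:
    """Requests to send before stopping nodes: stop allocation, then flush."""
    return [
        _allocation_request(True),
        ("POST", "/_flush", None),  # commit operations held in the translog
    ]


def post_restart_requests() -> list:
    """Requests to send once the restarted nodes have rejoined the cluster."""
    return [_allocation_request(False)]


if __name__ == "__main__":
    for method, path, body in pre_restart_requests() + post_restart_requests():
        print(method, path, body or "")
```

The helpers only build the requests, so the sequence can be reviewed or logged before anything is sent to a live cluster.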

Cheers,

Ivan

Ryan_S · November 1, 2013, 5:27pm

Ivan,

Thank you for the comments. I was unaware we needed to do a flush before
shutdown. The defaults do look pretty low, so I will tinker with those.

One more question. In the event of a node failure, when the standby shards
are activated, are those shards then replicated somewhere else so that they
have a standby? If so, is this configurable? With the amount of data each
node is managing (locally), I think we would like to avoid this.

Thanks again.

Ryan_S · November 5, 2013, 6:15pm

I still need some additional help here. After changing
max_bytes_per_sec to 2GB and concurrent_streams to 12, we are still moving
extremely slowly. We are merging at 0.4MB/sec. Just closing an index and
opening it (application down) will put the replicas into recovery, and they
take days to initialize and get assigned. Any ideas? Thank you.
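
To put the two rates mentioned above side by side, here is a rough back-of-the-envelope calculation. It assumes, purely for illustration, that a full node's worth of data (about 8TB) has to be copied, that 8TB and 0.4MB use decimal units, and that "2GB" means 2 * 2**30 bytes per second. The function name `copy_time_seconds` is made up for this example.

```python
def copy_time_seconds(total_bytes: float, bytes_per_sec: float) -> float:
    """Time needed to move total_bytes at a constant rate."""
    if bytes_per_sec <= 0:
        raise ValueError("bytes_per_sec must be positive")
    return total_bytes / bytes_per_sec


NODE_DATA_BYTES = 8 * 10**12   # "about 8TB" per data node (decimal, assumed)
OBSERVED_RATE = 0.4 * 10**6    # the 0.4MB/sec reported in the thread
CONFIGURED_LIMIT = 2 * 2**30   # "2GB" read as 2 * 2**30 bytes/sec (assumed)

if __name__ == "__main__":
    days = copy_time_seconds(NODE_DATA_BYTES, OBSERVED_RATE) / 86400
    hours = copy_time_seconds(NODE_DATA_BYTES, CONFIGURED_LIMIT) / 3600
    print(f"at the observed rate:    {days:.0f} days")
    print(f"at the configured limit: {hours:.2f} hours")
```

Under those assumptions the observed rate works out to months for one node's data, while the configured limit would allow it in roughly an hour, which suggests the byte-rate limit is not the setting holding recovery back.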

polyfractal · November 5, 2013, 7:10pm

Make sure your OS is configured to use an appropriate I/O scheduler for
SSDs. The Noop or Deadline scheduler will perform much faster than the
default CFQ (Completely Fair Queuing), on the order of 300-500x!

Check out this presentation by Drew Raines for more details:
https://speakerdeck.com/drewr/life-after-ec2
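
One way to check which scheduler a device is using on Linux is to read its sysfs entry, where the active scheduler is shown in square brackets (for example `noop deadline [cfq]`). The sketch below assumes that format; the device name `sda` is a placeholder, and vendor-specific devices may not expose this file at all.

```python
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked as active in a sysfs scheduler line."""
    tokens = sysfs_text.split()
    for token in tokens:
        # The kernel wraps the active scheduler in brackets: "noop [deadline] cfq"
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    if len(tokens) == 1:
        # Some devices list a single scheduler without brackets.
        return tokens[0]
    raise ValueError(f"no active scheduler found in {sysfs_text!r}")


def read_scheduler(device: str) -> str:
    """Read the active scheduler for a block device such as 'sda'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    # "sda" is a hypothetical device name; substitute your own.
    try:
        print(read_scheduler("sda"))
    except OSError as exc:
        print(f"could not read scheduler: {exc}")
```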

Do you have metrics about the utilization of your disk/network vs the merge
rate?

Semi-related, you may want to set gateway.recover_after_nodes
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html#recover-after)
to help speed up full cluster restarts. This will prevent allocation from
happening until n nodes are in the cluster, which can prevent unneeded
allocation thrashing while nodes reboot. It is only useful for full cluster
restarts, however.

> One more question. In the event of a node failure, when the standby shards
> are activated, are those shards then replicated somewhere else so that they
> have a standby? If so, is this configurable? With the amount of data each
> node is managing (locally), I think we would like to avoid this.

Yep, if a primary shard disappears from the cluster (machine catches on
fire, etc), then one of the replicas will be promoted to primary.
Elasticsearch will then recognize that it is missing one of its replicas
and begin allocating/copying a replica somewhere else.

You can control this with various allocation awareness settings
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html),
depending on how you want your cluster to behave when nodes disappear.

-Zach

Ryan_S · November 5, 2013, 7:30pm

Thanks for the response. We're only having performance issues with
replication/recovery. When the cluster is green our system flies (inserting
2-3 TB per 24 hours). But something is either throttled or just flat out
stuck when re-initializing/assigning replica shards. I'm assuming throttled
because I see RateLimiter$SimpleRateLimiter.pause being called for a
thread showing [recovery_stream]. I'd like to turn this off completely if
possible.

If I turn off replication on the cluster, let all the replicas drop and
the space free up, and then turn it back on (quiet system otherwise), almost
nothing happens. We start merging at 0.4MB/sec, CPU is running 99% idle and
iostat shows almost zero usage.

Looking at the _cluster/state I have
recovery.concurrent_streams: 8
recovery.Max_bytes_per_sec: 2147483648
recovery.translog_ops: 500000
recovery.file_chunk_size: 1048576
store.throttle_type: none

Is there something else I need to set?

PS. Yes, we are using the Deadline Scheduler.

Ryan_S · November 5, 2013, 8:34pm

Looks like I needed node.concurrent_recoveries bumped up from 2 to get
things really going. That presentation was helpful, thanks Zach.
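
For anyone landing here later, the combination of settings discussed in this thread could be applied through the cluster settings API roughly as follows. This is a sketch under assumptions: the fully qualified setting names are what I believe the 0.90.x line uses, `2gb` and `12` are the values mentioned earlier in the thread, the value `8` for concurrent recoveries is a placeholder (the actual number used was not stated), and `http://localhost:9200` is a hypothetical address.

```python
import json
import urllib.request


def recovery_settings(max_bytes_per_sec: str, concurrent_streams: int,
                      node_concurrent_recoveries: int) -> dict:
    """Build a transient cluster-settings body for recovery tuning."""
    if concurrent_streams < 1 or node_concurrent_recoveries < 1:
        raise ValueError("stream and recovery counts must be at least 1")
    return {
        "transient": {
            # Setting names are case sensitive and assumed for 0.90.x.
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            "indices.recovery.concurrent_streams": concurrent_streams,
            "cluster.routing.allocation.node_concurrent_recoveries":
                node_concurrent_recoveries,
        }
    }


def apply_settings(base_url: str, body: dict) -> bytes:
    """PUT the settings body to a running cluster (not called in the demo)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    body = recovery_settings("2gb", 12, 8)  # 8 is a placeholder value
    print(json.dumps(body, indent=2))
    # apply_settings("http://localhost:9200", body)  # hypothetical address
```

Raising these values trades indexing and search headroom for recovery speed, so it is worth increasing them in steps while watching disk and network utilization.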

polyfractal · November 5, 2013, 8:55pm

Beat me to the punch. Yeah, concurrent_recoveries should be higher to
fully utilize your network and disks in this situation.

Two other notes: index and cluster settings are case sensitive, and you have
an uppercase letter in "max_bytes_per_sec". I'm not sure if that was a typo
in the email or in your settings, so it would be good to check. Also, it is
"indices.store.throttle.type", not "throttle_type" (note the period instead
of the underscore).
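
A quick way to catch this kind of mistake is to compare the names you plan to send against a list of names you expect, case sensitively. The sketch below is only an illustration: the `KNOWN_SETTINGS` list is hand-written from the names mentioned in this thread (with their full prefixes assumed), not pulled from Elasticsearch, and `check_setting_names` is a made-up helper.

```python
import difflib

# Hand-maintained list of expected names (assumed full names for 0.90.x).
KNOWN_SETTINGS = (
    "indices.recovery.max_bytes_per_sec",
    "indices.recovery.concurrent_streams",
    "indices.store.throttle.type",
    "cluster.routing.allocation.node_concurrent_recoveries",
)


def check_setting_names(names):
    """Map each unrecognized name to a suggested known name, or None."""
    by_lower = {known.lower(): known for known in KNOWN_SETTINGS}
    suggestions = {}
    for name in names:
        if name in KNOWN_SETTINGS:
            continue  # exact, case-sensitive match
        if name.lower() in by_lower:
            # Differs only by letter case, e.g. "Max_bytes_per_sec".
            suggestions[name] = by_lower[name.lower()]
            continue
        # Otherwise look for a near miss such as "_" written instead of ".".
        close = difflib.get_close_matches(name, KNOWN_SETTINGS, n=1, cutoff=0.8)
        suggestions[name] = close[0] if close else None
    return suggestions


if __name__ == "__main__":
    candidates = [
        "indices.recovery.Max_bytes_per_sec",
        "indices.store.throttle_type",
        "indices.recovery.concurrent_streams",
    ]
    for bad, good in check_setting_names(candidates).items():
        print(f"{bad} -> {good}")
```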

-Zach

On Tuesday, November 5, 2013 3:34:52 PM UTC-5, Ryan S wrote:

Looks like I needed node.concurrent_recoveries bumped up from 2 to get
things really going. That presentation was helpful, thanks Zach.

On Tuesday, November 5, 2013 2:30:10 PM UTC-5, Ryan S wrote:

Thanks for the response. We're only having performance issues with
replication/recovery. When the cluster is green our system flies(inserting
2-3 TB per 24 hours). But something is either throttled or just flat out
stuck when re-initializing/assigning replica shards. I'm assuming throttled
because I see the RateLimiter$SImpleRateLimiter.pause being called for a
thread showing [recovery_stream]. I'd like to turn this off completely if
possible.

If I turn off replication on the cluster, let all the replicas drop and
space free, and then turn it back on (quiet system otherwise)......almost
nothing happens. We start merging at 0.4MB, CPU is running 99% idle and
iostat shows almost zero usage.

Looking at the _cluster/state I have
recovery.concurrent_streams: 8
recovery.Max_bytes_per_sec: 2147483648
recovery.translog_ops: 500000
recovery.file_chunk_size: 1048576
store.throttle_type: none

Is there something else I need to set?

PS. Yes, we are using the Deadline Scheduler.

On Tuesday, November 5, 2013 2:10:21 PM UTC-5, Zachary Tong wrote:

Make sure your OS is configured to us appropriate scheduling for SSDs.
The Noop or Deadline scheduler will perform much faster than the
default CFQ (completely fair queuing), on the order of 300-500x!

Checkout this presentation by Drew Raines for more details:
https://speakerdeck.com/drewr/life-after-ec2

Do you have metrics about the utilization of your disk/network vs the
merge rate?

Semi-related, you may want to set gateway.recover_after_nodes
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html#recover-after)
to help speed up full cluster restarts. This will prevent allocation from
happening until n nodes are in the cluster, which can prevent unneeded
allocation thrashing while nodes reboot. Only useful for full cluster
restarts however.

One more question. In the event of node failure, and the standby shards
are activated, are those shards then replicated somewhere else, so they have
a standby? If so, is this configurable? With the amount of data each node
is managing (locally), I think we would like to avoid this.

Yep, if a primary shard disappears from the cluster (machine catches on
fire, etc), then one of the replicas will be promoted to primary.
Elasticsearch will then recognize that it is missing one of its replicas
and begin allocating/copying a replica somewhere else.

You can control this with various allocation awareness settings
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html),
depending on how you want your cluster to behave when nodes disappear.

-Zach

On Tuesday, November 5, 2013 1:15:12 PM UTC-5, Ryan S wrote:

I still need some additional help here. After changing the
max_bytes_per_sec to 2GB and concurrent_streams to 12 we are still moving
extremely slowly. We are merging at 0.4MB/sec. Just closing an index and
opening it (application down) will put the replicas into recovery... and
take days to initialize and get assigned. Any ideas? Thank you.
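
To put the numbers in this message in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the worst case, that a node's full 8TB has to be copied at a constant rate, which is not necessarily what recovery does in practice (it may reuse files already on disk). The function name is hypothetical.

```python
def recovery_hours(data_bytes, bytes_per_sec):
    """Hours needed to copy data_bytes at a constant bytes_per_sec."""
    return data_bytes / float(bytes_per_sec) / 3600.0


TB = 1024 ** 4
MB = 1024 ** 2

# Observed rate from the thread versus the configured 2GB/sec limit.
slow = recovery_hours(8 * TB, 0.4 * MB)
fast = recovery_hours(8 * TB, 2048 * MB)
print("at 0.4MB/sec: %.0f hours" % slow)
print("at 2GB/sec:   %.1f hours" % fast)
```

The gap between the two figures is several orders of magnitude, which suggests the configured byte limit was not the bottleneck at that point.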

On Friday, November 1, 2013 1:27:42 PM UTC-4, Ryan S wrote:

Ivan,

Thank you for the comments. I was unaware we needed to do a flush
before shutdown. The defaults do look pretty low, so I will tinker with
those.

One more question. In the event of node failure, and the standby
shards are activated, are those shards then replicated somewhere else, so
they have a standby? If so, is this configurable? With the amount of data
each node is managing (locally), I think we would like to avoid this.

Thanks again.

On Friday, November 1, 2013 2:37:28 AM UTC-4, Ivan Brusic wrote:

A few comments:

You should always execute a flush before shutting down any nodes.
This action will clear the transaction logs and commit all operations to
segments.

If you are doing rolling restarts, consider disabling allocation.

Elasticsearch 0.90+ will throttle shard recovery in order not to
consume IO bandwidth. The defaults are pretty low. More info is here:
Elasticsearch Platform — Find real-time answers at scale | Elastic

Elasticsearch will only recover 2 shards at a time by default. If
you have a heavily sharded environment, you might want to increase this
value.

The last two changes will heavily affect IO performance. Increase the
values without overwhelming your system. Much of it will depend on your
system. SSDs, platters or virtualized environments with shared storage?
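
The first two suggestions above (flush, then disable allocation around a restart) could be scripted roughly as in the Python sketch below. It only builds the list of requests and does not send them. The setting name "cluster.routing.allocation.disable_allocation" is my assumption for the 0.90 line and is not confirmed anywhere in this thread; the function name is hypothetical.

```python
import json

# Assumed 0.90-era setting name; check the docs for your version.
DISABLE_KEY = "cluster.routing.allocation.disable_allocation"


def rolling_restart_steps():
    """Return (method, path, body) tuples for one node restart cycle."""
    disable = json.dumps({"transient": {DISABLE_KEY: True}})
    enable = json.dumps({"transient": {DISABLE_KEY: False}})
    return [
        ("POST", "/_flush", None),                # commit translog to segments
        ("PUT", "/_cluster/settings", disable),   # stop shard reallocation
        # ... restart the node here and wait for it to rejoin ...
        ("PUT", "/_cluster/settings", enable),    # allow allocation again
    ]


steps = rolling_restart_steps()
for method, path, body in steps:
    print(method, path, body or "")
```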

Cheers,

Ivan

On Thu, Oct 31, 2013 at 11:51 AM, Ryan S ryan.s...@gmail.com wrote:

Sorry, version 90.3

On Thursday, October 31, 2013 2:50:03 PM UTC-4, Ryan S wrote:

We've seen extremely slow startup/initialization/assignment of
replica shards during startup. I can shut down the cluster cleanly (from a
green state), and then start it back up a few minutes later. It might
take 16-24 hours to reach a "green" status with the logs saying replica
recovery is happening. If the cluster was shut down cleanly and started 10
minutes later, what recovery needs to occur? Second, is there anything we
can tune to speed this up? I have a similar concern on failover, it seems
the shard relocation happens at a snail's pace. Our servers can write
4GB+/sec to the storage, but we are writing data much slower than that.
Each data node is hosting about 8TB of data.

A little background about our cluster:

8 nodes
6 data nodes
1 master
1 query node

All are 16 core boxes, with 96GB Ram, 8TB of FusionIO SSD and
everything is connected via Infiniband.

When this is occurring our insert rates run at a degraded
performance(at least 10-15%) which is a big deal for us.

Thanks.

polyfractal · November 5, 2013, 9:02pm

Oops, wanted to mention one more thing: if you have configured settings
using the API, they take precedence over the "hard-coded" values in the
elasticsearch.yml file.

Doesn't sound like you are using the config file anyway, but wanted to
mention it just in case because it can lead to confusing results.

-Zach
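
A minimal Python sketch of the precedence described above: values set through the API override what is in elasticsearch.yml. The further split into persistent and transient API settings, with transient winning, is my addition and not something stated in this thread. All names and values here are hypothetical.

```python
def effective_settings(yml, persistent, transient):
    """Merge settings so that later sources override earlier ones."""
    merged = dict(yml)          # lowest priority: elasticsearch.yml
    merged.update(persistent)   # API, survives full cluster restart
    merged.update(transient)    # API, highest priority
    return merged


result = effective_settings(
    {"indices.recovery.concurrent_streams": 3},
    {"indices.recovery.concurrent_streams": 8},
    {},
)
print(result)
```

If a value in the config file seems to be ignored, checking the settings stored in the cluster state is a reasonable first step.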

Ivan · November 5, 2013, 10:07pm

Sorry, I should have specified the exact setting when I said "Elasticsearch
will only recover 2 shards at a time by default." Glad you found the
setting and that your performance is improving.

Cheers,

Ivan
