Preventing ES Shard Dance


(Otis Gospodnetić) #1

Hello,

We've been doing a lot of ElasticSearch performance testing lately. While
testing, we've experienced the "ES shard dance" shown in the attachment
whenever we restarted any of the nodes. This, of course, made testing hard
because we couldn't keep a fixed shard distribution between restarts and
between some of the test runs, plus it slowed us down (you can see this
shard dance took over 1 hour).

Is it possible to start ElasticSearch and tell it not to move any shards
around even if it thinks there is a better way to distribute them?

Thanks,
Otis

Hiring ElasticSearch Engineers World-Wide --
http://sematext.com/about/jobs.html#search


(dobe) #2

Hi Otis,

This is described here:
https://github.com/elasticsearch/elasticsearch/issues/1358
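
For readers following along, here is a minimal sketch of how the allocation switches from that issue might be toggled through the cluster update-settings API before restarting a node. It assumes a node reachable at http://localhost:9200 and the setting names `cluster.routing.allocation.disable_allocation` and `cluster.routing.allocation.disable_replica_allocation`; check both against the issue and your ES version. The helper names `build_settings_request` and `apply_settings` are hypothetical.

```python
import json
import urllib.request

# Assumed setting names for the allocation switches; verify for your ES version.
DISABLE_ALL = "cluster.routing.allocation.disable_allocation"
DISABLE_REPLICAS = "cluster.routing.allocation.disable_replica_allocation"


def build_settings_request(disable, replicas_only=False, host="http://localhost:9200"):
    """Build (but do not send) a PUT /_cluster/settings request.

    The setting is applied as "transient", so it does not survive a
    full cluster restart.
    """
    key = DISABLE_REPLICAS if replicas_only else DISABLE_ALL
    body = {"transient": {key: bool(disable)}}
    return urllib.request.Request(
        url=host + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_settings(request, timeout=10):
    """Send the request to the cluster and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A typical sequence would be `apply_settings(build_settings_request(True))` before the restart and `apply_settings(build_settings_request(False))` once the node has rejoined. Building and sending are kept separate so the request can be inspected without a running cluster.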


(Mark Huang) #3

Disabling replica allocation will avoid rebalancing on cluster shutdown, as dobe points out, but not on cluster restart if the nodes don't all come up around the same time. To give all your nodes time to initialize before recovery/rebalancing is performed, set the gateway parameters as described at http://www.elasticsearch.org/guide/reference/modules/gateway/index.html

For example, in a cluster with 3 nodes and 1 replica per shard, you might set the parameters to:

gateway:
  recover_after_nodes: 2
  recover_after_time: 5m
  expected_nodes: 3

This would give the 3rd node up to 5 minutes to finish initializing before the first 2 nodes give up on it and start recovering and rebalancing without it.
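
The way these three parameters interact can be sketched as a small decision function. This is a simplified model based on a reading of the gateway documentation, not the actual ES implementation: recovery starts at once when `expected_nodes` have joined, never starts below `recover_after_nodes`, and otherwise waits `recover_after_time` counted from the moment the `recover_after_nodes` threshold was met. The names `GatewaySettings` and `should_start_recovery` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewaySettings:
    recover_after_nodes: int
    recover_after_seconds: float  # recover_after_time, expressed in seconds
    expected_nodes: int


def should_start_recovery(settings, nodes_joined, seconds_since_threshold):
    """Return True if recovery may start under this simplified model.

    seconds_since_threshold is the time elapsed since recover_after_nodes
    was first met; it is ignored while that threshold is not met.
    """
    if nodes_joined >= settings.expected_nodes:
        return True   # every expected node is present: no reason to wait
    if nodes_joined < settings.recover_after_nodes:
        return False  # too few nodes: hold off regardless of elapsed time
    # Enough nodes to proceed, but still waiting for stragglers.
    return seconds_since_threshold >= settings.recover_after_seconds


# The 3-node example above: 2 nodes minimum, 5 minutes grace, 3 nodes expected.
EXAMPLE = GatewaySettings(recover_after_nodes=2, recover_after_seconds=5 * 60, expected_nodes=3)
```

With `EXAMPLE`, two nodes after one minute still wait, two nodes after five minutes proceed, and three nodes proceed immediately.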

--Mark


(Mark Waddle) #4

Hi Otis,

What tooling are you using to gather and chart those metrics?

Mark


(Otis Gospodnetić) #5

Hi Mark,

That graph came from SPM for ElasticSearch. It's like SPM for Solr
(http://sematext.com/spm/solr-performance-monitoring/index.html), but with
ES metrics. It's not 100% polished, but that's happening as I type. It's
currently free and you can get it via http://apps.sematext.com/ (you can
also get free Search Analytics from there).

Otis


(Paul Brown) #6

It's not where Otis's graphs are coming from, but we get similar graphs out of OpenTSDB/tcollector attached to Elasticsearch. (We use OpenTSDB/tcollector with a simple graphite adapter and Coda Hale's metrics to gather metrics from other systems as well.)

-- Paul


(Shay Banon) #7

++OpenTSDB

+Graphite

