Ok, why isn't my cluster rebalancing?

https://lh6.googleusercontent.com/-fC-ZnNFwv7M/UhajAZA2ovI/AAAAAAAAA0w/FhJ9g6D9TWc/s1600/Screen+Shot+2013-08-22+at+4.46.52+PM.png

Background: I've set the number of replicas to 3, and I have allocation
awareness set to match the awszone property, which I'm setting in each node's config.

There are 6 nodes total:

2 in us-west-2a
2 in us-west-2b
1 in us-east-2a
1 in us-east-2b

So in theory, I should be able to lose an entire side of the country. I
would also expect the nodes in us-west to balance the shards between
themselves, not just allocate a single shard to one machine.
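
For reference, here's roughly how this is wired up (a sketch, trimmed to the relevant lines; the yml line is the usual way to set a custom node attribute, and the zone value of course differs per node):

  # each node's elasticsearch.yml declares its zone:
  #   node.awszone: us-west-2a
  # and a persistent cluster setting tells the allocator to spread copies across it:
  curl -XPUT localhost:9200/_cluster/settings -d '{
    "persistent": {
      "cluster.routing.allocation.awareness.attributes": "awszone"
    }
  }'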

Any ideas?

Pierce


Can you provide more information about your setup, i.e. all the node
settings etc., the Elasticsearch version you are running, and any allocation
decider settings (curl -XGET localhost:9200/_cluster/settings)?
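
It would also be worth double-checking that every node actually reports the attribute; the nodes info output should list awszone under "attributes" (in 0.90 the endpoint is, if I remember right):

  curl -XGET 'localhost:9200/_cluster/nodes?pretty=true'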

simon


We're using 0.90.3, with a cluster we pretty much stood up just for this.

Here are the cluster settings (though this duplicates the config file):

  {
    "persistent": {
      "cluster.routing.allocation.awareness.attributes": "awszone"
    },
    "transient": {}
  }

Sample Node:

  1. "nCG5WzNRS6mRFF0ULbLYtA": {
  2. "name": "Nelson, Foggy",
  3. "transport_address": "inet[/10.98.14.42:9300]",
  4. "hostname": "reader-elastic03.prod2.cloud.cheggnet.com",
  5. "version": "0.90.3",
  6. "http_address": "inet[/10.98.14.42:9200]",
  7. "attributes": {
  8. "awszone": "us-west-2a"
  9. }
  10. },


Could you tell me a couple more things so I can try to reproduce this:

  • how many nodes you have, and their zone-related configs
  • the index settings you used to create this index?

I want to write a test that mocks your situation, so if you have any custom
settings please let me know. I'd also be curious whether all your shards are
in state "active" or whether something is relocating.
Is your cluster green?
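
Per-shard state shows up in the health output if you raise the level, something like this (from memory for 0.90, adjust as needed):

  curl -XGET 'localhost:9200/_cluster/health?level=shards&pretty=true'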

simon


Could you tell me a couple more things so I can try to reproduce this:

  • how many nodes you have, and their zone-related configs

6 nodes across 4 zones. Replica count of 3, so one copy of each shard lands somewhere in every zone:
2 nodes: west-a zone
2 nodes: west-b zone
1 node: east-a zone
1 node: east-b zone

From the original screenshot you can see it's all green.

East-a has 1 node, 20 shards
East-b has 1 node, 20 shards
West-a has 1 node with 19 shards and 1 node with 1 shard
West-b has 1 node with 19 shards and 1 node with 1 shard

(That's 4 copies of each of the 20 shards, 80 in total, 20 per zone; a balanced west zone would be 10 and 10 per node, not 19 and 1.)

Stopping a node in the west causes the remaining node to get all 20 shards. If I start it up again, a single shard migrates back.

Self-inflicted part: in bootstrapping this cluster I brought up west first. Then I added east, which started moving shards around. I realized I needed more replicas for full east/west redundancy, and increased the count to 3. The shards in the east got stuck migrating (they were stuck for days), so I shut those nodes down, along with a stuck western node or two, one of which held the primary. That brought the cluster back to a fully green state: 20 shards everywhere, with some unallocated. When I started up east again it copied shards over quickly, possibly taking the replicas away from the western nodes, but without rebalancing.
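
(The replica bump itself was just the standard live settings update; "myindex" is a placeholder here:)

  curl -XPUT localhost:9200/myindex/_settings -d '{
    "index": { "number_of_replicas": 3 }
  }'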

  • the index settings you used to create this index?

Other than the replica count and the allocation awareness attribute (awszone, as shown earlier), our settings are pretty much the defaults.



hey,

I can actually reproduce this locally. Would you please open a dev issue on
GitHub for this? I'll try to get a fix in very quickly.

simon


Would you like a more descriptive title than "cluster not rebalancing"?



I think something like "Forced Awareness fails to balance shards", or so?
BTW, I found that once you start the same number of nodes in all zones, it
starts balancing in my tests. Need to dig deeper!
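
For anyone following along, "forced" awareness is the variant where you also list the expected attribute values, so the allocator keeps each zone's share reserved even while a zone is down; roughly, with the zone names from this thread:

  curl -XPUT localhost:9200/_cluster/settings -d '{
    "persistent": {
      "cluster.routing.allocation.awareness.attributes": "awszone",
      "cluster.routing.allocation.awareness.force.awszone.values": "us-west-2a,us-west-2b,us-east-2a,us-east-2b"
    }
  }'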

simon


Interesting. Issue created: https://github.com/elasticsearch/elasticsearch/issues/3580
