Weird shard routing behavior in 0.90.0 when using routing exclusions and demoting number_of_replicas?

Background:

elasticsearch: 0.90.0
Java: 1.7.17

I'm using daily indexes for logging data. As time goes on, I'd like fewer
copies of the data in the cluster, since it becomes less critical. The index
being manipulated is already optimized and is no longer taking writes on a
3-node cluster. I tried playing with total_shards_per_node and
number_of_replicas, but the routing logic often left an unallocated shard:

{
  "settings" : {
    "index.number_of_shards" : 3,
    "index.number_of_replicas" : 1,
    "index.routing.allocation.total_shards_per_node" : 2,
    "index.auto_expand_replicas" : false
  }
}
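
For completeness, this is roughly how those settings get applied to an
existing daily index via the update settings API (localhost:9200 is just a
placeholder for one of the nodes; number_of_shards is fixed at index
creation, so only the dynamic settings are sent):

# Sketch: apply the dynamic settings to an existing index.
curl -XPUT 'http://localhost:9200/logstash-2013.01.01/_settings' -d '{
  "index.number_of_replicas" : 1,
  "index.routing.allocation.total_shards_per_node" : 2
}'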

Ideally this means:

Node 1: [0], [1]
Node 2: [1], [2]
Node 3: [0], [2]

But what happened frequently was:

Node 1: [0], [1]
Node 2: [0], [1]
Node 3: [2]
Unallocated: [2]

So, new idea: use the day number and the node index to implement a routing
decision. Each node sets an "id" attribute:

Node 1: node.id: 0
Node 2: node.id: 1
Node 3: node.id: 2

Take the index's day of year modulo the number of nodes to pick a node to
exclude. For logstash-2013.01.01, that's day 1 and 1 mod 3 = 1, so:
index.routing.allocation.exclude.id = 1
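
Concretely, and assuming the attribute is set as node.id in each node's
elasticsearch.yml, the exclusion is applied per index roughly like this
(localhost:9200 is a placeholder):

# elasticsearch.yml on node 2 (nodes 1 and 3 use node.id: 0 and node.id: 2):
# node.id: 1

# Sketch: exclude the node whose "id" attribute matches the day-of-year modulus.
curl -XPUT 'http://localhost:9200/logstash-2013.01.01/_settings' -d '{
  "index.routing.allocation.exclude.id" : "1"
}'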

This has no effect on the shards from what I can see, because the cluster
still needs to meet the replica count of 2. Once the exclude is set, I
downgrade the number of replicas to 1:

index.number_of_replicas = 1
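
That step is just another settings update against the same index (again,
localhost:9200 is a placeholder):

# Sketch: drop the replica count on the already-excluded index.
curl -XPUT 'http://localhost:9200/logstash-2013.01.01/_settings' -d '{
  "index.number_of_replicas" : 1
}'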

What I thought would happen: The node flagged as excluded drops its shards.

What actually happens: The primary shards for the index are left where they
are, and a replica of each shard is dropped, but from one of the other nodes
rather than the excluded one. Then the routing/allocation logic takes over,
decides "hey, that shard shouldn't be there", and begins copying the shards
off the excluded node to the nodes which previously held these shards but
dropped them because of the replica change. This results in a MASSIVE
traffic spike, as these indexes (just in testing at this point) are ~65GB
each.
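
One way to see this happening is to poll cluster health while the replica
change goes through; relocating_shards climbs as the data gets copied around
(localhost:9200 is a placeholder for any node in the cluster):

# Sketch: watch the shard relocation triggered by the exclude + replica change.
curl 'http://localhost:9200/_cluster/health?level=shards&pretty=true'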

Am I doing something wrong, or is this a bug in how this logic is being
handled?

Thanks,

--
Brad Lhotsky


Hey Brad,

On Wednesday, May 29, 2013 4:41:32 PM UTC+2, Brad Lhotsky wrote:


Ideally this means:

Node 1: [0], [1]
Node 2: [1], [2]
Node 3: [0], [2]

true!

But what happened frequently was:

Node 1: [0], [1]
Node 2: [0], [1]
Node 3: [2]
Unallocated: [2]

Odd! I just added a test case and this works pretty well for me. What does
"frequently" mean? Is this happening on a clean cluster, or do you have
multiple indices allocated there? I'd really be curious whether you can
reproduce this in a small gist.

simon


On Wednesday, May 29, 2013 11:43:08 PM UTC+2, simonw wrote:


Odd! I just added a test case and this works pretty well for me. What does
"frequently" mean? Is this happening on a clean cluster, or do you have
multiple indices allocated there? I'd really be curious whether you can
reproduce this in a small gist.

I had to abandon this method because of this problem, but for context these
indexes were being created by logstash. "Frequently" means roughly 1 in 7
times on my production cluster and about 1 in 3 in dev (older hardware, less
memory, and a substantially higher load from processing the same data as
production). I spoke to Shay about this in the past and he said it's a hard
problem to solve and that I should investigate other ways of distributing
the data, either by moving entire indexes or by just excluding indexes from
routing.

I appreciate the interest in this problem, but I feel the matter below is
more concerning.

So, new idea: use the day number and the node index to implement a routing
decision. Each node sets an "id" attribute:

Node 1: node.id: 0
Node 2: node.id: 1
Node 3: node.id: 2

Take the index's day of year modulo the number of nodes to pick a node to
exclude. For logstash-2013.01.01, that's day 1 and 1 mod 3 = 1, so:
index.routing.allocation.exclude.id = 1

This has no effect on the shards from what I can see, because the cluster
still needs to meet the replica count of 2. Once the exclude is set, I
downgrade the number of replicas to 1:

index.number_of_replicas = 1

What I thought would happen: The node flagged as excluded drops its shards.

What actually happens: The primary shards for the index are left where they
are, and a replica of each shard is dropped, but from one of the other nodes
rather than the excluded one. Then the routing/allocation logic takes over,
decides "hey, that shard shouldn't be there", and begins copying the shards
off the excluded node to the nodes which previously held these shards but
dropped them because of the replica change. This results in a MASSIVE
traffic spike, as these indexes (just in testing at this point) are ~65GB
each.

Am I doing something wrong, or is this a bug in how this logic is being
handled?

