Couple of exceptions on master


(ppearcy) #1

I was playing around with master last night in a distributed setup
with two servers and replicas = 0, while running queries on 50
threads.

When I ran similar tests with replicas=1, I didn't have any issues. I
plan to use replicas=1 in the real world and this testing was just for
performance numbers.
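
For anyone who wants to reproduce this, here is a minimal sketch of the kind of load test described above: a pool of 50 threads firing searches and tallying what comes back. Only the thread count comes from this post. The URL, the match_all query, the request count, and the helper names (http_search, run_load) are assumptions for illustration, as is reading partial failures from a _shards.failed count in the response.

```python
import concurrent.futures
import json
import urllib.request
from collections import Counter

# Assumed values: adjust to your own cluster. Only the thread count is from the post.
SEARCH_URL = "http://localhost:9200/_search"
QUERY = {"query": {"match_all": {}}}
THREADS = 50


def http_search(url=SEARCH_URL, body=QUERY):
    """Send one search request and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def run_load(search=http_search, threads=THREADS, total_requests=1000):
    """Run total_requests searches on a thread pool and tally the outcomes."""
    outcomes = Counter()

    def one_call(_):
        try:
            result = search()
        except Exception as exc:  # transport-level or HTTP-level failure
            return type(exc).__name__
        # Assumption: partial failures show up as a non-zero _shards.failed.
        failed = result.get("_shards", {}).get("failed", 0)
        return "shard_failures" if failed else "ok"

    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as pool:
        for outcome in pool.map(one_call, range(total_requests)):
            outcomes[outcome] += 1
    return outcomes
```

Calling run_load() while a node joins or leaves and printing the returned Counter should show whether failures cluster around the reallocation window. The search argument is injectable so the tallying logic can be checked without a live cluster.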

One test that I ran was adding a new node. While this was occurring, I
saw large swaths of exceptions for queries that look like this:
http://gist.github.com/551622

These appear to be failures related to the shard reallocating.

Once half the shards had been reallocated to the new box and the
cluster was in a steady state, I saw intermittent exceptions like this:
http://gist.github.com/551611

These are quite unexpected as there was nothing in flux. I was running
all the queries against a single node and this is the one that
reported the errors.

Finally, when using replicas=0 and removing a node, is there a clean
way to shut down the node, so that it reallocates its shards first, so
that all shards are available at all times? It appears that through
the service stop or node shutdown API, the service just stops and
other nodes must recover from the gateway.

Thanks,
Paul


(Shay Banon) #2

Hi Paul, can you open issues for the 3 cases below, just so we track them.
Answers below:

On Thu, Aug 26, 2010 at 6:37 PM, Paul ppearcy@gmail.com wrote:

I was playing around with master last night in a distributed setup
with two servers and replicas = 0, while running queries on 50
threads.

When I ran similar tests with replicas=1, I didn't have any issues. I
plan to use replicas=1 in the real world and this testing was just for
performance numbers.

One test that I ran was adding a new node. While this was occurring, I
saw large swaths of exceptions for queries that look like this:
http://gist.github.com/551622

These appear to be failures related to the shard reallocating.

Yeah, looks like it. It should not happen, though. I will try to recreate
and fix this.

Once half the shards had been reallocated to the new box and the
cluster was in a steady state, I saw intermittent exceptions like this:
http://gist.github.com/551611

These are quite unexpected as there was nothing in flux. I was running
all the queries against a single node and this is the one that
reported the errors.

Agreed. I will try to recreate this as well and check.

Finally, when using replicas=0 and removing a node, is there a clean
way to shut down the node, so that it reallocates its shards first, so
that all shards are available at all times? It appears that through
the service stop or node shutdown API, the service just stops and
other nodes must recover from the gateway.

Yes, this is how it works currently. I agree, it can be done in a more
clever manner without recovering from the gateway, though it's a bit
complicated ;). I think most deployments run with at least 1 replica,
but still, elasticsearch should be complete and handle the no-replica
case in the most efficient manner.
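
To make the "move the shards first, then stop" idea concrete, here is a toy sketch of draining a node in a simple node-to-shards map. It is not how elasticsearch allocates shards, and it ignores the hard parts (copying data, in-flight requests, cluster state updates). The function name drain_node and the least-loaded placement rule are assumptions chosen purely for illustration.

```python
def drain_node(assignments, leaving):
    """Return a new node -> shards mapping with `leaving` emptied and removed.

    Each shard on the leaving node is handed to whichever remaining node
    currently holds the fewest shards (ties broken by node name).
    The input mapping is not modified.
    """
    remaining = {
        node: list(shards)
        for node, shards in assignments.items()
        if node != leaving
    }
    if not remaining:
        raise ValueError("cannot drain the last node: no target for its shards")
    for shard in assignments.get(leaving, []):
        target = min(remaining, key=lambda node: (len(remaining[node]), node))
        remaining[target].append(shard)
    return remaining
```

With two nodes, drain_node({"node1": [0, 1, 2], "node2": [3, 4]}, "node2") returns {"node1": [0, 1, 2, 3, 4]}, so every shard still has a home before node2 goes away. With replicas=0 the real system would also have to finish copying each shard before releasing the old copy.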

Thanks,
Paul


(ppearcy) #3

Thanks Shay. I opened tickets for these items.

Like I said, they are non-issues for me, since I will run with a minimum
of replicas=1, but it is great to see such an effort towards quality.

As a side note, congrats on pushing 0.10. The refactored gateway
recovery seems to be working much more efficiently and I haven't run
into any issues.

Best Regards,
Paul


