Couple of exceptions on master

ppearcy · August 26, 2010, 3:37pm

I was playing around with master last night in a distributed setup
with two servers and replicas=0, while running queries on 50
threads.

When I ran similar tests with replicas=1, I didn't have any issues. I
plan to use replicas=1 in the real world and this testing was just for
performance numbers.
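
For anyone who wants to reproduce a similar load, here is a minimal sketch of a 50-thread query driver. It is not the harness used in this test: the node address, the query strings, and the function names are assumptions for illustration, and only the index name `index21` comes from the logs below.

```python
import concurrent.futures
import urllib.parse
import urllib.request

# Assumptions for illustration only: node address and index name.
BASE_URL = "http://localhost:9200"
INDEX = "index21"


def http_search(query):
    """Run one URI search against a single node and return the HTTP status."""
    url = f"{BASE_URL}/{INDEX}/_search?" + urllib.parse.urlencode({"q": query})
    # urlopen raises on connection errors and on HTTP error statuses.
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status


def run_load(search, queries, threads=50):
    """Run every query through `search` on a pool of `threads` workers.

    Returns (successes, failures); a failure is any exception raised by `search`.
    """
    successes, failures = 0, []
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(search, q) for q in queries]
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
                successes += 1
            except Exception as exc:  # record the failure, keep the run going
                failures.append(exc)
    return successes, failures
```

Calling `run_load(http_search, ["*:*"] * 1000)` while a node joins or leaves would give a rough count of failed queries. Passing the search function as a parameter keeps the driver usable with any client.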

One test that I ran was adding a new node. While this was occurring, I
saw large swaths of exceptions for queries that look like this:

https://gist.github.com/anonymous/551622

[20:18:44,230][DEBUG][action.search.type       ] [John Walker] [index21][0], node[46e2473e-6a5b-4a77-bdf7-83605caaf48a], relocating [2b9c70c1-1b10-48e2-b8bc-19a73578545d], [P], s[RELOCATING]: Failed to execute [org.elasticsearch.action.search.SearchRequest@199d80b5]
org.elasticsearch.index.IndexShardMissingException: [index21][0] missing
        at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:156)
        at org.elasticsearch.search.SearchService.createContext(SearchService.java:288)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:166)
        at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:129)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:77)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:194)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.access$000(TransportSearchTypeAction.java:80)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.run(TransportSearchTypeAction.java:153)

(Trace truncated; the full output is in the gist linked above.)

These appear to be failures related to the shard reallocating.

Once half the shards had been reallocated to the new box and I was in
a steady state, I saw intermittent exceptions like this:

https://gist.github.com/anonymous/551611

[20:44:10,061][DEBUG][action.search.type       ] [John Walker] [index21][0], node[2b9c70c1-1b10-48e2-b8bc-19a73578545d], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4b751c3b]
org.elasticsearch.transport.RemoteTransportException: None remote transport exception
Caused by: org.elasticsearch.transport.ResponseHandlerFailureTransportException: Failed to handle response
        at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:145)
        at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:102)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:302)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)

(Trace truncated; the full output is in the gist linked above.)

These are quite unexpected as there was nothing in flux. I was running
all the queries against a single node and this is the one that
reported the errors.
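
To separate the two kinds of failures in a long log, one option is to tally the "Failed to execute" lines by shard and shard state. The sketch below assumes the log lines keep the layout shown in the two excerpts above; the regular expression and the function name are my own and are not part of Elasticsearch.

```python
import collections
import re

# Matches e.g. "[index21][0], node[...], relocating [...], [P], s[RELOCATING]: Failed to execute"
# The pattern is inferred from the two excerpts above, so it is an assumption.
FAILURE = re.compile(
    r"\[(?P<index>[^\[\]]+)\]\[(?P<shard>\d+)\], node\[(?P<node>[^\]]+)\]"
    r".*?s\[(?P<state>[A-Z_]+)\]: Failed to execute"
)


def tally_failures(lines):
    """Count failed search executions per (index, shard number, shard state)."""
    counts = collections.Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            key = (match["index"], int(match["shard"]), match["state"])
            counts[key] += 1
    return counts
```

Feeding it the node's log file (for example `tally_failures(open(path))`, with `path` pointing at the log) shows how many failures hit shards in the RELOCATING state versus the STARTED state.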

Finally, when using replicas=0 and removing a node, is there a clean
way to shut down the node so that it reallocates its shards first and
all shards stay available at all times? It appears that with the
service stop or the node shutdown API, the service just stops and the
other nodes must recover from the gateway.

Thanks,
Paul

kimchy · August 27, 2010, 11:20am

Hi Paul, can you open issues for the 3 cases below, just so we track them.
Answers below:

On Thu, Aug 26, 2010 at 6:37 PM, Paul ppearcy@gmail.com wrote:

> I was playing around with master last night in a distributed set up
> with two servers and replicas = 0, while running queries on 50
> threads.
>
> When I ran similar tests with replicas=1, I didn't have any issues. I
> plan to use replicas=1 in the real world and this testing was just for
> performance numbers.
>
> One test that I ran was adding a new node. While this was occurring, I
> saw large swaths of exceptions for queries that look like this:
> https://gist.github.com/anonymous/551622
>
> These appear to be failures related to the shard reallocating.

Yeah, looks like it. It should not happen, though. I will try to recreate
and fix this.

> Once half the shards have been reallocated to the new box and I was in
> a steady state, I saw intermittent exceptions like this:
> https://gist.github.com/anonymous/551611
>
> These are quite unexpected as there was nothing in flux. I was running
> all the queries against a single node and this is the one that
> reported the errors.

Agreed. I will try to recreate this one as well and check.

> Finally, when using replicas=0 and removing a node, is there a clean
> way to shutdown the node, so that it reallocates its shards first, so
> that all shards are available at all times? It appears that through
> the service stop or node shutdown API, the service just stops and
> other nodes must recover from the gateway.

Yes, this is how it works currently. I agree, it can be done in a more
clever manner without recovering from the gateway, though it's a bit
complicated ;). I think the majority of scenarios run with at least
1 replica, but still, elasticsearch should be complete and handle the
no-replica case in the most efficient manner.

> Thanks,
> Paul
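
A note for readers on later Elasticsearch releases: per this thread, no drain-before-shutdown mechanism existed at the time. Later versions added allocation filtering, where setting `cluster.routing.allocation.exclude._ip` through the cluster update settings API (`PUT /_cluster/settings`) asks the allocator to move shards off a node before it is stopped. The sketch below only builds the request body; the helper name and the IP address in the usage note are illustrative assumptions, and whether the setting is available depends on the version you run.

```python
import json


def exclude_node_body(ip_address):
    """Build a cluster-settings body that excludes a node, by IP, from shard allocation.

    Hypothetical helper: it only serializes the settings and sends nothing.
    """
    settings = {
        "transient": {
            "cluster.routing.allocation.exclude._ip": ip_address,
        }
    }
    return json.dumps(settings)
```

For example, `exclude_node_body("10.0.0.2")` produces the JSON to send to `/_cluster/settings`. Once the node holds no more shards it can be stopped without forcing a gateway recovery.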

ppearcy · August 30, 2010, 1:18am

Thanks Shay. I opened tickets for these items.

Like I said, they are non-issues for me, since I will run with a minimum
of replicas=1, but it is great to see such an effort towards quality.

As a side note, congrats on pushing 0.10. The refactored gateway
recovery seems to be working much more efficiently and I haven't run
into any issues.

Best Regards,
Paul
