IndexMissingException on non-master/non-data node after master restart

Hi all,

I might have found a bug in the way a non-master node reconnects to a
master.

I use the Java API to create a node with these settings (a subset of the full configuration):

ImmutableSettings.settingsBuilder()
.put("node.data", false)
.put("node.local", false)
.put("node.master", false)
.put("network.host", "127.0.0.1")
.put("discovery.type", "zen")
.put("discovery.zen.minimum_master_nodes", 1)
.put("discovery.zen.ping.multicast.enabled", false)
.putArray("discovery.zen.ping.unicast.hosts", "127.0.0.1")
.build();
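
For reference, here is a minimal sketch of how settings like these are typically turned into an embedded client node and a Client with the 0.19-era Java API (illustrative only, not my exact code):

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

// "settings" is the Settings object built above
Node node = NodeBuilder.nodeBuilder().settings(settings).node(); // starts the node and joins via unicast discovery
Client client = node.client(); // this Client is used for the searches mentioned below
// ... searches ...
// node.close(); // on application shutdown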

On the same host, there is a single standalone instance, configured
with this elasticsearch.yml:

node.master: true
node.data: true
network.host: 127.0.0.1
discovery.type: zen
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["127.0.0.1"]

The client is able to connect just fine. However, as soon as I restart
the standalone instance (i.e. the master), the following is logged on
the client:

INFO - zen - [Arturo Falcones] master_left [[Rhodes, James][U9u4pvSYSMiu2D7pCavK9Q][inet[/127.0.0.1:9300]]{master=true}], reason [transport disconnected (with verified connect)]
WARN - zen - [Arturo Falcones] not enough master nodes after master left (reason = transport disconnected (with verified connect)), current nodes: {[Arturo Falcones][aP03z07qSFaewHW0piuNiA][inet[/127.0.0.1:9301]]{data=false, local=false, master=false},}
INFO - service - [Arturo Falcones] removed {[Rhodes, James][U9u4pvSYSMiu2D7pCavK9Q][inet[/127.0.0.1:9300]]{master=true},}, reason: zen-disco-master_failed ([Rhodes, James][U9u4pvSYSMiu2D7pCavK9Q][inet[/127.0.0.1:9300]]{master=true})
INFO - service - [Arturo Falcones] detected_master [Turac][d9uKtBXERde7zrmNIe9E5A][inet[/127.0.0.1:9300]]{master=true}, added {[Turac][d9uKtBXERde7zrmNIe9E5A][inet[/127.0.0.1:9300]]{master=true},}, reason: zen-disco-receive(from master [[Turac][d9uKtBXERde7zrmNIe9E5A][inet[/127.0.0.1:9300]]{master=true}])

It looks as if the client reconnected just fine. The problem, though,
is that every search (I haven't tried any other operations yet)
results in an IndexMissingException:

Caused by: org.elasticsearch.indices.IndexMissingException: [default] missing
    at org.elasticsearch.cluster.routing.operation.plain.PlainOperationRouting.indexRoutingTable(PlainOperationRouting.java:230)
    at org.elasticsearch.cluster.routing.operation.plain.PlainOperationRouting.searchShards(PlainOperationRouting.java:175)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.<init>(TransportSearchTypeAction.java:118)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:70)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:61)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:58)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:48)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:108)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61)
    at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:83)
    at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:206)
    at org.elasticsearch.action.search.SearchRequestBuilder.doExecute(SearchRequestBuilder.java:743)
    at org.elasticsearch.action.support.BaseRequestBuilder.execute(BaseRequestBuilder.java:53)
    at org.elasticsearch.action.support.BaseRequestBuilder.execute(BaseRequestBuilder.java:47)
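
The kind of call that hits this code path looks roughly like the following (an illustrative sketch; the index name is taken from the exception, the query is arbitrary):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;

// A plain query-then-fetch search against the "default" index via the node client.
SearchResponse response = client.prepareSearch("default")
        .setQuery(QueryBuilders.matchAllQuery())
        .execute()
        .actionGet();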

Browsing the default index with elasticsearch-head works just fine,
and restarting the client solves the problem too. Hence I'm assuming
there is a problem with the routing metadata not being correctly
restored on the client after the reconnect.
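
A quick way to check whether the index metadata is visible to the client after the master comes back would be something like this (hedged sketch; accessor names as I recall them from the 0.19.x Java API):

import org.elasticsearch.action.admin.cluster.state.ClusterStateResponse;

// Fetch the cluster state and check whether it contains the "default" index.
// If this reports true while a search from the same Client still throws
// IndexMissingException, the metadata used for routing on the client node
// is apparently stale.
ClusterStateResponse stateResponse = client.admin().cluster()
        .prepareState()
        .execute()
        .actionGet();
System.out.println("metadata has [default]: "
        + stateResponse.state().metaData().hasIndex("default"));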

Is this indeed a bug, or should I explicitly restart the node as soon
as the master goes down?

Cheers, Stefan

I tried to recreate it, but couldn't. I started a standalone node with the
mentioned config, then fired this code: gist 2491022 on GitHub.
I restarted the first node, and things worked nicely. I tested on 0.19.2.

Note, you might run into problems if the "master/data" node ends up
starting on a port other than 9300, which can happen if you start it
after the client (the client will then have taken 9300). In that case, it
makes sense to have the unicast list look like this: "127.0.0.1:9300-9301",
or just have two entries, one for 9300 and one for 9301.
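
In the client's settings that would look roughly like this (illustrative; either form should work):

// Port range form:
.putArray("discovery.zen.ping.unicast.hosts", "127.0.0.1:9300-9301")
// or two explicit entries:
.putArray("discovery.zen.ping.unicast.hosts", "127.0.0.1:9300", "127.0.0.1:9301")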


Hi Kimchy/All,
I did a lot of tests...
It seems that this happens when the master on port 9200 goes down (using the head GUI shutdown command or killing the process).

If I kill a master on a port != 9200, the new master discovery runs fine.
I'm using 0.19.3.

Regards!
Very, very nice work

What's your configuration? Do you mean that you managed to recreate the
index-missing part?


I'm running elasticsearch as a cluster on a single server in order to take a basic approach to this.
I configured the cluster as follows:
3 master-eligible nodes (2 of them with no data)
1 search balancer (no data, no master)
2 data/search nodes (1 of them also master-eligible)

As I wrote before, if I shut down the active master on port 9200, the cluster goes down (the new master discovery fails):
reason: zen-disco-master_failed
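
Not directly related to the port 9200 observation, but with three master-eligible nodes the usual quorum recommendation would be the following in elasticsearch.yml on every master-eligible node (illustrative snippet):

discovery.zen.minimum_master_nodes: 2   # quorum = (3 master-eligible nodes / 2) + 1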

[2012-05-24 11:03:31,436][INFO ][discovery.zen ] [CL-01 DATA MASTER] master_left [[CL-00 MASTER][8hjehXIfSReVXdTt-M6FaA][inet[/10.172.81.97:9300]]{data=false, rack=cl00, master=true}], reason [shut_down]

[2012-05-24 11:03:33,345][INFO ][cluster.service ] [CL-01 DATA MASTER] master {new [CL-04 MASTER][7igYRM0aRrurFkFDk0iW_A][inet[/10.172.81.97:9304]]{data=false, rack=cl04, master=true}, previous [CL-00 MASTER][8hjehXIfSReVXdTt-M6FaA][inet[/10.172.81.97:9300]]{data=false, rack=cl00, master=true}}, removed {[CL-00 MASTER][8hjehXIfSReVXdTt-M6FaA][inet[/10.172.81.97:9300]]{data=false, rack=cl00, master=true},}, reason: zen-disco-master_failed ([CL-00 MASTER][8hjehXIfSReVXdTt-M6FaA][inet[/10.172.81.97:9300]]{data=false, rack=cl00, master=true})

[2012-05-24 11:03:44,533][INFO ][discovery.zen ] [CL-01 DATA MASTER] master_left [[CL-04 MASTER][7igYRM0aRrurFkFDk0iW_A][inet[/10.172.81.97:9304]]{data=false, rack=cl04, master=true}], reason [no longer master]

[2012-05-24 11:03:44,925][WARN ][discovery.zen ] [CL-01 DATA MASTER] not enough master nodes after master left (reason = no longer master), current nodes: {[CL-02 DATA][Dqp1YAEVRiudieFhJUaDzA][inet[/10.172.81.97:9302]]{rack=cl02, master=false},[CL-01 DATA MASTER][w-Xb6BF7Qcu7Z0UM7ryQtg][inet[/10.172.81.97:9301]]{rack=cl01, master=true},[CL-03 SEARCH BALANCER][OC3vDCpVT8-OJNR2eeIpQA][inet[/10.172.81.97:9303]]{data=false, rack=cl03, master=false},}

[2012-05-24 11:03:44,945][INFO ][cluster.service ] [CL-01 DATA MASTER] removed {[CL-02 DATA][Dqp1YAEVRiudieFhJUaDzA][inet[/10.172.81.97:9302]]{rack=cl02, master=false},[CL-04 MASTER][7igYRM0aRrurFkFDk0iW_A][inet[/10.172.81.97:9304]]{data=false, rack=cl04, master=true},[CL-03 SEARCH BALANCER][OC3vDCpVT8-OJNR2eeIpQA][inet[/10.172.81.97:9303]]{data=false, rack=cl03, master=false},}, reason: zen-disco-master_failed ([CL-04 MASTER][7igYRM0aRrurFkFDk0iW_A][inet[/10.172.81.97:9304]]{data=false, rack=cl04, master=true})

Regards