River does not start correctly after full cluster shutdown

Hi,

When I shut down the entire cluster, the river fails to start on the
next startup.
I am using the head version of the 0.16 branch and a local gateway.
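
For reference, the river in question is created by indexing a _meta document into the _river index, roughly like the sketch below (the "dummy" river type and the client-node setup are placeholders, not my actual setup):

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class RegisterRiver {
    public static void main(String[] args) throws Exception {
        // Join the cluster as a client-only node (cluster name as in the warning below).
        Node node = NodeBuilder.nodeBuilder()
                .clusterName("testcluster")
                .client(true)
                .node();
        Client client = node.client();

        // A river is registered by putting its _meta document into the _river index;
        // "dummy" stands in for the real river type.
        client.prepareIndex("_river", "testriver", "_meta")
                .setSource(jsonBuilder()
                        .startObject()
                        .field("type", "dummy")
                        .endObject())
                .execute()
                .actionGet();

        node.close();
    }
}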

The following warning is logged:
[2011-05-04 16:19:39,401][WARN ][river] [Moonstone] failed to get _meta from [testcluster]/[testriver]
org.elasticsearch.action.NoShardAvailableActionException: [_river][0]
No shard available for [testriver#_meta]
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.perform(TransportShardSingleOperationAction.java:193)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.performFirst(TransportShardSingleOperationAction.java:159)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.start(TransportShardSingleOperationAction.java:116)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:69)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:45)
at org.elasticsearch.action.support.BaseAction.execute(BaseAction.java:61)
at org.elasticsearch.client.node.NodeClient.get(NodeClient.java:152)
at org.elasticsearch.client.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:119)
at org.elasticsearch.client.action.support.BaseRequestBuilder.execute(BaseRequestBuilder.java:56)
at org.elasticsearch.river.RiversService$ApplyRivers.riverClusterChanged(RiversService.java:217)
at org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:126)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

It seems that the RiversService tries to get the _meta before the state
has been restored, and it never retries afterwards.
If I manually kill the node issuing the warning, the river is started
correctly on another node.
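
What I would have expected is something along these lines: if the get for _meta fails because no shard is available yet, schedule another attempt instead of giving up. This is only a rough sketch, not the actual RiversService code (the class name, the retry interval and the plain ScheduledExecutorService are my own assumptions):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.NoShardAvailableActionException;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;

public class RiverMetaFetcher {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Client client;
    private final String riverName;

    public RiverMetaFetcher(Client client, String riverName) {
        this.client = client;
        this.riverName = riverName;
    }

    public void fetchMeta() {
        client.prepareGet("_river", riverName, "_meta")
                .execute(new ActionListener<GetResponse>() {
                    @Override
                    public void onResponse(GetResponse response) {
                        if (response.exists()) {
                            // hand the _meta source over to whatever starts the river
                            System.out.println("got _meta: " + response.sourceAsString());
                        }
                    }

                    @Override
                    public void onFailure(Throwable e) {
                        if (e instanceof NoShardAvailableActionException) {
                            // the _river shard is not allocated yet: retry later
                            // instead of logging the warning once and never trying again
                            scheduler.schedule(new Runnable() {
                                public void run() {
                                    fetchMeta();
                                }
                            }, 5, TimeUnit.SECONDS);
                        }
                    }
                });
    }
}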

Best,
Michel

Opened an issue here: Possible failure to start a river after cluster restart · Issue #902 · elastic/elasticsearch. I can't recreate it here; can you recreate it on your end? I will push a fix, but without being able to recreate it, you will need to verify it...

I will try it out as soon as you push the fix.
What might be of help: I am currently using 5 nodes with 2 replicas.
It does not happen every time, but nearly every time.
Usually it happens when the river is started on one of the first
nodes (in the order in which the nodes are started).
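
Since the timing of local-gateway recovery after a full restart seems to be the trigger, this is roughly how such a cluster can be told to wait for more nodes before recovering state (a sketch only; the values and the embedded-node form are assumptions, not my real configuration):

import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class StartNode {
    public static void main(String[] args) {
        // Local gateway, and hold off recovery after a full restart until
        // enough of the 5 nodes have joined (the thresholds are assumptions).
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "testcluster")
                .put("gateway.type", "local")
                .put("gateway.recover_after_nodes", 3)
                .put("gateway.expected_nodes", 5)
                .put("index.number_of_replicas", 2)
                .build();

        Node node = NodeBuilder.nodeBuilder().settings(settings).node();
        // node.close() when shutting the node down
    }
}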

Hi,

I got some problems with rivers not starting correctly after a full
cluster shutdown. This time it is not a
NoShardAvailableActionException but an IndexMissingException, and
the river will not be started. The river is only started correctly on
another node if I kill the node getting the exception. I think the
retry logic in RiversService.java should also retry the get when
an IndexMissingException occurs.
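
In other words, the check that decides whether the failed _meta get should simply be retried later should probably cover both exceptions. A minimal sketch of what I mean (the class and method names are made up for illustration, not actual RiversService code):

import org.elasticsearch.action.NoShardAvailableActionException;
import org.elasticsearch.indices.IndexMissingException;

public final class RetriableRiverFailures {

    private RetriableRiverFailures() {
    }

    // true if the failed _meta get should simply be retried later, because
    // the _river index or its shard has not been restored yet
    public static boolean shouldRetry(Throwable t) {
        return t instanceof NoShardAvailableActionException
                || t instanceof IndexMissingException;
    }
}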

Best,
Michel

[2011-06-15 18:19:41,298][WARN ][discovery.zen.ping.multicast] [Thing] received ping response with no matching id [1]
[2011-06-15 18:19:41,298][WARN ][discovery.zen.ping.multicast] [Thing] received ping response with no matching id [1]
[2011-06-15 18:19:44,440][WARN ][river] [Thing] failed to get _meta from [trendiction]/[dsearch_en]
org.elasticsearch.indices.IndexMissingException: [_river] missing
at org.elasticsearch.cluster.metadata.MetaData.concreteIndex(MetaData.java:211)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.<init>(TransportShardSingleOperationAction.java:107)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.<init>(TransportShardSingleOperationAction.java:88)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:69)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:45)
at org.elasticsearch.action.support.BaseAction.execute(BaseAction.java:61)
at org.elasticsearch.client.node.NodeClient.get(NodeClient.java:152)
at org.elasticsearch.client.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:119)
at org.elasticsearch.client.action.support.BaseRequestBuilder.execute(BaseRequestBuilder.java:56)
at org.elasticsearch.river.RiversService$ApplyRivers.riverClusterChanged(RiversService.java:220)
at org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:126)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-06-15 18:19:45,173][WARN ][discovery] [Thing] waited for 30s and no initial state was set by the discovery

Yea, index missing should also be handled; opened an issue: River might not start properly after cluster shutdown (timing) · Issue #1038 · elastic/elasticsearch.
