Newly created rivers are not being started

Hi,

When I try to create a new river, it does not start, not even the dummy river. Looking at a jstack dump of the master node, I can see that the master node hangs waiting on a request. I'm on 0.19.8.

"elasticsearch[node1][riverClusterService#updateTask][T#1]" daemon prio=10
tid=0x000000004145c800 nid=0x2eaa waiting on condition [0x00007fa354078000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000003fce5b7f8> (a
org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
at
org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:271)
at
org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:113)
at
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:45)
at
org.elasticsearch.river.routing.RiversRouter$1.execute(RiversRouter.java:109)
at
org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:103)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

There are already some rivers running in the cluster, but no new ones can be created. I expect that after I restart the cluster, things will work again. Would it make sense to set a timeout on this actionGet, so that we can retry the request later?
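
For illustration, I mean something along these lines (a rough sketch with a made-up helper name and timeout value, not the actual RiversRouter code), using the timed actionGet overload instead of the unbounded one:

    import org.elasticsearch.ElasticSearchTimeoutException;
    import org.elasticsearch.action.get.GetResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;

    // Hypothetical helper, not the real RiversRouter code: fetch a river's
    // _meta doc with a bounded wait instead of parking the thread forever.
    public class RiverMetaHelper {
        public static GetResponse getMeta(Client client, String riverName) {
            try {
                return client.prepareGet("_river", riverName, "_meta")
                        .execute()
                        .actionGet(TimeValue.timeValueSeconds(30));
            } catch (ElasticSearchTimeoutException e) {
                // caller could reschedule the routing task and retry later
                return null;
            }
        }
    }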

Best regards,
Michel

Michel Conrad wrote:

When I try to create a new river, it does not start, not even the dummy river. Looking at a jstack dump of the master node, I can see that the master node hangs waiting on a request. I'm on 0.19.8.

It looks like you might have a bad node. It's blocking while waiting to get the _meta doc. Is the cluster healthy? Does the master log mention anything about this river? If you restart and still have trouble, you could turn on TRACE-level logging and try to reproduce.
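
For example, in config/logging.yml on the master, something like this should do it (which logger prefixes to bump is my guess; "river" should cover org.elasticsearch.river):

    logger:
      # assumption: "river" maps to everything under org.elasticsearch.river
      river: TRACE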

[...]

    at org.elasticsearch.river.routing.RiversRouter$1.execute(RiversRouter.java:109)
    at org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:103)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

There are already some rivers running in the cluster, but no new ones can be created. I expect that after I restart the cluster, things will work again. Would it make sense to set a timeout on this actionGet, so that we can retry the request later?

You could time it out, but I suspect you may have something more
fundamental wrong with the cluster. Make sure it's healthy first.

-Drew

Hi Drew,

You're correct: when the river that failed to start was created (automatically), the cluster was in a red state, caused by our recent upgrade from 0.18 to 0.19. We issued a _flush before upgrading, but since some unrelated indices were closed during the _flush and the upgrade, their transaction logs were never processed. After the upgrade, opening one of these indices produced errors on the corresponding shards ("failed to recover shard, string index out of range: 0"), but not on the _river index. Closing them again, deleting the transaction logs, and reopening turned the cluster green again. But the get on the _meta doc continued to hang until we restarted.
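
For reference, the flush and the reopening were done with the stock APIs (index name made up here); the transaction logs themselves we deleted on disk:

    curl -XPOST 'localhost:9200/_flush'
    curl -XPOST 'localhost:9200/some_index/_close'
    # (delete the shard translog directories on disk while the index is closed)
    curl -XPOST 'localhost:9200/some_index/_open'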

After restarting the cluster, the rivers started as expected. So my question is: why does the get on _meta hang while the _river index is green, why does it keep hanging after we brought the cluster back to green, and would a timeout plus retry logic help recover from this issue without restarting?

Best,
Michel

Michel Conrad wrote:

[...]

After restarting the cluster, the rivers started as expected. So my question is: why does the get on _meta hang while the _river index is green, why does it keep hanging after we brought the cluster back to green, and would a timeout plus retry logic help recover from this issue without restarting?

My guess is that since that get hangs inside the RiversRouter
clusterChanged() event method, it doesn't see any more state updates.

Note that you might also be able to delete the river and recreate it
instead of completely restarting the cluster. I've had some trouble
with hung rivers as well; it's on my todo list to address.
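
Something like this is what I mean (river name made up; the dummy type is just for testing):

    curl -XDELETE 'localhost:9200/_river/my_river/'
    curl -XPUT 'localhost:9200/_river/my_river/_meta' -d '{ "type" : "dummy" }'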

-Drew

I also first thought about recreating the river, but it didn't help, and creating another river didn't work either. So it might be that the RiverClusterStateUpdateTask in the RiversRouter blocks waiting for the response to the get on the _meta document, which neither completes nor times out because of the cluster state, even after the cluster has recovered. Then, as you suggested, the clusterChanged() method keeps creating new RiverClusterStateUpdateTasks, which are never run because the earlier one is still blocking.
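
To make that starvation pattern concrete, here is a minimal standalone demonstration (plain java.util.concurrent, nothing Elasticsearch-specific, names made up): a single-thread executor runs tasks strictly in order, so one task that parks forever keeps every later task from starting, which is what the update thread in the jstack above looks like:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class StarvationDemo {
        public static void main(String[] args) {
            ExecutorService updateThread = Executors.newSingleThreadExecutor();
            final CountDownLatch never = new CountDownLatch(1);
            // first task parks forever, like the unbounded actionGet on _meta
            updateThread.submit(new Runnable() {
                public void run() {
                    try {
                        never.await();
                    } catch (InterruptedException ignored) {
                    }
                }
            });
            // second task is queued but never starts: the only worker is still
            // parked, so this line is never printed and the JVM just hangs
            updateThread.submit(new Runnable() {
                public void run() {
                    System.out.println("routing the new river");
                }
            });
        }
    }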

Maybe killing the master node would have worked, since the new master
would then have retried creating the river.
