Unstable cluster

Hi,

We have a fairly large elasticsearch cluster (40 nodes) in production
and recently upgraded to 0.17.7. We have many rivers (a custom
plugin, which also searches the same cluster while indexing) running
on that cluster, and 400 indexes (counting all replicas and shards).
The number of shards is set to 1 by default on all indexes. All
indexes have at least 1 replica; the important ones have 4 or 5. All
data is stored locally.

The cluster is extremely unstable (it was on the previous version as
well). Queries sometimes execute fast, then they are dead slow, or
actions just time out after a while (I suspect elasticsearch-internal
timeouts). The cluster state switches between yellow and red every few
minutes, while the load on all the machines is very low.

Also, when starting up, some indexes are sometimes not opened
correctly (_status shows a NullPointerException). Closing and
reopening the index fixes this.
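
For reference, the close/reopen workaround amounts to something like the
following with the Java admin API our river plugin already uses (the
prepareClose/prepareOpen builder names are my assumption and may be named
differently in 0.17):

// Hedged sketch: close and reopen an index whose shards came up broken.
// The index name is just one of the affected indexes, as an example.
client.admin().indices().prepareClose("dsearch_en_001859508000")
        .execute().actionGet();
client.admin().indices().prepareOpen("dsearch_en_001859508000")
        .execute().actionGet();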

A few minutes after being elected, while distributing the loading of
the local indexes, the master gets a few of these exceptions:

[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [YH9V43nrSnGsBWc5VOrkGw]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[Zaladane][inet[/192.168.6.10:9300]][/gateway/local/started-shards/node]
request_id [48654] timed out after [30001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [Vi2_pzU0RYqG1iuietEfvw]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[Lazarus][inet[/192.168.5.3:9300]][/gateway/local/started-shards/node]
request_id [48664] timed out after [30000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [0wYatYDkRSCfmnij5OxjQg]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[Rodstvow][inet[/192.168.6.11:9300]][/gateway/local/started-shards/node]
request_id [48669] timed out after [30000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The nodes only have these log file entries:

[2011-10-05 00:05:50,920][WARN ][river ] [Zaladane]
failed to create river [abc][dsearch_other_1-3]
org.elasticsearch.action.UnavailableShardsException: [_river][0] [5]
shardIt, [1] active : Timeout waiting for [1m], request: index
{[_river][dsearch_other_1-3][_status],
source[{"ok":true,"node":{"id":"YH9V43nrSnGsBWc5VOrkGw","name":"Zaladane","transport_address":"inet[/192.168.6.10:9300]"}}]}
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-05 00:06:50,921][WARN ][river ] [Zaladane]
failed to write failed status for river creation
org.elasticsearch.action.UnavailableShardsException: [_river][0] [5]
shardIt, [1] active : Timeout waiting for [1m], request: index
{[_river][dsearch_other_1-3][_status],
source[{"ok":true,"node":{"id":"YH9V43nrSnGsBWc5VOrkGw","name":"Zaladane","transport_address":"inet[/192.168.6.10:9300]"}}]}
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

There are no ERROR entries in the logs.

Any idea what might cause this? How can I turn on more aggressive logging?
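
My assumption is that logger verbosity is controlled via config/logging.yml;
a minimal sketch, assuming the default log4j-style file layout, would be to
raise the levels of the components that show up in the traces, e.g.:

rootLogger: ${es.logger.level}, console, file
logger:
  # components that appear in the traces above
  action: DEBUG
  gateway: TRACE
  discovery: TRACE
  cluster.service: DEBUG
  river: DEBUG

or, if I read the default file correctly, starting the nodes with
-Des.logger.level=DEBUG to raise everything at once. Please correct me if
there is a better way.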

Thanks,
Thibaut

After a while, RoutingValidationExceptions also turn up:

org.elasticsearch.cluster.routing.RoutingValidationException: [Index
[dsearch_en_001859508000]: Shard [0] routing table has wrong number of
replicas, expected [1], got [0]]
at org.elasticsearch.cluster.routing.RoutingTable.validateRaiseException(RoutingTable.java:97)
at org.elasticsearch.cluster.routing.allocation.ShardsAllocation.applyStartedShards(ShardsAllocation.java:77)
at org.elasticsearch.cluster.action.shard.ShardStateAction$4.execute(ShardStateAction.java:188)
at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:189)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Thanks,
Thibaut

Just a last comment:

The cluster also never reaches an equilibrium. There are always shards
that are unassigned or initializing and must be reassigned (the
cluster state changes between red and yellow constantly); it is never
green:
{
cluster_name: abc
status: red
timed_out: false
number_of_nodes: 40
number_of_data_nodes: 40
active_primary_shards: 229
active_shards: 452
relocating_shards: 0
initializing_shards: 25
unassigned_shards: 4
}
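
To see which indexes the initializing/unassigned shards belong to, something
along these lines should work with the Java client we already use; the
accessor names (indices(), initializingShards(), unassignedShards()) are from
memory of the 0.17-era API and should be treated as assumptions:

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthStatus;
import org.elasticsearch.action.admin.cluster.health.ClusterIndexHealth;
import org.elasticsearch.client.Client;

// Hedged sketch: print every index that is not green, with its shard counters.
private void printNonGreenIndices(Client client) {
    ClusterHealthResponse health = client.admin().cluster()
            .prepareHealth().execute().actionGet();
    for (ClusterIndexHealth idx : health.indices().values()) {
        if (idx.status() != ClusterHealthStatus.GREEN) {
            System.out.println(idx.index() + ": " + idx.status()
                    + ", initializing=" + idx.initializingShards()
                    + ", unassigned=" + idx.unassignedShards());
        }
    }
}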

I might be wrong, but I think it's due to rivers also being
started on master nodes:

I have the following config on my master node:
node.master: true
node.data: false
Rivers are still being started on the master node, however.

Is it possible to skip the master for rivers as well?

I had the following log entries, which seem to indicate that the river
which couldn't be started was blocking the master node (all the
expiration requests arrive at the same time):

[2011-10-05 00:41:18,006][WARN ][river ] [Frog-Man]
failed to create river
[trendictionrebuilder][dsearch_prosearch-deloitte-cactusnew_rebuilder]
[2011-10-05 00:42:18,008][WARN ][river ] [Frog-Man]
failed to write failed status for river creation
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119909ms]
ago, timed out [89909ms] ago, action
[/cluster/nodes/indices/shard/store/node], node
[[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300]]{master=true}],
id [29563]
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119910ms]
ago, timed out [89910ms] ago, action
[/cluster/nodes/indices/shard/store/node], node [[Super
Rabbit][oDiF2fXITfm68BQTX8uKgw][inet[/192.168.5.13:9300]]{master=false}],
id [29559]
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119910ms]
ago, timed out [89910ms] ago, action
[/cluster/nodes/indices/shard/store/node], node
[[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300]]{master=false}],
id [29575]
[2011-10-05 00:42:18,012][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [59819ms]
ago, timed out [29819ms] ago, action
[/cluster/nodes/indices/shard/store/node], node
[[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300]]{master=false}],
id [32275]
[2011-10-05 00:42:18,012][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [59820ms]
ago, timed out [29820ms] ago, action
[/cluster/nodes/indices/shard/store/node], node
[[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300]]{master=true}],
id [32263]
[2011-10-05 00:42:18,023][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [89901ms]
ago, timed out [59901ms] ago, action
[/cluster/nodes/indices/shard/store/node], node
[[Mimic][E_VPjWIcRkKclDsy9La2zw][inet[/192.168.5.2:9300]]{master=true}],
id [30987]
[2011-10-05 00:42:18,027][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119200ms]
ago, timed out [74200ms] ago, action [discovery/zen/fd/ping], node
[[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300]]{master=false}],
id [29934]
[2011-10-05 00:42:18,028][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74200ms]
ago, timed out [29200ms] ago, action [discovery/zen/fd/ping], node
[[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300]]{master=false}],
id [31538]
[2011-10-05 00:42:18,028][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119708ms]
ago, timed out [74707ms] ago, action [discovery/zen/fd/ping], node
[[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300]]{master=false}],
id [29882]
[2011-10-05 00:42:18,029][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74708ms]
ago, timed out [29708ms] ago, action [discovery/zen/fd/ping], node
[[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300]]{master=false}],
id [31523]
[2011-10-05 00:42:18,036][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119886ms]
ago, timed out [74886ms] ago, action [discovery/zen/fd/ping], node
[[Baron Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300]]{master=false}],
id [29583]
[2011-10-05 00:42:18,036][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74886ms]
ago, timed out [29886ms] ago, action [discovery/zen/fd/ping], node
[[Baron Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300]]{master=false}],
id [31517]

Hey,

There shouldn't be a problem with rivers being created on the master
node. I'm wondering what the actual failure when creating the river
is; an exception stack trace should be logged as well. The hard part
here is that there is "outside" code whose behavior I don't know. Can
you maybe remove all the logic from your river and see what happens?

Hello Shay,

My best guess is the following: the cluster is in an inconsistent
state, and that's why some rivers start and others can't. They think
the river doesn't exist and try to create it, but this fails because
the river already exists somewhere else.

For example, searching sometimes fails with the following exception
(depending on the node):

Caused by: org.elasticsearch.transport.RemoteTransportException:
[Penance II][inet[/46.4.123.140:9300]][indices/search]
Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException:
Failed to execute phase [query], total failure; shardFailures
{RemoteTransportException[[Xi'an Chi
Xan][inet[/192.168.6.2:9300]][search/phase/query]]; nested:
IndexMissingException[[abc_000000000000] missing]; }
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:257)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$3.onFailure(TransportSearchTypeAction.java:210)
at org.elasticsearch.search.action.SearchServiceTransportAction$2.handleException(SearchServiceTransportAction.java:151)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:158)
at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:149)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:101)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:302)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:299)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:214)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:274)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:261)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The index is not missing, since the cluster health was yellow at that
time. Doing the same search on another server works without any
problems.

The head plugin also shows incorrect information about the running
rivers: they run on different servers than what's displayed in the
interface. These are the rivers that then fail when they try to save
their status.

E.g.:
abc-ddd [active] inet[/192.168.5.5:9300]
I can't find any entry for that river on that host, so the information
is either wrong or hasn't been updated.

The river actually runs on a completely different host, where it was
successfully started but fails to save its status.

The save function which triggers the exception looks like this:

private boolean createOrUpdateIndex(String mapping, String indexName,
        String typeName, boolean force) throws InterruptedException {
    if (usedIndices.contains(indexName)) return true;
    log.info("creating or updating the index: %s/%s", indexName, typeName);

    while (true) {
        try {
            // create the index (with the mapping) if we have one
            if (mapping != null)
                client.admin().indices().prepareCreate(indexName)
                        .addMapping(typeName, mapping).execute().actionGet(timeout);
            // record the index in our status index
            client.prepareIndex(SearchClient.index_indexstatus, conf.riverName, indexName)
                    .setSource("{\"status\":\"active\", \"reason\":\"created\", \"metainfo\":\"indexstatus\"}")
                    .execute().actionGet(timeout);
            log.info("created index");
            usedIndices.add(indexName);
            return true;
        } catch (Exception e) {
            if (ExceptionsHelper.unwrapCause(e) instanceof IndexAlreadyExistsException) {
                // index already exists, just update the mapping
                log.info("index %s already exists, trying to set mapping", indexName);
                try {
                    client.admin().indices().preparePutMapping(indexName)
                            .setType(typeName).setSource(mapping).execute().actionGet(timeout);
                    log.info("mapping set on existing index");
                    usedIndices.add(indexName);
                    return true;
                } catch (Exception e2) {
                    log.warn("failed to update mapping for %s, sleeping 10s", e2, indexName);
                    Thread.sleep(10000);
                }
            } else {
                log.warn("failed to create index %s, sleeping 10s", e, indexName);
                Thread.sleep(10000);
            }
        }
    }
}

It then fails:

[2011-10-05 14:19:09,511][WARN ][com.trendiction.modules.searchclient.river.TWDSIndexer]
failed to create index abc_000000000000, sleeping 10s
org.elasticsearch.ElasticSearchTimeoutException: Timeout waiting for task.
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:70)
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:59)
at com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:759)
at com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:734)
at com.trendiction.modules.searchclient.river.TWDSIndexer.run(SourceFile:603)
at java.lang.Thread.run(Thread.java:662)

I also tried the latest head version of the 0.17 branch; it didn't change anything.

Thanks,
Thibaut

It's still not clear whether your river implementation causes this or
not. I suggest changing the river implementation to do nothing and
seeing if the cluster gets into a stable state. If it does, then we
can walk through what the river does, see where the problem is, and
figure out what can be fixed in either elasticsearch or the river
implementation.
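
A minimal do-nothing river for such a test could look roughly like this (a
sketch assuming the standard 0.17 river plugin API; the class name is made up
and the plugin/module registration boilerplate is omitted):

import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.river.AbstractRiverComponent;
import org.elasticsearch.river.River;
import org.elasticsearch.river.RiverName;
import org.elasticsearch.river.RiverSettings;

// Hypothetical no-op river: it starts, logs, and never indexes or searches.
public class NoopRiver extends AbstractRiverComponent implements River {

    @Inject
    public NoopRiver(RiverName riverName, RiverSettings settings) {
        super(riverName, settings);
    }

    @Override
    public void start() {
        logger.info("noop river started, doing nothing");
    }

    @Override
    public void close() {
        logger.info("noop river closed");
    }
}

As far as one can tell from the warnings above, the river service itself still
writes the river's _status document into the _river index (that is where the
"failed to write failed status" messages come from), so this would exercise
the same _river writes while taking the custom indexing and searching logic
out of the picture.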

On Wed, Oct 5, 2011 at 2:43 PM, Thibaut Britz <thibaut.britz@trendiction.com

wrote:

Hello Shay,

My best guess is the following. It looks like the cluster is in an
inconsistent state, and that's why some rivers start and the other
can't start. They think that the river doesn't exist and try to create
it, but this fails due to the river already existing somewhere else.

E.g.
Searching sometimes fails with (depending on the node):

Caused by: org.elasticsearch.transport.RemoteTransportException:
[Penance II][inet[/46.4.123.140:9300]][indices/search]
Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException:
Failed to execute phase [query], total failure; shardFailures
{RemoteTransportException[[Xi'an Chi
Xan][inet[/192.168.6.2:9300]][search/phase/query]]; nested:
IndexMissingException[[abc_000000000000] missing]; }
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:257)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$3.onFailure(TransportSearchTypeAction.java:210)
at
org.elasticsearch.search.action.SearchServiceTransportAction$2.handleException(SearchServiceTransportAction.java:151)
at
org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:158)
at
org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:149)
at
org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:101)
at
org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)
at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:302)
at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)
at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:299)
at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:214)
at
org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:274)
at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:261)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
at
org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at
org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The index is not missing, since the cluster health was yellow at that
time. Doing the same search on another server works without any
problems.

The head interface also shows stale information about the running rivers.
They run on different servers than the ones displayed in the
interface. These are the rivers that then fail when they try to save
their status.

eg:
abc-ddd [active] inet[/192.168.5.5:9300]
I can't find any entry for that river on that host, so the entry is
either wrong or simply hasn't been updated.

The river actually runs on a completely different host, where it was
successfully started but then fails to save its status.

The save function which triggers the exception looks like this:

   private boolean createOrUpdateIndex(String mapping, String indexName,
           String typeName, boolean force) throws InterruptedException {
       if (usedIndices.contains(indexName)) return true;
       log.info("creating or updating the index: %s/%s", indexName, typeName);

       while (true) {
           try {
               // Create the index with the mapping (if we have one) ...
               if (mapping != null)
                   client.admin().indices().prepareCreate(indexName)
                           .addMapping(typeName, mapping)
                           .execute().actionGet(timeout);

               // ... and record its status in our bookkeeping index.
               client.prepareIndex(SearchClient.index_indexstatus, conf.riverName, indexName)
                       .setSource("{\"status\":\"active\", \"reason\":\"created\", \"metainfo\":\"indexstatus\"}")
                       .execute().actionGet(timeout);
               log.info("created index");
               usedIndices.add(indexName);
               return true;
           } catch (Exception e) {
               if (ExceptionsHelper.unwrapCause(e) instanceof IndexAlreadyExistsException) {
                   log.info("index %s already exists, trying to set mapping", indexName);
                   try {
                       client.admin().indices().preparePutMapping(indexName)
                               .setType(typeName).setSource(mapping)
                               .execute().actionGet(timeout);
                       log.info("mapping set on existing index");
                       usedIndices.add(indexName);
                       return true;
                   } catch (Exception e2) {
                       log.warn("failed to update mapping for %s, sleeping 10s", e2, indexName);
                       Thread.sleep(10000);
                   }
               } else {
                   log.warn("failed to create index %s, sleeping 10s", e, indexName);
                   Thread.sleep(10000);
               }
           }
       }
   }

It then fails:

[2011-10-05 14:19:09,511][WARN ][com.trendiction.modules.searchclient.river.TWDSIndexer] failed to create index abc_000000000000, sleeping 10s
org.elasticsearch.ElasticSearchTimeoutException: Timeout waiting for task.
at
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:70)
at
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:59)
at
com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:759)
at
com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:734)
at
com.trendiction.modules.searchclient.river.TWDSIndexer.run(SourceFile:603)
at java.lang.Thread.run(Thread.java:662)

I also tried the latest head version of 0.17; it didn't change anything.

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 11:55 AM, Shay Banon kimchy@gmail.com wrote:

Hey,
There shouldn't be a problem with rivers being created on the master
node. I'm wondering what the actual failure when creating the river is; an
exception stack trace should be logged as well. The hard problem here is
that there is "outside" code that I don't really know, so can you
maybe remove all logic from your river and see what happens?

On Wed, Oct 5, 2011 at 2:36 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

I might be wrong, but I think it's due to rivers also being
started on masters:

I have the following config on my master node:
node.master: true
node.data: false
Rivers are still being started on the master node however.

Is it possible to skip the master for rivers as well?

I had the following log file entries, which seem to indicate that the
river that couldn't be started was blocking the master node (all the
expiration requests arrive at the same time):

[2011-10-05 00:41:18,006][WARN ][river ] [Frog-Man]
failed to create river
[trendictionrebuilder][dsearch_prosearch-deloitte-cactusnew_rebuilder]
[2011-10-05 00:42:18,008][WARN ][river ] [Frog-Man]
failed to write failed status for river creation
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119909ms]
ago, timed out [89909ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300
]]{master=true}],
id [29563]
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119910ms]
ago, timed out [89910ms] ago, action
[/cluster/nodes/indices/shard/store/node], node [[Super
Rabbit][oDiF2fXITfm68BQTX8uKgw][inet[/192.168.5.13:9300
]]{master=false}],
id [29559]
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119910ms]
ago, timed out [89910ms] ago, action
[/cluster/nodes/indices/shard/store/node], node
[[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300
]]{master=false}],
id [29575]
[2011-10-05 00:42:18,012][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [59819ms]
ago, timed out [29819ms] ago, action
[/cluster/nodes/indices/shard/store/node], node
[[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300
]]{master=false}],
id [32275]
[2011-10-05 00:42:18,012][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [59820ms]
ago, timed out [29820ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300
]]{master=true}],
id [32263]
[2011-10-05 00:42:18,023][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [89901ms]
ago, timed out [59901ms] ago, action
[/cluster/nodes/indices/shard/store/node], node
[[Mimic][E_VPjWIcRkKclDsy9La2zw][inet[/192.168.5.2:9300
]]{master=true}],
id [30987]
[2011-10-05 00:42:18,027][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119200ms]
ago, timed out [74200ms] ago, action [discovery/zen/fd/ping], node

[[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300
]]{master=false}],
id [29934]
[2011-10-05 00:42:18,028][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74200ms]
ago, timed out [29200ms] ago, action [discovery/zen/fd/ping], node

[[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300
]]{master=false}],
id [31538]
[2011-10-05 00:42:18,028][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119708ms]
ago, timed out [74707ms] ago, action [discovery/zen/fd/ping], node
[[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300
]]{master=false}],
id [29882]
[2011-10-05 00:42:18,029][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74708ms]
ago, timed out [29708ms] ago, action [discovery/zen/fd/ping], node
[[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300
]]{master=false}],
id [31523]
[2011-10-05 00:42:18,036][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119886ms]
ago, timed out [74886ms] ago, action [discovery/zen/fd/ping], node
[[Baron
Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300
]]{master=false}],
id [29583]
[2011-10-05 00:42:18,036][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74886ms]
ago, timed out [29886ms] ago, action [discovery/zen/fd/ping], node
[[Baron
Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300
]]{master=false}],
id [31517]

On Wed, Oct 5, 2011 at 12:31 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Just one last comment:

The cluster also never reaches an equilibrium. There are always shards
that are not assigned/initialized and must be reassigned (the cluster
state constantly flips between red and yellow). It is never green.
{
  "cluster_name" : "abc",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 40,
  "number_of_data_nodes" : 40,
  "active_primary_shards" : 229,
  "active_shards" : 452,
  "relocating_shards" : 0,
  "initializing_shards" : 25,
  "unassigned_shards" : 4
}
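
(For what it's worth, the flapping can also be watched from the Java client instead of the REST endpoint. This is only a rough sketch; the poll interval is arbitrary and the accessor names are from memory of the old client API:)

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class HealthWatcher {
    // Poll cluster health once a minute and print the shard counters, so the
    // red/yellow transitions end up in one log with timestamps.
    public static void watch(Client client) throws InterruptedException {
        while (true) {
            ClusterHealthResponse health = client.admin().cluster()
                    .prepareHealth()
                    .setTimeout(TimeValue.timeValueSeconds(30))
                    .execute().actionGet();
            System.out.println(System.currentTimeMillis()
                    + " status=" + health.status()
                    + " active=" + health.activeShards()
                    + " initializing=" + health.initializingShards()
                    + " unassigned=" + health.unassignedShards());
            Thread.sleep(60000);
        }
    }
}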

On Wed, Oct 5, 2011 at 12:12 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

After a while, RoutingValidationExceptions also turn up:

org.elasticsearch.cluster.routing.RoutingValidationException: [Index [dsearch_en_001859508000]: Shard [0] routing table has wrong number of replicas, expected [1], got [0]]
        at org.elasticsearch.cluster.routing.RoutingTable.validateRaiseException(RoutingTable.java:97)
        at org.elasticsearch.cluster.routing.allocation.ShardsAllocation.applyStartedShards(ShardsAllocation.java:77)
        at org.elasticsearch.cluster.action.shard.ShardStateAction$4.execute(ShardStateAction.java:188)
        at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:189)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 12:11 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Hi,

We have a farily large elasticsearch cluster (40 nodes) in
production
and recently upgraded to 0.17.7. We have many rivers rivers (custom
plugin, which also searches the same cluster while indexing) running
on that cluster and 400 indexes (including all replicas and shards).
Shards are set to 1 by default on all indexes. All indexes have at
least 1 replica, the important ones 4 or 5. All data is stored
locally.

The cluster runs extremely unstable. (Also on the previous version)
Queries sometimes execute fast, then they are dead slow or actions
just timeout after a while (I suspect local elasticsearch timeouts).
Cluster state switches from yellow to red every few minutes, while
the
load on all the machines is very very low.

Also when starting, sometimes, some indexes are not opened correctly
(_status show a NullpointerException). Closing and reopneing the
index
fixes this issue.

After a few minutes, the master, after being elected and while
distributing the loading of the local indexes, gets a few of these
exceptions

[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [YH9V43nrSnGsBWc5VOrkGw]
org.elasticsearch.transport.ReceiveTimeoutTransportException:

[Zaladane][inet[/192.168.6.10:9300
]][/gateway/local/started-shards/node]
request_id [48654] timed out after [30001ms]
at

org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)

at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [Vi2_pzU0RYqG1iuietEfvw]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[Lazarus][inet[/192.168.5.3:9300
]][/gateway/local/started-shards/node]
request_id [48664] timed out after [30000ms]
at

org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)

at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [0wYatYDkRSCfmnij5OxjQg]
org.elasticsearch.transport.ReceiveTimeoutTransportException:

[Rodstvow][inet[/192.168.6.11:9300
]][/gateway/local/started-shards/node]
request_id [48669] timed out after [30000ms]
at

org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)

at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)

The nodes only have these log file entries:

[2011-10-05 00:05:50,920][WARN ][river ]
[Zaladane]
failed to create river [abc][dsearch_other_1-3]
org.elasticsearch.action.UnavailableShardsException: [_river][0] [5]
shardIt, [1] active : Timeout waiting for [1m], request: index
{[_river][dsearch_other_1-3][_status],

source[{"ok":true,"node":{"id":"YH9V43nrSnGsBWc5VOrkGw","name":"Zaladane","transport_address":"inet[/
192.168.6.10:9300]"}}]}

   at

org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)

   at

org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)

   at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

   at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

   at java.lang.Thread.run(Thread.java:662)

[2011-10-05 00:06:50,921][WARN ][river ]
[Zaladane]
failed to write failed status for river creation
org.elasticsearch.action.UnavailableShardsException: [_river][0] [5]
shardIt, [1] active : Timeout waiting for [1m], request: index
{[_river][dsearch_other_1-3][_status],

source[{"ok":true,"node":{"id":"YH9V43nrSnGsBWc5VOrkGw","name":"Zaladane","transport_address":"inet[/
192.168.6.10:9300]"}}]}

   at

org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)

   at

org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)

   at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

   at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

   at java.lang.Thread.run(Thread.java:662)

There are no ERROR log file entries in the log.

Any idea what might cause this? How can I turn on more agressive
logging?

Thanks,
Thibaut

Before doing the change, I even managed on a restart to get a completely
healthy cluster (green), with all rivers running! But after restarting,
I had the same problems again.

I then made the code changes and updated our code to exit right away in
the run() function of all of our 3 rivers. The cluster still doesn't come
up properly. The rivers seem to start even before _river is available.

The rivers still won't start, with the same exception:
[2011-10-05 16:00:20,974][WARN ][river ] [Ironclad]
failed to create river [trendiction][dsearch_abc]
org.elasticsearch.action.UnavailableShardsException: [_river][0] [5]
shardIt, [1] active : Timeout waiting for [1m], request: index
{[_river][dsearch_abc][_status],
source[{"ok":true,"node":{"id":"DNo8F8NXQEenx2ZaApGecQ","name":"Ironclad","transport_address":"inet[/192.168.5.2:9300]"}}]}
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

How is the cluster state stored? Will the default replica values also be
used for the cluster state (in our case just 1 replica and 1 shard)? Might
this be an issue?
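
(One thing we could try on our side: before a river does any work, bump the replica count of the _river index and wait until its shard is actually allocated. A sketch only, assuming the update-settings and health-wait builders behave in 0.17 the way I remember; the replica count and timeout are arbitrary:)

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.unit.TimeValue;

public class RiverIndexGuard {
    // Give the _river index more replicas and block until at least the
    // primary shard is allocated (yellow), so status writes have somewhere to go.
    public static boolean prepareRiverIndex(Client client) {
        client.admin().indices().prepareUpdateSettings("_river")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_replicas", 2)
                        .build())
                .execute().actionGet();

        ClusterHealthResponse health = client.admin().cluster()
                .prepareHealth("_river")
                .setWaitForYellowStatus()
                .setTimeout(TimeValue.timeValueMinutes(2))
                .execute().actionGet();
        return !health.timedOut();
    }
}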

Here is our default config from one of the master nodes:

# Configuration Template for ES

# Cluster Settings

cluster:
  name: abc

path:
  logs: /var/log/elasticsearch/
  work: /hd2/elasticsearch_work/
  data: /hd2/elasticsearch_data/

index:
  number_of_replicas: 1
  number_of_shards: 1
  query:
    bool:
      max_clause_count: 10240
  analysis:
    analyzer:
      abc:
        type: standard
        filter: [standard, lowercase, stop, asciifolding]
      abc2:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop, asciifolding]
  cache:
    filter:
      type: none

# Gateway Settings

gateway:
  recover_after_nodes: 35
  recover_after_time: 2m
  expected_nodes: 40

gateway.local.compress: false
gateway.local.pretty: true

discovery:
  zen.ping_timeout: 20s
  zen.minimum_master_nodes: 3
  zen.ping:
    multicast:
      enabled: false
    unicast:
      hosts: intr6n1[9300],intr6n2[9300],intr6n3[9300],intr6n4[9300],intr6n5[9300]
  zen.fd.ping_timeout: 45s

script.native.relevance.type: com.abc.modules.searchclient.scoring.RelevanceScore

node.master: true

node.data: false

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 2:49 PM, Shay Banon kimchy@gmail.com wrote:

Its still not clear if the river implementation you have causes this or not.
I suggest changing the river implementation you have to do nothing, and see
if the cluster gets into a stable state. If it does, then we can walk
through what the river does and see where the problem is, and what can be
fixed on either elasticsearch or the river implementation.

On Wed, Oct 5, 2011 at 2:43 PM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Hello Shay,

My best guess is the following. It looks like the cluster is in an
inconsistent state, and that's why some rivers start and the other
can't start. They think that the river doesn't exist and try to create
it, but this fails due to the river already existing somewhere else.

E.g.
Searching sometimes fails with (depending on the node):

Caused by: org.elasticsearch.transport.RemoteTransportException:
[Penance II][inet[/46.4.123.140:9300]][indices/search]
Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException:
Failed to execute phase [query], total failure; shardFailures
{RemoteTransportException[[Xi'an Chi
Xan][inet[/192.168.6.2:9300]][search/phase/query]]; nested:
IndexMissingException[[abc_000000000000] missing]; }
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:257)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$3.onFailure(TransportSearchTypeAction.java:210)
at
org.elasticsearch.search.action.SearchServiceTransportAction$2.handleException(SearchServiceTransportAction.java:151)
at
org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:158)
at
org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:149)
at
org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:101)
at
org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)
at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:302)
at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)
at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:299)
at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:214)
at
org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:274)
at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:261)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
at
org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at
org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The index is not missing, since the cluster health was yellow at that
time. Doing the same search on another server works without any
problems.

The head also contains invalid information about the running rivvers.
They run on different servers opposed to what's displayed in the
interface. These are the rivers that fail then when they try to save
their status.

eg:
abc-ddd [active] inet[/192.168.5.5:9300]
I can't find any entry for that river on that host, so it's either
wrong (or hasn't been updated).

It runs on a completely different host, where it was sucessfully
started but fails to save its status:

The save function which triggers the exception looks like this:

   private boolean createOrUpdateIndex(String mapping, String indexName,
           String typeName, boolean force) throws InterruptedException {
       if (usedIndices.contains(indexName)) return true;
       log.info("creating or updating the index: %s/%s", indexName, typeName);

       while (true) {
           try {
               // Create the index with the mapping (if we have one) ...
               if (mapping != null)
                   client.admin().indices().prepareCreate(indexName)
                           .addMapping(typeName, mapping)
                           .execute().actionGet(timeout);

               // ... and record its status in our bookkeeping index.
               client.prepareIndex(SearchClient.index_indexstatus, conf.riverName, indexName)
                       .setSource("{\"status\":\"active\", \"reason\":\"created\", \"metainfo\":\"indexstatus\"}")
                       .execute().actionGet(timeout);
               log.info("created index");
               usedIndices.add(indexName);
               return true;
           } catch (Exception e) {
               if (ExceptionsHelper.unwrapCause(e) instanceof IndexAlreadyExistsException) {
                   log.info("index %s already exists, trying to set mapping", indexName);
                   try {
                       client.admin().indices().preparePutMapping(indexName)
                               .setType(typeName).setSource(mapping)
                               .execute().actionGet(timeout);
                       log.info("mapping set on existing index");
                       usedIndices.add(indexName);
                       return true;
                   } catch (Exception e2) {
                       log.warn("failed to update mapping for %s, sleeping 10s", e2, indexName);
                       Thread.sleep(10000);
                   }
               } else {
                   log.warn("failed to create index %s, sleeping 10s", e, indexName);
                   Thread.sleep(10000);
               }
           }
       }
   }

It then fails:

[2011-10-05 14:19:09,511][^[[33mWARN
^[[0m][com.trendiction.modules.searchclient.river.TWDSIndexer] failed
to create index abc_000000000000, sleeping 10s
org.elasticsearch.ElasticSearchTimeoutException: Timeout waiting for task.
at
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:70)
at
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:59)
at
com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:759)
at
com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:734)
at
com.trendiction.modules.searchclient.river.TWDSIndexer.run(SourceFile:603)
at java.lang.Thread.run(Thread.java:662)

I also tried the latest head version of 0.17, it didn't change anything

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 11:55 AM, Shay Banon kimchy@gmail.com wrote:

Hey,
There shouldn't be a problem with rivers being created on the master
node. Wondering what the actual failure when creating the river is, an
exception stack trace should be logged as well. The hard problem here is
that there is "outside" code that I don't really know what it does, can
you
maybe remove all logic from your river and see what happens?

On Wed, Oct 5, 2011 at 2:36 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

I might be wrong, but I think I think it's due to rivers also being
started on masters:

I have the following config on my master node:
node.master: true
node.data: false
Rivers are still being started on the master node however.

Is it possible to skip the master for rivers as well?

I had the following log file entries, which indicate that the river
which couldn't be started was blocking the master node? (all
expiration requests arrive at the same time)

[2011-10-05 00:41:18,006][WARN ][river ] [Frog-Man]
failed to create river
[trendictionrebuilder][dsearch_prosearch-deloitte-cactusnew_rebuilder]
[2011-10-05 00:42:18,008][WARN ][river ] [Frog-Man]
failed to write failed status for river creation
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119909ms]
ago, timed out [89909ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300]]{master=true}],
id [29563]
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119910ms]
ago, timed out [89910ms] ago, action
[/cluster/nodes/indices/shard/store/node], node [[Super

Rabbit][oDiF2fXITfm68BQTX8uKgw][inet[/192.168.5.13:9300]]{master=false}],
id [29559]
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119910ms]
ago, timed out [89910ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300]]{master=false}],
id [29575]
[2011-10-05 00:42:18,012][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [59819ms]
ago, timed out [29819ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300]]{master=false}],
id [32275]
[2011-10-05 00:42:18,012][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [59820ms]
ago, timed out [29820ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300]]{master=true}],
id [32263]
[2011-10-05 00:42:18,023][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [89901ms]
ago, timed out [59901ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Mimic][E_VPjWIcRkKclDsy9La2zw][inet[/192.168.5.2:9300]]{master=true}],
id [30987]
[2011-10-05 00:42:18,027][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119200ms]
ago, timed out [74200ms] ago, action [discovery/zen/fd/ping], node

[[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300]]{master=false}],
id [29934]
[2011-10-05 00:42:18,028][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74200ms]
ago, timed out [29200ms] ago, action [discovery/zen/fd/ping], node

[[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300]]{master=false}],
id [31538]
[2011-10-05 00:42:18,028][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119708ms]
ago, timed out [74707ms] ago, action [discovery/zen/fd/ping], node

[[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300]]{master=false}],
id [29882]
[2011-10-05 00:42:18,029][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74708ms]
ago, timed out [29708ms] ago, action [discovery/zen/fd/ping], node

[[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300]]{master=false}],
id [31523]
[2011-10-05 00:42:18,036][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [119886ms]
ago, timed out [74886ms] ago, action [discovery/zen/fd/ping], node
[[Baron

Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300]]{master=false}],
id [29583]
[2011-10-05 00:42:18,036][WARN ][transport ] [Frog-Man]
Received response for a request that has timed out, sent [74886ms]
ago, timed out [29886ms] ago, action [discovery/zen/fd/ping], node
[[Baron

Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300]]{master=false}],
id [31517]

On Wed, Oct 5, 2011 at 12:31 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Just a last comment:

The cluster also never reaches an equilibrium. There are always
shards
that are not assigned/initialized and must be reassigned: (Cluster
state red/yellow changes constantly). It is never green.
{
cluster_name: abc
status: red
timed_out: false
number_of_nodes: 40
number_of_data_nodes: 40
active_primary_shards: 229
active_shards: 452
relocating_shards: 0
initializing_shards: 25
unassigned_shards: 4
}

On Wed, Oct 5, 2011 at 12:12 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

After a while,

RoutingValidationException also turn up:

org.elasticsearch.cluster.routing.RoutingValidationException: [Index
[dsearch_en_001859508000]: Shard [0] routing table has wrong number
of
replicas, expected [1], got [0]]
at

org.elasticsearch.cluster.routing.RoutingTable.validateRaiseException(RoutingTable.java:97)
at

org.elasticsearch.cluster.routing.allocation.ShardsAllocation.applyStartedShards(ShardsAllocation.java:77)
at

org.elasticsearch.cluster.action.shard.ShardStateAction$4.execute(ShardStateAction.java:188)
at

org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:189)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 12:11 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Hi,

We have a farily large elasticsearch cluster (40 nodes) in
production
and recently upgraded to 0.17.7. We have many rivers rivers (custom
plugin, which also searches the same cluster while indexing)
running
on that cluster and 400 indexes (including all replicas and
shards).
Shards are set to 1 by default on all indexes. All indexes have at
least 1 replica, the important ones 4 or 5. All data is stored
locally.

The cluster runs extremely unstable. (Also on the previous version)
Queries sometimes execute fast, then they are dead slow or actions
just timeout after a while (I suspect local elasticsearch
timeouts).
Cluster state switches from yellow to red every few minutes, while
the
load on all the machines is very very low.

Also when starting, sometimes, some indexes are not opened
correctly
(_status show a NullpointerException). Closing and reopneing the
index
fixes this issue.

After a few minutes, the master, after being elected and while
distributing the loading of the local indexes, gets a few of these
exceptions

[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [YH9V43nrSnGsBWc5VOrkGw]
org.elasticsearch.transport.ReceiveTimeoutTransportException:

[Zaladane][inet[/192.168.6.10:9300]][/gateway/local/started-shards/node]
request_id [48654] timed out after [30001ms]
at

org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [Vi2_pzU0RYqG1iuietEfvw]
org.elasticsearch.transport.ReceiveTimeoutTransportException:

[Lazarus][inet[/192.168.5.3:9300]][/gateway/local/started-shards/node]
request_id [48664] timed out after [30000ms]
at

org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-05 00:06:47,124][DEBUG][gateway.local ]
[Geirrodur] failed to execute on node [0wYatYDkRSCfmnij5OxjQg]
org.elasticsearch.transport.ReceiveTimeoutTransportException:

[Rodstvow][inet[/192.168.6.11:9300]][/gateway/local/started-shards/node]
request_id [48669] timed out after [30000ms]
at

org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:315)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The nodes only have these log file entries:

[2011-10-05 00:05:50,920][WARN ][river ]
[Zaladane]
failed to create river [abc][dsearch_other_1-3]
org.elasticsearch.action.UnavailableShardsException: [_river][0]
[5]
shardIt, [1] active : Timeout waiting for [1m], request: index
{[_river][dsearch_other_1-3][_status],

source[{"ok":true,"node":{"id":"YH9V43nrSnGsBWc5VOrkGw","name":"Zaladane","transport_address":"inet[/192.168.6.10:9300]"}}]}
at

org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)
at

org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-05 00:06:50,921][WARN ][river ]
[Zaladane]
failed to write failed status for river creation
org.elasticsearch.action.UnavailableShardsException: [_river][0]
[5]
shardIt, [1] active : Timeout waiting for [1m], request: index
{[_river][dsearch_other_1-3][_status],

source[{"ok":true,"node":{"id":"YH9V43nrSnGsBWc5VOrkGw","name":"Zaladane","transport_address":"inet[/192.168.6.10:9300]"}}]}
at

org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)
at

org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

There are no ERROR log file entries in the log.

Any idea what might cause this? How can I turn on more agressive
logging?

Thanks,
Thibaut

How many shards and replicas do you have for the _river index? It defaults
to 1/1 if not set, and in your config you have 1/1, but I wonder if you set
a different value for it?

The failure is strange... Just before creating the river, a GET is executed
against the _river/[river_name]/_meta document in order to get the river
configuration. With the result of that, the river gets created, and then
a _river/[river_name]/_status document is written. This is where it fails:
writing the status document times out because there are no shards available
(on a node that just managed to read the _meta)...
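
(In client terms the sequence is roughly the following; just an illustration of the flow described above, not the actual river service code, and the method names are from the old client API to the best of my memory:)

import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;

public class RiverStartupFlow {
    // 1) read the river definition, 2) start the river, 3) record its status.
    public static void startRiver(Client client, String riverName) {
        GetResponse meta = client.prepareGet("_river", riverName, "_meta")
                .execute().actionGet();
        if (!meta.exists()) {
            return; // no river definition, nothing to start
        }

        // ... instantiate and start the river from meta.sourceAsMap() ...

        // This is the write that times out in the logs above.
        client.prepareIndex("_river", riverName, "_status")
                .setSource("{\"ok\":true}")
                .execute().actionGet();
    }
}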

It's a very long shot that this relates to the problem you see, but I noticed
a problem in how that GET gets executed in terms of the threading model. I
pushed a fix to both the 0.17 branch and master; can you give the 0.17 branch
a go? If you need a build, just ping me over IRC.

-shay.banon

On Wed, Oct 5, 2011 at 4:08 PM, Thibaut Britz <thibaut.britz@trendiction.com> wrote:

Before doing the change, I even managed on a restart to get a complete
healty cluster (green), with all rivers running! But after restarting,
I had the same problems again.

I then made the code changes and updated our code to directly exit in
the run() funtion on all our 3 rivvers. The cluster doesn't properly
come up. The rivers seem to start even before _river is availble..

The rivers still won't start, with the same exception:
[2011-10-05 16:00:20,974][WARN ][river ] [Ironclad]
failed to create river [trendiction][dsearch_abc]
org.elasticsearch.action.UnavailableShardsException: [_river][0] [5]
shardIt, [1] active : Timeout waiting for [1m], request: index
{[_river][dsearch_abc][_status],

source[{"ok":true,"node":{"id":"DNo8F8NXQEenx2ZaApGecQ","name":"Ironclad","transport_address":"inet[/
192.168.5.2:9300]"}}]}
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)
at
org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

How is the cluster state stored? will the default replica values be
used also for the cluster state (in our case just 1, 1 shard). Might
this be an issue?

Here is our default config from one of the master nodes:

# Configuration Template for ES

# Cluster Settings

cluster:
  name: abc

path:
  logs: /var/log/elasticsearch/
  work: /hd2/elasticsearch_work/
  data: /hd2/elasticsearch_data/

index:
  number_of_replicas: 1
  number_of_shards: 1
  query:
    bool:
      max_clause_count: 10240
  analysis:
    analyzer:
      abc:
        type: standard
        filter: [standard, lowercase, stop, asciifolding]
      abc2:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop, asciifolding]
  cache:
    filter:
      type: none

# Gateway Settings

gateway:
  recover_after_nodes: 35
  recover_after_time: 2m
  expected_nodes: 40

gateway.local.compress: false
gateway.local.pretty: true

discovery:
  zen.ping_timeout: 20s
  zen.minimum_master_nodes: 3
  zen.ping:
    multicast:
      enabled: false
    unicast:
      hosts: intr6n1[9300],intr6n2[9300],intr6n3[9300],intr6n4[9300],intr6n5[9300]
  zen.fd.ping_timeout: 45s

script.native.relevance.type: com.abc.modules.searchclient.scoring.RelevanceScore

node.master: true

node.data: false

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 2:49 PM, Shay Banon kimchy@gmail.com wrote:

Its still not clear if the river implementation you have causes this or
not.
I suggest changing the river implementation you have to do nothing, and
see
if the cluster gets into a stable state. If it does, then we can walk
through what the river does and see where the problem is, and what can be
fixed on either elasticsearch or the river implementation.

On Wed, Oct 5, 2011 at 2:43 PM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Hello Shay,

My best guess is the following. It looks like the cluster is in an
inconsistent state, and that's why some rivers start and the other
can't start. They think that the river doesn't exist and try to create
it, but this fails due to the river already existing somewhere else.

E.g.
Searching sometimes fails with (depending on the node):

Caused by: org.elasticsearch.transport.RemoteTransportException:
[Penance II][inet[/46.4.123.140:9300]][indices/search]
Caused by:
org.elasticsearch.action.search.SearchPhaseExecutionException:
Failed to execute phase [query], total failure; shardFailures
{RemoteTransportException[[Xi'an Chi
Xan][inet[/192.168.6.2:9300]][search/phase/query]]; nested:
IndexMissingException[[abc_000000000000] missing]; }
at

org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:257)

   at

org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$3.onFailure(TransportSearchTypeAction.java:210)

   at

org.elasticsearch.search.action.SearchServiceTransportAction$2.handleException(SearchServiceTransportAction.java:151)

   at

org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:158)

   at

org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:149)

   at

org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:101)

   at

org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)

   at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

   at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)

   at

org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:302)

   at

org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)

   at

org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:299)

   at

org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:214)

   at

org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)

   at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

   at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)

   at

org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:274)

   at

org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:261)

   at

org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)

   at

org.elasticsearch.common.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)

   at

org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)

   at

org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)

   at

org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)

   at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

   at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

   at java.lang.Thread.run(Thread.java:662)

The index is not missing, since the cluster health was yellow at that
time. Doing the same search on another server works without any
problems.

The head also contains invalid information about the running rivvers.
They run on different servers opposed to what's displayed in the
interface. These are the rivers that fail then when they try to save
their status.

eg:
abc-ddd [active] inet[/192.168.5.5:9300]
I can't find any entry for that river on that host, so it's either
wrong (or hasn't been updated).

It runs on a completely different host, where it was sucessfully
started but fails to save its status:

The save function which triggers the exception looks like this:

   private boolean createOrUpdateIndex(String mapping, String indexName,
           String typeName, boolean force) throws InterruptedException {
       if (usedIndices.contains(indexName)) return true;
       log.info("creating or updating the index: %s/%s", indexName, typeName);

       while (true) {
           try {
               // Create the index with the mapping (if we have one) ...
               if (mapping != null)
                   client.admin().indices().prepareCreate(indexName)
                           .addMapping(typeName, mapping)
                           .execute().actionGet(timeout);

               // ... and record its status in our bookkeeping index.
               client.prepareIndex(SearchClient.index_indexstatus, conf.riverName, indexName)
                       .setSource("{\"status\":\"active\", \"reason\":\"created\", \"metainfo\":\"indexstatus\"}")
                       .execute().actionGet(timeout);
               log.info("created index");
               usedIndices.add(indexName);
               return true;
           } catch (Exception e) {
               if (ExceptionsHelper.unwrapCause(e) instanceof IndexAlreadyExistsException) {
                   log.info("index %s already exists, trying to set mapping", indexName);
                   try {
                       client.admin().indices().preparePutMapping(indexName)
                               .setType(typeName).setSource(mapping)
                               .execute().actionGet(timeout);
                       log.info("mapping set on existing index");
                       usedIndices.add(indexName);
                       return true;
                   } catch (Exception e2) {
                       log.warn("failed to update mapping for %s, sleeping 10s", e2, indexName);
                       Thread.sleep(10000);
                   }
               } else {
                   log.warn("failed to create index %s, sleeping 10s", e, indexName);
                   Thread.sleep(10000);
               }
           }
       }
   }

It then fails:

[2011-10-05 14:19:09,511][^[[33mWARN
^[[0m][com.trendiction.modules.searchclient.river.TWDSIndexer] failed
to create index abc_000000000000, sleeping 10s
org.elasticsearch.ElasticSearchTimeoutException: Timeout waiting for
task.
at

org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:70)

   at

org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:59)

   at

com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:759)
at
com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:734)
at

com.trendiction.modules.searchclient.river.TWDSIndexer.run(SourceFile:603)

   at java.lang.Thread.run(Thread.java:662)

I also tried the latest head version of 0.17, it didn't change anything

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 11:55 AM, Shay Banon kimchy@gmail.com wrote:

Hey,
There shouldn't be a problem with rivers being created on the
master
node. Wondering what the actual failure when creating the river is, an
exception stack trace should be logged as well. The hard problem here
is
that there is "outside" code that I don't really know what it does,
can
you
maybe remove all logic from your river and see what happens?

On Wed, Oct 5, 2011 at 2:36 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

I might be wrong, but I think I think it's due to rivers also being
started on masters:

I have the following config on my master node:
node.master: true
node.data: false
Rivers are still being started on the master node however.

Is it possible to skip the master for rivers as well?

I had the following log file entries, which indicate that the river
which couldn't be started was blocking the master node? (all
expiration requests arrive at the same time)

[2011-10-05 00:41:18,006][WARN ][river ]
[Frog-Man]
failed to create river

[trendictionrebuilder][dsearch_prosearch-deloitte-cactusnew_rebuilder]

[2011-10-05 00:42:18,008][WARN ][river ]
[Frog-Man]
failed to write failed status for river creation
[2011-10-05 00:42:18,011][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [119909ms]
ago, timed out [89909ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300
]]{master=true}],
id [29563]
[2011-10-05 00:42:18,011][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [119910ms]
ago, timed out [89910ms] ago, action
[/cluster/nodes/indices/shard/store/node], node [[Super

Rabbit][oDiF2fXITfm68BQTX8uKgw][inet[/192.168.5.13:9300
]]{master=false}],
id [29559]
[2011-10-05 00:42:18,011][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [119910ms]
ago, timed out [89910ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300
]]{master=false}],
id [29575]
[2011-10-05 00:42:18,012][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [59819ms]
ago, timed out [29819ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300
]]{master=false}],
id [32275]
[2011-10-05 00:42:18,012][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [59820ms]
ago, timed out [29820ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300
]]{master=true}],
id [32263]
[2011-10-05 00:42:18,023][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [89901ms]
ago, timed out [59901ms] ago, action
[/cluster/nodes/indices/shard/store/node], node

[[Mimic][E_VPjWIcRkKclDsy9La2zw][inet[/192.168.5.2:9300
]]{master=true}],
id [30987]
[2011-10-05 00:42:18,027][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [119200ms]
ago, timed out [74200ms] ago, action [discovery/zen/fd/ping], node

[[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300
]]{master=false}],
id [29934]
[2011-10-05 00:42:18,028][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [74200ms]
ago, timed out [29200ms] ago, action [discovery/zen/fd/ping], node

[[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300
]]{master=false}],
id [31538]
[2011-10-05 00:42:18,028][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [119708ms]
ago, timed out [74707ms] ago, action [discovery/zen/fd/ping], node

[[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300
]]{master=false}],
id [29882]
[2011-10-05 00:42:18,029][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [74708ms]
ago, timed out [29708ms] ago, action [discovery/zen/fd/ping], node

[[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300
]]{master=false}],
id [31523]
[2011-10-05 00:42:18,036][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [119886ms]
ago, timed out [74886ms] ago, action [discovery/zen/fd/ping], node
[[Baron

Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300
]]{master=false}],
id [29583]
[2011-10-05 00:42:18,036][WARN ][transport ]
[Frog-Man]
Received response for a request that has timed out, sent [74886ms]
ago, timed out [29886ms] ago, action [discovery/zen/fd/ping], node
[[Baron

Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300
]]{master=false}],
id [31517]

On Wed, Oct 5, 2011 at 12:31 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Just a last comment:

The cluster also never reaches an equilibrium. There are always
shards
that are not assigned/initialized and must be reassigned: (Cluster
state red/yellow changes constantly). It is never green.
{
cluster_name: abc
status: red
timed_out: false
number_of_nodes: 40
number_of_data_nodes: 40
active_primary_shards: 229
active_shards: 452
relocating_shards: 0
initializing_shards: 25
unassigned_shards: 4
}

On Wed, Oct 5, 2011 at 12:12 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

After a while,

RoutingValidationException also turn up:

org.elasticsearch.cluster.routing.RoutingValidationException:
[Index
[dsearch_en_001859508000]: Shard [0] routing table has wrong
number
of
replicas, expected [1], got [0]]
at

org.elasticsearch.cluster.routing.RoutingTable.validateRaiseException(RoutingTable.java:97)

   at

org.elasticsearch.cluster.routing.allocation.ShardsAllocation.applyStartedShards(ShardsAllocation.java:77)

   at

org.elasticsearch.cluster.action.shard.ShardStateAction$4.execute(ShardStateAction.java:188)

   at

org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:189)

   at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

   at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

   at java.lang.Thread.run(Thread.java:662)

Thanks,
Thibaut

Just to be clear, that's the GET of the _meta doc, not a problem in the
GET API.

On Wed, Oct 5, 2011 at 5:30 PM, Shay Banon kimchy@gmail.com wrote:

How many shards and replicas do you have for the _river index? It
defaults to 1/1 if not set, and in your config you have 1/1, but I
wonder if you set a different value for it?

The failure is strange... Just before creating the index, a GET is
executed against the _river/[river_name]/_meta document in order to get
the river configuration. With the result of it, the river gets created,
and then a _river/[river_name]/_status document is written. This is
where it fails in writing the document, because there are no shards
available (for a node that just managed to read the _meta)...
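
For illustration, the sequence described above maps roughly onto the
following Java client calls (a sketch against the 0.17-era Java API as I
recall it; checkRiverRegistration and its arguments are illustrative
names, not part of elasticsearch):

    import org.elasticsearch.action.get.GetResponse;
    import org.elasticsearch.client.Client;

    // Rough sketch of the _meta/_status sequence around river creation.
    public void checkRiverRegistration(Client client, String riverName) {
        // 1. The river configuration is read from _river/[river_name]/_meta.
        GetResponse meta = client.prepareGet("_river", riverName, "_meta")
                .execute().actionGet();
        if (!meta.exists()) {
            throw new IllegalStateException("no _meta document for river " + riverName);
        }

        // 2. The river instance is created from that configuration (omitted here).

        // 3. A _river/[river_name]/_status document is then written back.
        //    This is the write that fails with UnavailableShardsException when
        //    the node sees no active copy of the _river shard.
        client.prepareIndex("_river", riverName, "_status")
                .setSource("{\"ok\":true}")
                .execute().actionGet();
    }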

It's a very long shot that it relates to the problem you see, but I
noticed a problem in how the GET gets executed in terms of the threading
model. I pushed a fix to both the 0.17 branch and master; can you give
the 0.17 branch a go? If you need a build, just ping me over IRC.

-shay.banon

On Wed, Oct 5, 2011 at 4:08 PM, Thibaut Britz
<thibaut.britz@trendiction.com> wrote:

Before doing the change, I even managed on one restart to get a
completely healthy cluster (green), with all rivers running! But after
restarting, I had the same problems again.

I then made the code changes and updated our code to exit directly in
the run() function of all our 3 rivers. The cluster still doesn't come
up properly. The rivers seem to start even before _river is available.

The rivers still won't start, with the same exception:
[2011-10-05 16:00:20,974][WARN ][river ] [Ironclad] failed to create river [trendiction][dsearch_abc]
org.elasticsearch.action.UnavailableShardsException: [_river][0] [5] shardIt, [1] active : Timeout waiting for [1m], request: index {[_river][dsearch_abc][_status], source[{"ok":true,"node":{"id":"DNo8F8NXQEenx2ZaApGecQ","name":"Ironclad","transport_address":"inet[/192.168.5.2:9300]"}}]}
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$3.onTimeout(TransportShardReplicationOperationAction.java:455)
    at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:305)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

How is the cluster state stored? Will the default replica values also
be used for the cluster state (in our case just 1 replica and 1 shard)?
Might this be an issue?
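
One way to answer that directly is to read the _river index metadata out
of the cluster state (again a sketch against the 0.17-era Java client,
method names from memory; printRiverIndexLayout is an illustrative
helper):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.cluster.ClusterState;
    import org.elasticsearch.cluster.metadata.IndexMetaData;

    // Print how many shards/replicas the _river index was actually created with.
    public void printRiverIndexLayout(Client client) {
        ClusterState state = client.admin().cluster().prepareState()
                .execute().actionGet().state();
        IndexMetaData river = state.metaData().index("_river");
        if (river == null) {
            System.out.println("_river index does not exist yet");
            return;
        }
        System.out.println("_river: shards=" + river.numberOfShards()
                + ", replicas=" + river.numberOfReplicas());
    }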

Here is our default config from one of the master nodes:

# Configuration Template for ES

# Cluster Settings

cluster:
  name: abc

path:
  logs: /var/log/elasticsearch/
  work: /hd2/elasticsearch_work/
  data: /hd2/elasticsearch_data/

index:
  number_of_replicas: 1
  number_of_shards: 1
  query:
    bool:
      max_clause_count: 10240
  analysis:
    analyzer:
      abc:
        type: standard
        filter: [standard, lowercase, stop, asciifolding]
      abc2:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop, asciifolding]
  cache:
    filter:
      type: none

# Gateway Settings

gateway:
  recover_after_nodes: 35
  recover_after_time: 2m
  expected_nodes: 40
gateway.local.compress: false
gateway.local.pretty: true

discovery:
  zen.ping_timeout: 20s
  zen.minimum_master_nodes: 3
  zen.ping:
    multicast:
      enabled: false
    unicast:
      hosts: intr6n1[9300],intr6n2[9300],intr6n3[9300],intr6n4[9300],intr6n5[9300]
  zen.fd.ping_timeout: 45s

script.native.relevance.type: com.abc.modules.searchclient.scoring.RelevanceScore

node.master: true
node.data: false

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 2:49 PM, Shay Banon kimchy@gmail.com wrote:

It's still not clear if the river implementation you have causes this or
not. I suggest changing the river implementation you have to do nothing,
and see if the cluster gets into a stable state. If it does, then we can
walk through what the river does, see where the problem is, and what can
be fixed on either elasticsearch or the river implementation.
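
For reference, a do-nothing river along those lines, following the usual
0.17 river plugin skeleton, might look like this (NoopRiver and its log
messages are illustrative; only the base classes come from
elasticsearch):

    import org.elasticsearch.common.inject.Inject;
    import org.elasticsearch.river.AbstractRiverComponent;
    import org.elasticsearch.river.River;
    import org.elasticsearch.river.RiverName;
    import org.elasticsearch.river.RiverSettings;

    // A river that does nothing: useful to check whether the instability
    // comes from the river logic or from the cluster itself.
    public class NoopRiver extends AbstractRiverComponent implements River {

        @Inject
        public NoopRiver(RiverName riverName, RiverSettings settings) {
            super(riverName, settings);
        }

        @Override
        public void start() {
            logger.info("noop river [{}] started, doing nothing", riverName.name());
        }

        @Override
        public void close() {
            logger.info("noop river [{}] closed", riverName.name());
        }
    }

The river would still be registered the usual way through its
_river/[name]/_meta document; only the work inside start() is removed.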

On Wed, Oct 5, 2011 at 2:43 PM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Hello Shay,

My best guess is the following. It looks like the cluster is in an
inconsistent state, and that's why some rivers start and others can't:
they think the river doesn't exist and try to create it, but this fails
because the river already exists somewhere else.

E.g.
Searching sometimes fails with (depending on the node):

Caused by: org.elasticsearch.transport.RemoteTransportException: [Penance II][inet[/46.4.123.140:9300]][indices/search]
Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query], total failure; shardFailures {RemoteTransportException[[Xi'an Chi Xan][inet[/192.168.6.2:9300]][search/phase/query]]; nested: IndexMissingException[[abc_000000000000] missing]; }
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:257)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$3.onFailure(TransportSearchTypeAction.java:210)
    at org.elasticsearch.search.action.SearchServiceTransportAction$2.handleException(SearchServiceTransportAction.java:151)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:158)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:149)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:101)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:302)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:299)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:214)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:274)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:261)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

The index is not missing, since the cluster health was yellow at that
time. Doing the same search on another server works without any
problems.

The head plugin also shows invalid information about the running rivers.
They run on different servers than what's displayed in the interface.
These are the rivers that then fail when they try to save their status.

E.g.:
abc-ddd [active] inet[/192.168.5.5:9300]
I can't find any entry for that river on that host, so the display is
either wrong or hasn't been updated.

The river actually runs on a completely different host, where it was
successfully started but fails to save its status.

The save function which triggers the exception looks like this:

    private boolean createOrUpdateIndex(String mapping, String indexName,
            String typeName, boolean force) throws InterruptedException {
        if (usedIndices.contains(indexName)) return true;
        log.info("creating or updating the index: %s/%s", indexName, typeName);

        while (true) {
            try {
                if (mapping != null)
                    client.admin().indices().prepareCreate(indexName)
                            .addMapping(typeName, mapping)
                            .execute().actionGet(timeout);

                client.prepareIndex(SearchClient.index_indexstatus, conf.riverName, indexName)
                        .setSource("{\"status\":\"active\", \"reason\":\"created\", \"metainfo\":\"indexstatus\"}")
                        .execute().actionGet(timeout);
                log.info("created index");
                usedIndices.add(indexName);
                return true;
            } catch (Exception e) {
                if (ExceptionsHelper.unwrapCause(e) instanceof IndexAlreadyExistsException) {
                    log.info("index %s already exists, trying to set mapping", indexName);
                    try {
                        client.admin().indices().preparePutMapping(indexName)
                                .setType(typeName).setSource(mapping)
                                .execute().actionGet(timeout);
                        log.info("mapping set on existing index");
                        usedIndices.add(indexName);
                        return true;
                    } catch (Exception e2) {
                        log.warn("failed to update mapping for %s, sleeping 10s", e2, indexName);
                        Thread.sleep(10000);
                    }
                } else {
                    log.warn("failed to create index %s, sleeping 10s", e, indexName);
                    Thread.sleep(10000);
                }
            }
        }
    }

It then fails:

[2011-10-05 14:19:09,511][WARN ][com.trendiction.modules.searchclient.river.TWDSIndexer] failed to create index abc_000000000000, sleeping 10s
org.elasticsearch.ElasticSearchTimeoutException: Timeout waiting for task.
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:70)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:59)
    at com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:759)
    at com.trendiction.modules.searchclient.river.TWDSIndexer.a(SourceFile:734)
    at com.trendiction.modules.searchclient.river.TWDSIndexer.run(SourceFile:603)
    at java.lang.Thread.run(Thread.java:662)

I also tried the latest head of the 0.17 branch; it didn't change anything.
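
One mitigation worth sketching for the UnavailableShardsException /
timeout pattern above: wait for the freshly created index to reach
yellow health before writing the status document. This assumes the
0.17-era cluster health API as I remember it; waitForIndexYellow is an
illustrative helper, not part of the river:

    import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;

    // Block until the index has at least yellow health (primaries active),
    // so the follow-up status write does not hit UnavailableShardsException.
    private boolean waitForIndexYellow(Client client, String indexName) {
        ClusterHealthResponse health = client.admin().cluster()
                .prepareHealth(indexName)
                .setWaitForYellowStatus()
                .setTimeout(TimeValue.timeValueSeconds(30))
                .execute().actionGet();
        // timedOut() == true means the primaries never became active in time.
        return !health.timedOut();
    }

In the createOrUpdateIndex loop above, this would be called right after
prepareCreate succeeds and before the status document is indexed.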

Thanks,
Thibaut

On Wed, Oct 5, 2011 at 11:55 AM, Shay Banon kimchy@gmail.com wrote:

Hey,
There shouldn't be a problem with rivers being created on the master
node. I'm wondering what the actual failure when creating the river is;
an exception stack trace should be logged as well. The hard problem
here is that there is "outside" code that I don't really know, so can
you maybe remove all logic from your river and see what happens?

On Wed, Oct 5, 2011 at 2:36 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

I might be wrong, but I think it's due to rivers also being started on
masters:

I have the following config on my master node:
node.master: true
node.data: false
Rivers are still being started on the master node, however.

Is it possible to skip the master for rivers as well?

I had the following log file entries, which indicate that the river
which couldn't be started was blocking the master node (all the
expiration requests arrive at the same time):

[2011-10-05 00:41:18,006][WARN ][river ] [Frog-Man] failed to create river [trendictionrebuilder][dsearch_prosearch-deloitte-cactusnew_rebuilder]
[2011-10-05 00:42:18,008][WARN ][river ] [Frog-Man] failed to write failed status for river creation
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [119909ms] ago, timed out [89909ms] ago, action [/cluster/nodes/indices/shard/store/node], node [[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300]]{master=true}], id [29563]
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [119910ms] ago, timed out [89910ms] ago, action [/cluster/nodes/indices/shard/store/node], node [[Super Rabbit][oDiF2fXITfm68BQTX8uKgw][inet[/192.168.5.13:9300]]{master=false}], id [29559]
[2011-10-05 00:42:18,011][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [119910ms] ago, timed out [89910ms] ago, action [/cluster/nodes/indices/shard/store/node], node [[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300]]{master=false}], id [29575]
[2011-10-05 00:42:18,012][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [59819ms] ago, timed out [29819ms] ago, action [/cluster/nodes/indices/shard/store/node], node [[Odin][xCTnKPHuSniiPhOUqGJ4Qw][inet[/192.168.6.10:9300]]{master=false}], id [32275]
[2011-10-05 00:42:18,012][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [59820ms] ago, timed out [29820ms] ago, action [/cluster/nodes/indices/shard/store/node], node [[Lunatica][w1NJvXjaQgO8Ibgcgq7fKQ][inet[/192.168.5.1:9300]]{master=true}], id [32263]
[2011-10-05 00:42:18,023][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [89901ms] ago, timed out [59901ms] ago, action [/cluster/nodes/indices/shard/store/node], node [[Mimic][E_VPjWIcRkKclDsy9La2zw][inet[/192.168.5.2:9300]]{master=true}], id [30987]
[2011-10-05 00:42:18,027][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [119200ms] ago, timed out [74200ms] ago, action [discovery/zen/fd/ping], node [[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300]]{master=false}], id [29934]
[2011-10-05 00:42:18,028][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [74200ms] ago, timed out [29200ms] ago, action [discovery/zen/fd/ping], node [[Kilmer][4f85JcXAQsGwkrr-c8Q75g][inet[/192.168.6.19:9300]]{master=false}], id [31538]
[2011-10-05 00:42:18,028][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [119708ms] ago, timed out [74707ms] ago, action [discovery/zen/fd/ping], node [[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300]]{master=false}], id [29882]
[2011-10-05 00:42:18,029][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [74708ms] ago, timed out [29708ms] ago, action [discovery/zen/fd/ping], node [[Charon][911x_X6URVqC7_XwTw0XwA][inet[/192.168.5.6:9300]]{master=false}], id [31523]
[2011-10-05 00:42:18,036][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [119886ms] ago, timed out [74886ms] ago, action [discovery/zen/fd/ping], node [[Baron Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300]]{master=false}], id [29583]
[2011-10-05 00:42:18,036][WARN ][transport ] [Frog-Man] Received response for a request that has timed out, sent [74886ms] ago, timed out [29886ms] ago, action [discovery/zen/fd/ping], node [[Baron Samedi][rwYxIVA2TD26ZB7v3Bw_3Q][inet[/192.168.6.3:9300]]{master=false}], id [31517]

On Wed, Oct 5, 2011 at 12:31 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

Just a last comment:

The cluster also never reaches an equilibrium. There are always shards
that are not assigned/initialized and must be reassigned (the cluster
state constantly changes between red and yellow). It is never green.
{
  cluster_name: abc
  status: red
  timed_out: false
  number_of_nodes: 40
  number_of_data_nodes: 40
  active_primary_shards: 229
  active_shards: 452
  relocating_shards: 0
  initializing_shards: 25
  unassigned_shards: 4
}
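
As a side note, the same health information can be polled
programmatically, e.g. to wait for green with a timeout during restarts
(a sketch against the 0.17-era Java client, method names from memory;
waitForGreen is an illustrative helper):

    import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;

    // Wait up to two minutes for green and print the same counters as above.
    public void waitForGreen(Client client) {
        ClusterHealthResponse health = client.admin().cluster().prepareHealth()
                .setWaitForGreenStatus()
                .setTimeout(TimeValue.timeValueMinutes(2))
                .execute().actionGet();
        System.out.println("status=" + health.status()
                + " initializing=" + health.initializingShards()
                + " unassigned=" + health.unassignedShards()
                + " relocating=" + health.relocatingShards());
    }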

On Wed, Oct 5, 2011 at 12:12 AM, Thibaut Britz
thibaut.britz@trendiction.com wrote:

After a while, RoutingValidationExceptions also turn up:

org.elasticsearch.cluster.routing.RoutingValidationException: [Index [dsearch_en_001859508000]: Shard [0] routing table has wrong number of replicas, expected [1], got [0]]
    at org.elasticsearch.cluster.routing.RoutingTable.validateRaiseException(RoutingTable.java:97)
    at org.elasticsearch.cluster.routing.allocation.ShardsAllocation.applyStartedShards(ShardsAllocation.java:77)
    at org.elasticsearch.cluster.action.shard.ShardStateAction$4.execute(ShardStateAction.java:188)
    at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:189)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

Thanks,
Thibaut

Everything up and running again. The fix works!

THANKS A LOT! :)
