Hi,
Using ElasticSearch 0.17.9
We have 3 nodes, with these settings:
gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
index:
  number_of_shards: 3
  number_of_replicas: 2
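For clarity, here is the same block with comments on how we read each setting (assuming it lives in elasticsearch.yml):

gateway.recover_after_nodes: 1   # start gateway recovery once at least 1 node has joined
gateway.recover_after_time: 5m   # after that threshold, wait up to 5 minutes before recovering
gateway.expected_nodes: 2        # ...unless 2 nodes have joined, in which case recover immediately
index:
  number_of_shards: 3            # 3 primary shards per index
  number_of_replicas: 2          # 2 replicas per shard, so with 3 nodes every node holds a copy of every shard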
As soon as any one node goes down, all other nodes stop responding to
requests with this error:
[2011-10-26 02:38:14,677][DEBUG][action.admin.indices.status] [Inferno] [spi][0], node[h4GIhuqDTS2KriM3_VM-Mw], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@17c7a8f3]
org.elasticsearch.transport.RemoteTransportException: [Lodestone][inet[/10.8.197.136:9300]][indices/status/shard]
Caused by: org.elasticsearch.indices.IndexMissingException: [spi] missing
    at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:227)
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:134)
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:58)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:381)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:370)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:238)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    at java.lang.Thread.run(Thread.java:619)

[2011-10-26 02:38:14,860][INFO ][cluster.service] [Inferno] removed {[Lodestone][h4GIhuqDTS2KriM3_VM-Mw][inet[/10.8.197.136:9300]],}, reason: zen-disco-receive(from master [[Stilt-Man][c53Wh69UR_-z0t4MiFJ0VA][inet[/10.8.197.138:9300]]])
--
Any idea what we are doing wrong?
Thanks,
Mike Peters
kimchy (Shay Banon), October 28, 2011, 5:51am
Does it completely stop responding to any request? Can you gist the result
of cluster state (with pretty): curl host:9200/_cluster/state?pretty=1.
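For example, something along these lines (host is just a placeholder for any of your node addresses; redirecting to a file makes it easy to paste into a gist):

# dump the full cluster state, pretty-printed, into a file
curl -XGET 'http://host:9200/_cluster/state?pretty=1' > cluster_state.json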
Yes, all nodes in the cluster refuse connections as soon as one node goes down, even though we have 3 nodes with replicas = 2.
Any help would be highly appreciated!
Here's the cluster state (gist, truncated here):

curl -XGET "http://10.29.60.8:9200/_cluster/state?pretty=1"
{
  "cluster_name" : "SPI",
  "master_node" : "Ub0MuTriRe2r-PgD2PPMsg",
  "blocks" : {
  },
  "nodes" : {
    "Ub0MuTriRe2r-PgD2PPMsg" : {
      "name" : "Projector",
      "transport_address" : "inet[/10.29.60.8:9300]",
      ... (truncated)
If it helps, here's the cluster state right after we restart the node
that goes down (doesn't matter which one, it's always the same
symptom):
Gist "3rd node restarted" (truncated here):

curl -XGET "http://10.29.60.8:9200/_cluster/state?pretty=1"
{
  "cluster_name" : "SPI",
  "master_node" : "Ub0MuTriRe2r-PgD2PPMsg",
  "blocks" : {
  },
  "nodes" : {
    "Ub0MuTriRe2r-PgD2PPMsg" : {
      "name" : "Projector",
      "transport_address" : "inet[/10.29.60.8:9300]",
      ... (truncated)
kimchy (Shay Banon), October 31, 2011, 5:19pm
It seems like the state looks good: the indices are there, and the shards are allocated on the remaining nodes. What happens when you execute a count on the spi index, for example, does it fail?
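For example, something along these lines should do it (localhost:9200 is just a placeholder for one of the remaining nodes; with no query body the count defaults to matching all documents):

# count all documents in the spi index; an empty query defaults to match_all
curl -XGET 'http://localhost:9200/spi/_count?pretty=true'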