3 nodes, replicas=2, entire cluster goes down after losing one node?

kimchy · October 31, 2011, 5:19pm

It seems like the state looks good. Indices are there, and shards are allocated
on the remaining nodes. What happens when you execute "count" on the spi
index, for example? Does it fail?
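
A minimal Python sketch of that check. It assumes a node reachable at a host such as `localhost:9200` and a count response that carries a `count` field plus a `_shards` summary with `total`, `successful` and `failed`. The helper names are made up for illustration and are not part of any client library.

```python
import json
from urllib.request import urlopen


def count_url(host, index):
    """Build the count endpoint URL for one index (host value is an assumption)."""
    return "http://%s/%s/_count" % (host, index)


def summarize_count(response):
    """Reduce a parsed count response to (doc_count, all_shards_answered)."""
    shards = response.get("_shards", {})
    ok = (shards.get("failed", 0) == 0
          and shards.get("successful") == shards.get("total"))
    return response.get("count"), ok


def fetch_count(host, index):
    """Run the count request against a live node and summarize the result."""
    with urlopen(count_url(host, index)) as resp:
        return summarize_count(json.load(resp))
```

Calling `fetch_count("localhost:9200", "spi")` on one of the surviving nodes would show whether the request fails outright or only some shards fail to answer.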

On Mon, Oct 31, 2011 at 4:26 PM, Mike Peters mike@softwareprojects.com wrote:

If it helps, here's the cluster state right after we restart the node
that goes down (doesn't matter which one, it's always the same
symptom):
3rd node restarted · GitHub

On Oct 28, 1:51 am, Shay Banon kim...@gmail.com wrote:

Does it completely stop responding to any request? Can you gist the result
of cluster state (with pretty): curl host:9200/_cluster/state?pretty=1.
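
Once the cluster state is saved as JSON, a short Python sketch like the one below can tally shard copies per index by state. It assumes the state contains a `routing_table.indices.<index>.shards.<shard id>` layout in which each shard id maps to a list of copies with a `state` field. Verify that layout against the actual output, and treat the function name as hypothetical.

```python
from collections import Counter


def shard_states(cluster_state):
    """Count shard copies per index by state (e.g. STARTED, UNASSIGNED)."""
    summary = {}
    # Assumed layout: routing_table -> indices -> <index> -> shards -> <id> -> [copies]
    indices = cluster_state.get("routing_table", {}).get("indices", {})
    for index_name, index_info in indices.items():
        states = Counter()
        for copies in index_info.get("shards", {}).values():
            for copy in copies:
                states[copy.get("state", "UNKNOWN")] += 1
        summary[index_name] = dict(states)
    return summary
```

Passing it the result of `json.load()` on the saved output gives one dictionary per index, so a missing index or a pile of unassigned copies stands out immediately.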

On Wed, Oct 26, 2011 at 9:56 AM, Mike Peters <m...@softwareprojects.com> wrote:

Hi,

Using Elasticsearch 0.17.9

We have 3 nodes, with these settings:

gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
gateway.expected_nodes: 2

index:
  number_of_shards: 3
  number_of_replicas: 2
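
As a sanity check on these numbers, here is a small Python sketch of the arithmetic. It assumes one primary plus `number_of_replicas` copies per shard, at most one copy of a given shard per node, and that every copy was allocated before the failure. The function names are hypothetical.

```python
def total_shard_copies(shards, replicas):
    """Total shard copies in the index: each shard has 1 primary + replicas."""
    return shards * (replicas + 1)


def copies_left_per_shard(nodes, replicas, nodes_lost):
    """Worst-case number of copies of any one shard left after losing nodes."""
    copies = min(replicas + 1, nodes)   # at most one copy of a shard per node
    return max(copies - nodes_lost, 0)  # worst case: each lost node held a copy
```

With 3 shards and 2 replicas this gives 9 copies in total, and `copies_left_per_shard(3, 2, 1)` is 2, so on paper every shard should still have two live copies after a single node is lost.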

As soon as any one node goes down, all other nodes stop responding to
requests with this error:

[2011-10-26 02:38:14,677][DEBUG][action.admin.indices.status] [Inferno] [spi][0], node[h4GIhuqDTS2KriM3_VM-Mw], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@17c7a8f3]
org.elasticsearch.transport.RemoteTransportException: [Lodestone][inet[/10.8.197.136:9300]][indices/status/shard]
Caused by: org.elasticsearch.indices.IndexMissingException: [spi] missing
    at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:238)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    at java.lang.Thread.run(Thread.java:619)
[2011-10-26 02:38:14,860][INFO ][cluster.service ] [Inferno] removed {[Lodestone][h4GIhuqDTS2KriM3_VM-Mw][inet[/10.8.197.136:9300]],}, reason: zen-disco-receive(from master [[Stilt-Man][c53Wh69UR_-z0t4MiFJ0VA][inet[/10.8.197.138:9300]]])

--

Any idea what we are doing wrong?

Thanks,
Mike Peters
