Node showing twice in the cluster (same IP/port)

Hi,

We run an 8-instance ES cluster in EC2 with one index (size: 3695.5gb),
48 shards, 1 replica per shard. ES version is 0.20.2, Oracle JVM
1.6.38, 30GB heap, total RAM 60GB, 2 * 1024GB SSD drives in RAID 0
(lvm stripe), 16 cores.

A few days ago, around 7:17 UTC, one of the nodes was rebooted
(underlying host issue according to AWS; we are still investigating
the root cause). Once rebooted, it reconnected to the cluster but
showed up twice in the cluster config:

O4iZm6dyTU6NT2Z4WiIZlg: {
    name: es-6385b.domain.com
    transport_address: inet[/10.x.x.149:9300]
    attributes: {
        aws_availability_zone: us-east-1b
        max_local_storage_nodes: 1
    }
    .....
    .....
}
sLam2ByVTQyPcc4Nzk9oMQ: {
    name: es-6385b.domain.com
    transport_address: inet[/10.x.x.149:9300]
    attributes: {
        aws_availability_zone: us-east-1b
        max_local_storage_nodes: 1
    }

Where sLam2ByVTQyPcc4Nzk9oMQ is the original node ID and
O4iZm6dyTU6NT2Z4WiIZlg is the ID after the restart.
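
For anyone who wants to check for the same condition, here is a minimal Python sketch of how the duplicate could be detected. It assumes the node list has already been fetched and parsed into a dict keyed by node ID, shaped like the output above. The function name find_duplicate_addresses and the third node in the sample data are made up for illustration and are not part of our cluster.

```python
from collections import defaultdict


def find_duplicate_addresses(nodes):
    """Group node IDs by transport_address; keep addresses claimed by more than one ID."""
    by_address = defaultdict(list)
    for node_id, info in nodes.items():
        by_address[info["transport_address"]].append(node_id)
    return {addr: sorted(ids) for addr, ids in by_address.items() if len(ids) > 1}


# Sample input mirroring the two entries above, plus one hypothetical healthy node.
nodes = {
    "O4iZm6dyTU6NT2Z4WiIZlg": {
        "name": "es-6385b.domain.com",
        "transport_address": "inet[/10.x.x.149:9300]",
    },
    "sLam2ByVTQyPcc4Nzk9oMQ": {
        "name": "es-6385b.domain.com",
        "transport_address": "inet[/10.x.x.149:9300]",
    },
    "hypothetical-node-id": {
        "name": "es-other.domain.com",
        "transport_address": "inet[/10.x.x.150:9300]",
    },
}

duplicates = find_duplicate_addresses(nodes)
for address, ids in duplicates.items():
    print(address, "is claimed by", ", ".join(ids))
```

This only flags the symptom (two IDs on one address); it says nothing about why the master keeps the stale entry.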

After a while the node is removed from the cluster. From the logs, it
seems that the master is still trying to connect to
sLam2ByVTQyPcc4Nzk9oMQ, although a node with this ID no longer exists.

Around 10:13 UTC the node es-6385b.domain.com runs out of memory:

[2013-07-23 10:13:44,271][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space

At this point the master notices the old node ID is no longer contactable:

[2013-07-23 12:20:54,355][WARN ][cluster.service ] [es-6388d.domain.com] failed to reconnect to node [es-6385b.domain.com][sLam2ByVTQyPcc4Nzk9oMQ][inet[/10.x.x.149:9300]]{aws_availability_zone=us-east-1b, max_local_storage_nodes=1}

org.elasticsearch.transport.ConnectTransportException: [es-6385b.domain.com][inet[/10.x.x.149:9300]] connect_timeout[30s]

However, the node is still not removed from the cluster (note the old
ID sLam2ByVTQyPcc4Nzk9oMQ in the log entry). There are many of these
messages in the logs, yet the master never evicts the node from the
cluster. Why?
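
To quantify how often the master retries the stale ID, something like the following Python sketch could be run over the master's log. The regex is based only on the warning format shown above, and the function name count_failed_reconnects is made up for illustration. The sample text is that single warning repeated three times purely to exercise the code, not real log volume.

```python
import re
from collections import Counter

# Captures the "[name][node_id]" pair that follows the warning text.
# \s* tolerates a line wrap between the message and the node description.
RECONNECT_RE = re.compile(r"failed to reconnect to node\s*\[([^\]]+)\]\[([^\]]+)\]")


def count_failed_reconnects(log_text):
    """Return a Counter keyed by (node name, node ID) for each reconnect warning."""
    return Counter(RECONNECT_RE.findall(log_text))


sample = (
    "[2013-07-23 12:20:54,355][WARN ][cluster.service ] "
    "[es-6388d.domain.com] failed to reconnect to node "
    "[es-6385b.domain.com][sLam2ByVTQyPcc4Nzk9oMQ][inet[/10.x.x.149:9300]]"
    "{aws_availability_zone=us-east-1b, max_local_storage_nodes=1}\n"
)

counts = count_failed_reconnects(sample * 3)
for (name, node_id), n in counts.items():
    print(name, node_id, n)
```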

After a while es-6388d.domain.com is restarted, and again it tries to
join the cluster with a new ID, but it still shows up twice in the
config: once with the original pre-reboot ID and once with the new ID
from the latest restart.

This is only resolved with a full cluster restart (or at least this is
the only way we managed to 'solve' it).

I have tried to collect as much info as I could and I have all the
logs available for inspection. I would really appreciate any help in
understanding how this situation (duplicate node) can happen, what
issues can result from it (routing? shard distribution? queries?),
and how we can resolve the problem if it ever happens again (or,
better, prevent it) without a full cluster restart.

Thanks,

Simone
