ElasticSearch on EC2 - cluster fails to recover when one of the nodes times out and then comes back

This is on 0.15.2-1 (I can't easily move up to 0.16.* in the near
future, so just let me know if this has been fixed in a more recent
version and I'll shut up - from a quick scan I couldn't see any
candidate fixes in the change files though)

On a 2-node (test/reference) cluster running on EC2, the host running one
of the nodes ("node A") ran out of memory (because of another process on
the box) and hung for a few minutes (e.g. no ssh connectivity) before the
offending process was killed and the node returned to normal.

During that time "node B" detected a loss of connectivity to "node A"
and removed it. When "node A" recovered, it did not get added back to
"node B"'s list.

So at that point:

  • node A believed it was part of a 2-node cluster (and accepted new
    documents posted to it)
  • node B believed it was part of a 1-node cluster (and obviously did
    not pass any new documents to node A)

The log file entries from "node B", which might make the above clearer
(just pasted below since the log is so short):

--
2011-05-31 12:10:32.942 [INFO] discovery.ec2:79 - [Roma] master_left [[Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
2011-05-31 12:10:32.944 [INFO] cluster.service:79 - [Roma] master {new [Roma][NODEB][inet[/10.84.45.179:9300]], previous [Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]]}, removed {[Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]],}, reason: zen-disco-master_failed ([Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]])
2011-05-31 12:20:03.317 [WARN] discovery.ec2:87 - [Roma] master should not receive new cluster state from [[Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]]]
(and this last message repeated every few seconds forever....)
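
As an aside, the "failed to ping, tried [3] times, each with maximum [30s]
timeout" wording looks like the default zen fault-detection settings. One
workaround I'm considering is loosening those so a node that hangs for a
couple of minutes isn't dropped straight away - something like the sketch
below (I haven't verified these setting names against 0.15.2, so treat
them as my assumption based on the zen discovery docs):

--
# Sketch only - loosen zen fault detection so a briefly-hung node is not
# removed (the defaults appear to be ping_timeout 30s, ping_retries 3).
# Setting names not verified against 0.15.2.
discovery:
    type: ec2
    zen:
        fd:
            ping_timeout: 60s
            ping_retries: 6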

My YML configuration is very simple (just pasted below since it's so
short):

--
cluster:
    name: infinite-aws
discovery:
    type: ec2
cloud:
    aws:
        access_key:
        secret_key:
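
One other thought: would requiring a quorum of master-eligible nodes help
here, so that a lone node never elects itself master? I'm not sure whether
discovery.zen.minimum_master_nodes exists in 0.15.x (it may only be in
newer versions), so the sketch below is an assumption to be checked
against the docs - and with only two nodes it also means the cluster stops
accepting writes whenever either node is down:

--
# Sketch only - assumes discovery.zen.minimum_master_nodes is supported by
# the version in use; please check the docs before relying on it.
discovery:
    type: ec2
    zen:
        # With a 2-node cluster, requiring 2 master-eligible nodes means a
        # lone node will not elect itself master (avoiding split-brain),
        # at the cost of availability while the other node is down.
        minimum_master_nodes: 2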

Once I restarted node B (first manually deleting all its documents to
make life simpler), everything returned to a working state (with "node
A" as the master and "node B" as the slave).

Again - apologies if this is already fixed. If not, shall I create an
issue? I'm not 100% sure what the desired behaviour should be ("node A"
regains mastership? "node A" becomes a slave?), but presumably not the
above :)

Any thoughts appreciated - and great job on ElasticSearch!

Alex
apiggott@ikanow.com

Something similar to what you described has been fixed in 0.16. Maybe you can give it a go?
