ElasticSearch on EC2 - cluster fails to recover when one of the nodes times out and then comes back

This is on 0.15.2-1 (I can't easily move up to 0.16.* in the near
future, so just let me know if this has been fixed in a more recent
version and I'll shut up - from a quick scan I couldn't see any
candidate fixes in the change files though)

On a 2-node (test/reference) cluster running on EC2, the host running one
of the nodes ("node A") ran out of memory (because of another process on
the box) and hung for a few minutes (e.g. no ssh connectivity) before the
offending process was killed and the node returned to normal.

During that time "node B" detected a loss of connectivity to "node A"
and removed it. When "node A" recovered, it did not get added back to
"node B"'s list.

So at that point:

  • node A believed it was part of a 2-node cluster (and accepted new
    documents posted to it)
  • node B believed it was part of a 1-node cluster (and obviously did
    not pass any new documents to node A)

The log file entries from "node B", which might make the above clearer
(just pasted below since the log is so short):

--
2011-05-31 12:10:32.942 [INFO] discovery.ec2:79 - [Roma] master_left [[Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
2011-05-31 12:10:32.944 [INFO] cluster.service:79 - [Roma] master {new [Roma][NODEB][inet[/10.84.45.179:9300]], previous [Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]]}, removed {[Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]],}, reason: zen-disco-master_failed ([Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]])
2011-05-31 12:20:03.317 [WARN] discovery.ec2:87 - [Roma] master should not receive new cluster state from [[Jameson, J. Jonah][NODEA][inet[/10.201.3.150:9300]]]
(and this last message repeated every few seconds forever....)
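
As an aside, the "failed to ping, tried [3] times, each with maximum [30s]
timeout" wording looks like the default zen fault-detection settings. One
workaround I'm considering is loosening those so a node that hangs for a
couple of minutes isn't dropped straight away - something like the sketch
below (I haven't verified these setting names against 0.15.2, so treat
them as my assumption based on the zen discovery docs):

--
# Sketch only - loosen zen fault detection so a briefly-hung node is not
# removed (the defaults appear to be ping_timeout 30s, ping_retries 3).
# Setting names not verified against 0.15.2.
discovery:
    type: ec2
    zen:
        fd:
            ping_timeout: 60s
            ping_retries: 6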

My YML configuration is very simple (just pasted below since it's so
short):

--
cluster:
    name: infinite-aws
discovery:
    type: ec2
cloud:
    aws:
        access_key:
        secret_key:
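
One other thought: would requiring a quorum of master-eligible nodes help
here, so that a lone node never elects itself master? I'm not sure whether
discovery.zen.minimum_master_nodes exists in 0.15.x (it may only be in
newer versions), so the sketch below is an assumption to be checked
against the docs - and with only two nodes it also means the cluster stops
accepting writes whenever either node is down:

--
# Sketch only - assumes discovery.zen.minimum_master_nodes is supported by
# the version in use; please check the docs before relying on it.
discovery:
    type: ec2
    zen:
        # With a 2-node cluster, requiring 2 master-eligible nodes means a
        # lone node will not elect itself master (avoiding split-brain),
        # at the cost of availability while the other node is down.
        minimum_master_nodes: 2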

Once I restarted node B (first manually deleting all its documents to
make life simpler), everything returned to a working state (with "node
A" as the master and "node B" as the slave).

Again - apologies if this is already fixed. If not, shall I create an
issue? I'm not 100% sure what the desired behaviour should be ("node A"
regains mastership? "node A" becomes a slave?), but presumably not the
above :)

Any thoughts appreciated - and great job on ElasticSearch!

Alex
apiggott@ikanow.com

Something similar to what you described has been fixed in 0.16. Maybe you can give it a go?
