When restarting a test cluster, stop all nodes before starting any new nodes

Just thought I'd share the cause of a cluster startup problem I was having,
in case anyone else makes the same mistake in their deployment script.

I wrote a simple shell script to deploy elasticsearch on three nodes, used
in a nightly test environment.

I had a problem where the cluster would sometimes, and only sometimes, start red. It looked like a master was failing. Node es1's log showed it detected_master es3:

[2013-03-21 06:49:14,949][INFO ][cluster.service ] [es1.tx.example.com] detected_master [es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]], added {[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]],}, reason: zen-disco-receive(from master [[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]]])

but then master_left quickly because "[transport disconnected (with
verified connect)]" and zen-disco-master_failed:

[2013-03-21 06:49:14,990][INFO ][discovery.zen ] [es1.tx.example.com] master_left [[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]]], reason [transport disconnected (with verified connect)]
[2013-03-21 06:49:15,056][INFO ][discovery ] [es1.tx.example.com] unittest-es1.tx.example.com-es2.tx.example.com-es3.tx.example.com/63m5eMNYR1-3xHgRPlYclA
[2013-03-21 06:49:15,057][INFO ][cluster.service ] [es1.tx.example.com] master {new [es1.tx.example.com][63m5eMNYR1-3xHgRPlYclA][inet[/172.30.10.191:9300]], previous [es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]]}, removed {[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]],}, reason: zen-disco-master_failed ([es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]])

Eventually node es3 would reappear, but note it has a different unique id ("MYc6..." instead of "4uxm..."), i.e. it's a brand-new process, not the old master coming back:

[2013-03-21 06:49:23,941][INFO ][cluster.service ] [es1.tx.example.com] added {[es3.tx.example.com][MYc6kHstS-6MbLgucUjIXg][inet[/172.30.10.193:9300]],}, reason: zen-disco-receive(join from node[[es3.tx.example.com][MYc6kHstS-6MbLgucUjIXg][inet[/172.30.10.193:9300]]])

Of course, the problem was that I was looping through the nodes serially, killing any existing es process on each host just before starting the new one there. So sometimes a freshly started node would join the old master (as es1 joined es3 above) before the loop got around to killing that master, which then vanished out from under the new cluster:

for ELASTICSEARCH_HOST in "${ELASTICSEARCH_HOSTS[@]}"
do
    # ideally this would be done in parallel on each host
    ssh root@${ELASTICSEARCH_HOST} "pkill -9 -f 'java.*elasticsearch'"
    ...
    ssh root@${ELASTICSEARCH_HOST} "cd ${REMOTE_DIR}; \
        ./elasticsearch-${ELASTICSEARCH_VERSION}/bin/elasticsearch \
        -Des.max-open-files=true"
    ...
done

The fix is just to loop through all the nodes killing each es process first, then loop through again to start them up.
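
For reference, the fixed version looks roughly like this: a minimal sketch of the two-pass approach, using the same variables and commands as the snippet above (minus the elided steps):

# pass 1: stop every node, so no stale master survives anywhere
for ELASTICSEARCH_HOST in "${ELASTICSEARCH_HOSTS[@]}"
do
    ssh root@${ELASTICSEARCH_HOST} "pkill -9 -f 'java.*elasticsearch'"
done

# pass 2: start all nodes against a fully stopped cluster
for ELASTICSEARCH_HOST in "${ELASTICSEARCH_HOSTS[@]}"
do
    ssh root@${ELASTICSEARCH_HOST} "cd ${REMOTE_DIR}; \
        ./elasticsearch-${ELASTICSEARCH_VERSION}/bin/elasticsearch \
        -Des.max-open-files=true"
done

That way no new node can ever see, and join, an old master.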

--Jamshid


Why not use the cluster shutdown API instead? Much, much safer IMO.
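
In case it helps anyone, that's the nodes shutdown API from this era of elasticsearch (it was removed in later versions). A minimal sketch, assuming a node answering on localhost:9200:

# shut down every node in the cluster in one call
curl -XPOST 'http://localhost:9200/_shutdown'

# or shut down only the node that receives the request
curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'

A clean shutdown lets the node leave the cluster gracefully instead of being kill -9'd mid-flight.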
