Just thought I'd share the cause of a cluster startup problem I was having,
in case anyone else makes the same mistake in their deployment script.
I wrote a simple shell script to deploy elasticsearch on three nodes for a
nightly test environment.
I had a problem where (only!) sometimes the cluster would start up red. It
looked like a master was failing. Node es1's log showed it detected_master
es3:
[2013-03-21 06:49:14,949][INFO ][cluster.service ]
[es1.tx.example.com] detected_master
[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]],
added
{[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]],},
reason: zen-disco-receive(from master
[[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]]])
but then it quickly logged master_left with reason "[transport disconnected
(with verified connect)]" and zen-disco-master_failed:
[2013-03-21 06:49:14,990][INFO ][discovery.zen ]
[es1.tx.example.com] master_left
[[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]]],
reason [transport disconnected (with verified connect)]
[2013-03-21 06:49:15,056][INFO ][discovery ]
[es1.tx.example.com]
unittest-es1.tx.example.com-es2.tx.example.com-es3.tx.example.com/63m5eMNYR1-3xHgRPlYclA
[2013-03-21 06:49:15,057][INFO ][cluster.service ]
[es1.tx.example.com] master {new
[es1.tx.example.com][63m5eMNYR1-3xHgRPlYclA][inet[/172.30.10.191:9300]],
previous
[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]]},
removed
{[es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]],},
reason: zen-disco-master_failed
([es3.tx.example.com][4uxmLPLgS7qBlPo7Uv2AGA][inet[/172.30.10.193:9300]])
Eventually node es3 would reappear, but note that it rejoined with a
different unique id ("MYc6..." instead of "4uxm..."), because it was a
brand-new process.
[2013-03-21 06:49:23,941][INFO ][cluster.service ]
[es1.tx.example.com] added
{[es3.tx.example.com][MYc6kHstS-6MbLgucUjIXg][inet[/172.30.10.193:9300]],},
reason: zen-disco-receive(join from
node[[es3.tx.example.com][MYc6kHstS-6MbLgucUjIXg][inet[/172.30.10.193:9300]]])
Of course, the problem was that I was looping through the nodes serially,
killing any existing es process on a host and then immediately starting the
new one there before moving on. So sometimes a new node would come up and
join an old (master) node that hadn't been killed yet; moments later the
script killed that old master, and the new nodes were left recovering from
a failed master.
for ELASTICSEARCH_HOST in "${ELASTICSEARCH_HOSTS[@]}"
do
    # ideally this would be done in parallel on each host
    # kill the old es process, then immediately start the new one on the
    # same host before moving on: this per-host ordering is the bug
    ssh root@${ELASTICSEARCH_HOST} "pkill -9 -f 'java.*elasticsearch'"
    ...
    ssh root@${ELASTICSEARCH_HOST} "cd ${REMOTE_DIR}; \
        ./elasticsearch-${ELASTICSEARCH_VERSION}/bin/elasticsearch -Des.max-open-files=true"
    ...
done
The fix is just to split the loop: first go through all the nodes killing
every old es process, then loop through them again starting the new ones.
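Something like this (an untested sketch reusing the same variables as
above; the trailing health check is an extra I would add, using the
standard _cluster/health wait_for_status parameter):

# first pass: kill every old es process so no stale master survives
for ELASTICSEARCH_HOST in "${ELASTICSEARCH_HOSTS[@]}"
do
    ssh root@${ELASTICSEARCH_HOST} "pkill -9 -f 'java.*elasticsearch'"
done

# second pass: only now bring up the new nodes
for ELASTICSEARCH_HOST in "${ELASTICSEARCH_HOSTS[@]}"
do
    ssh root@${ELASTICSEARCH_HOST} "cd ${REMOTE_DIR}; \
        ./elasticsearch-${ELASTICSEARCH_VERSION}/bin/elasticsearch -Des.max-open-files=true"
done

# optional: block until the cluster reports green (or time out)
curl -s "http://${ELASTICSEARCH_HOSTS[0]}:9200/_cluster/health?wait_for_status=green&timeout=60s"

That way every old process, master or not, is gone before the first new
node tries to discover a master.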
--Jamshid