Hey,
I'm running master from yesterday morning. I have never hit this issue
before, and I was previously running master from about a week earlier. I
have two data nodes running:
dm-adsearchd102-ElasticSearch
dm-adsearchd103-ElasticSearch
I shut down the d103 node via the service stop command and saw this in
the logs for the d102 node:
[2010-11-16 16:25:17,764][INFO ][discovery.zen ] [DM-ADSEARCHD102.dev.local-ElasticSearch] master_left [[dm-adsearchd103.dev.local-ElasticSearch][l3Q3CX84SpGfcq-WzPDaUg][inet[/10.2.20.164:9300]]], reason [shut_down]
[2010-11-16 16:25:17,775][INFO ][cluster.service ] [DM-ADSEARCHD102.dev.local-ElasticSearch] master {new [DM-ADSEARCHD102.dev.local-ElasticSearch][7yu0RHxETUKROtrrGxBhIw][inet[/10.2.20.160:9300]], previous [dm-adsearchd103.dev.local-ElasticSearch][l3Q3CX84SpGfcq-WzPDaUg][inet[/10.2.20.164:9300]]}, removed {[dm-adsearchd103.dev.local-ElasticSearch][l3Q3CX84SpGfcq-WzPDaUg][inet[/10.2.20.164:9300]],}, reason: zen-disco-master_failed ([dm-adsearchd103.dev.local-ElasticSearch][l3Q3CX84SpGfcq-WzPDaUg][inet[/10.2.20.164:9300]])
This looks correct. However, we were running two applications with
node-based clients, and they logged nothing at the time of the initial
disconnect, although I did see a search fail with an
AlreadyClosedException. A few minutes later, I see:
2010-11-16 16:28:15,879 WARN > [dm-adsearchd102.dev.local-essearcherserver] master_left and no other node elected to become master, current nodes: [[WKSPPEARCYW7.wsod.local-ESIndexer][L2QpS2IfTLqsZRf1OB76rw][inet[/10.2.123.16:9300]]{client=true, data=false}, [dm-adsearchd102.dev.local-ESIndexer][qrSmIjycSA2pQ0OtVcdDng][inet[/10.2.20.160:9301]]{client=true, data=false}, [dm-adsearchd102.dev.local-essearcherserver][7G6qabvhRKuAuxRBN92t6Q][inet[/10.2.20.160:9302]]{client=true, data=false}] (Log4jESLogger.internalWarn:87)(elasticsearch[dm-adsearchd102.dev.local-essearcherserver]clusterService#updateTask-pool-5-thread-1)
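For reference, both apps create their node clients roughly like this (a
sketch from memory; the cluster name here is hypothetical):

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

// Client-only node: joins the cluster via zen discovery but holds no data.
// This is what shows up as {client=true, data=false} in the log above.
Node node = NodeBuilder.nodeBuilder()
        .clusterName("adsearch")  // hypothetical name for our cluster
        .client(true)
        .node();
Client client = node.client();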
Afterwards, the node clients failed all search and index requests. I
didn't notice immediately; about 20 minutes later I restarted both apps
that use node clients, and they both came up cleanly.
I'm not sure whether this was actually a split-brain condition; it
seems more like different nodes disagreed about who was in the cluster.
I do see the node client attempting to re-establish communication with
the one active node, but it is failing:
2010-11-16 16:28:19,746 WARN > [dm-adsearchd102.dev.local-essearcherserver] failed to send ping to [[#zen_unicast_1#][inet[DM-ADSEARCHD102.dev.local/10.2.20.160:9300]]] (Log4jESLogger.internalWarn:91)(elasticsearch[dm-adsearchd102.dev.local-essearcherserver][tp]-pool-1-thread-23)
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[DM-ADSEARCHD102.dev.local/10.2.20.160:9300]][discovery/zen/unicast]
        at org.elasticsearch.transport.TransportService$TimeoutTimerTask.run(TransportService.java:316)
        at org.elasticsearch.timer.TimerService$ThreadedTimerTask$1.run(TimerService.java:113)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
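For context, the node clients use unicast zen discovery; the settings
look roughly like this (a sketch, and the exact hosts list may differ
slightly from what I show here):

import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

// Unicast discovery pointed at the data nodes, multicast off; this is
// what produces the #zen_unicast_1# ping target in the warning above.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("discovery.zen.ping.multicast.enabled", false)
        .put("discovery.zen.ping.unicast.hosts",
                "DM-ADSEARCHD102.dev.local:9300,DM-ADSEARCHD103.dev.local:9300")
        .build();
Node node = NodeBuilder.nodeBuilder().settings(settings).client(true).node();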
The one good data node seems to be ignoring the join requests:
[2010-11-16 16:31:42,880][WARN ][discovery.zen ] [DM-ADSEARCHD102.dev.local-ElasticSearch] received a join request for an existing node [[dm-adsearchd102.dev.local-essearcherserver][7G6qabvhRKuAuxRBN92t6Q][inet[/10.2.20.160:9302]]{client=true, data=false}]
I can make all the logs available if that would help.
In my app, I believe that if I had closed the node client and reopened
it, that probably would have addressed the issue.
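Something like this is what I have in mind as a workaround (hypothetical;
we don't do this today):

import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

// Hypothetical recovery path: close the stuck client node and start a
// fresh one so it rejoins the cluster with a clean view of its state.
static Node reopenNodeClient(Node stuck) {
    stuck.close();               // discard the stale cluster view
    return NodeBuilder.nodeBuilder()
            .client(true)
            .node();             // rejoins via zen discovery
}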
Please let me know if you need any other details.
Thanks,
Paul