We're still seeing node drops, and what is more bizarre is that we're seeing
this on a test cluster we stood up that has no activity on it at all (no
reads or writes going to it). Does anyone have any additional thoughts?
Here is the cluster configuration and the log entries we see when the
drops happen.
SC-TLS1 - 4GB Memory 1GB Heap (Master)
SC-TLS2 - 4GB Memory 1GB Heap (Master)
SC-TLS3 - 8GB Memory 1GB Heap (Data)
SC-TLS4 - 8GB Memory 1GB Heap (Data)
SC-TLS5 - 8GB Memory 1GB Heap (Data)
PX-TLS3 - 8GB Memory 1GB Heap (Data)
PX-TLS4 - 8GB Memory 1GB Heap (Data)
PX-TLS5 - 8GB Memory 1GB Heap (Data)
Elasticsearch 1.0.1
Elasticsearch Configuration Settings
bootstrap.mlockall: true
discovery.zen.ping.timeout: 15s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.9.84.206[9300-9400]",
"10.9.84.213[9300-9400]"]
action.destructive_requires_name: true
discovery.zen.fd.ping_interval: 30s
discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 10
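To rule out a stale elasticsearch.yml on any one node, we can also double-check what each node actually loaded via the node info API. A minimal check, assuming the default HTTP port 9200 (the IP below is SC-TLS1):

# Dump the effective settings of every node in the cluster
curl -s 'http://10.9.84.206:9200/_nodes/settings?pretty'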
Events from TLS1 (Master)
[2014-05-26 22:09:22,953][INFO ][cluster.service ] [SC-TLS1] removed {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[/10.9.64.223:9300]]{dc=PX, master=false},}, reason: zen-disco-receive(from master [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}])
[2014-05-26 22:12:07,085][INFO ][cluster.service ] [SC-TLS1] added {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[/10.9.64.223:9300]]{dc=PX, master=false},}, reason: zen-disco-receive(from master [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}])
Events from PX-TLS5
[2014-05-26 22:09:37,010][INFO ][discovery.zen ] [PX-TLS5] master_left [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}], reason [do not exists on master, act as master failure]
[2014-05-26 22:09:37,011][INFO ][cluster.service ] [PX-TLS5] master {new [SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true}, previous [SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}}, removed {[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true},}, reason: zen-disco-master_failed ([SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true})
[2014-05-26 22:10:07,035][INFO ][discovery.zen ] [PX-TLS5] master_left [[SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true}], reason [no longer master]
[2014-05-26 22:10:07,036][WARN ][discovery.zen ] [PX-TLS5] not enough master nodes after master left (reason = no longer master), current nodes: {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[PX-TLS5/10.9.64.223:9300]]{dc=PX, master=false},[PX-PRD-TLS3][t9ZGWrc0Qi2ASDF5te75Pw][PX-prd-tls3][inet[/10.9.64.213:9300]]{dc=PX, master=false},[SC-TLS5][NulqNMVoQiu2nu4p6w8Usg][SC-tls5][inet[/10.9.84.210:9300]]{dc=SC, master=false},[SC-TLS4][DGWDAMr9QYmN5nNjFNMyjw][SC-tls4][inet[/10.9.84.209:9300]]{dc=SC, master=false},[SC-TLS3][0QNRAMFRSgizAfWO9yxBdw][SC-tls3][inet[/10.9.84.214:9300]]{dc=SC, master=false},[PX-PRD-TLS4][4gh2_7c2RiWY9MZQCuJtjw][PX-prd-tls4][inet[/10.9.64.214:9300]]{dc=PX, master=false},}
[2014-05-26 22:10:07,037][INFO ][cluster.service ] [PX-TLS5] removed {[SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true},[PX-PRD-TLS3][t9ZGWrc0Qi2ASDF5te75Pw][PX-prd-tls3][inet[/10.9.64.213:9300]]{dc=PX, master=false},[SC-TLS5][NulqNMVoQiu2nu4p6w8Usg][SC-tls5][inet[/10.9.84.210:9300]]{dc=SC, master=false},[SC-TLS4][DGWDAMr9QYmN5nNjFNMyjw][SC-tls4][inet[/10.9.84.209:9300]]{dc=SC, master=false},[SC-TLS3][0QNRAMFRSgizAfWO9yxBdw][SC-tls3][inet[/10.9.84.214:9300]]{dc=SC, master=false},[PX-PRD-TLS4][4gh2_7c2RiWY9MZQCuJtjw][PX-prd-tls4][inet[/10.9.64.214:9300]]{dc=PX, master=false},}, reason: zen-disco-master_failed ([SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true})
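In case it helps narrow this down, the next time a drop happens we can compare each side's view of the cluster directly. A rough sketch, assuming the default HTTP port 9200 (_cat/nodes and _cluster/health are available in 1.0.x):

# The master's view of membership and overall health (SC-TLS1)
curl -s 'http://10.9.84.206:9200/_cat/nodes?v'
curl -s 'http://10.9.84.206:9200/_cluster/health?pretty'

# The dropped data node's view (PX-TLS5 in the events above)
curl -s 'http://10.9.64.223:9200/_cat/nodes?v'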
On Monday, April 28, 2014 9:39:04 AM UTC-6, skik...@gmail.com wrote:
So far the only log message we've seen is:
zen-disco-node_failed([CDPX-PRD-ELS4][lkquUBfHT1aXAO3-_tCNCg][cdpx-prd-els4][inet[10.9.64.142/10.9.64.142:9300]]{master=false}), reason failed to ping, tried [5] times, each with maximum [1m] timeout
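As I understand it, the retry count and timeout in that message come straight from the zen fault-detection settings: the master drops a node after ping_retries consecutive pings have each timed out after ping_timeout. Illustrative values that would produce exactly that message (not necessarily what we were running at the time):

discovery.zen.fd.ping_timeout: 1m   # maximum wait per ping
discovery.zen.fd.ping_retries: 5    # consecutive failures before the node is dropped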
We have other data traversing the network that would be very sensitive to
any latency or outages, in addition to alerts that would fire if we had a
network outage, so I am confident we don't have any network issues when
this occurs. Furthermore, we only ever see data nodes drop; the masters
never drop.
Is there a recommended heap size for master-only nodes? And are there any
recommendations on heap size for data nodes? Could these drops simply be
timeouts during long GC pauses, since our data nodes have larger heaps?
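If it is GC, it should show up in the per-node JVM stats; a minimal way to check, assuming the default HTTP port 9200 (10.9.84.214 is SC-TLS3, one of our data nodes):

# Heap usage plus GC collection counts and total collection time per node;
# long or frequent old-gen collections would point at GC pauses
curl -s 'http://10.9.84.214:9200/_nodes/stats/jvm?pretty'

For reference, on 1.x the heap itself is set at startup through the ES_HEAP_SIZE environment variable, e.g.:

export ES_HEAP_SIZE=1g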
On Friday, April 25, 2014 5:49:44 PM UTC-6, Alexander Reelsen wrote:
Hey,
is there any reason in the log file of the master node why it was
de-elected (a network outage as well)? Did you also give your master nodes a
large heap, which could cause long pauses during GC?
--Alex
On Mon, Apr 21, 2014 at 5:51 PM, skik...@gmail.com wrote:
We are currently running dedicated master nodes, but I believe they are
also servicing queries. I can change it so that queries only hit the
data nodes and see if that eliminates the issue...
On Monday, April 21, 2014 3:40:59 PM UTC-6, Binh Ly wrote:
Other than network, is it possible that your nodes could sometimes be
overloaded such that they cannot respond immediately? If that's the case,
you can probably take 3 nodes (servers) and make them master-only nodes
(node.master: true, node.data: false). Set discovery.zen.minimum_master_nodes: 2
on those 3 nodes. Then make the rest of your data nodes non-master-eligible
(node.master: false, node.data: true). This way you have 3 nodes dedicated
solely to cluster state/master tasks, unimpeded by load or anything other
than your network. Just don't run anything else on them or send
queries/indexing jobs to these 3 nodes.
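Roughly, the relevant elasticsearch.yml entries would look like this (just a sketch of the settings above, not a complete config):

# on the 3 dedicated master nodes
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2   # quorum of the 3 master-eligible nodes

# on the rest of the nodes (data only)
node.master: false
node.data: true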