Transient Network Outage and Cluster Health


(Kenneth Loafman-2) #1

Hi,

We have two clusters that both go to yellow when there is a transient
network outage. This mostly happens overnight, perhaps due to some form of
maintenance on the cloud provider's part. Neither cluster ever recovers the
connection, and both require a restart of their secondary nodes. Is there a
setting I need to change to keep this from happening? The relevant
part of the config file is:

cloud:
    aws:
        access_key: munged
        secret_key: munged

gateway:
    type: s3
    s3:
        bucket: munged
    recover_after_nodes: 2

network:
    host: host0

discovery.zen.ping.multicast:
    enabled: false

discovery.zen.ping.unicast:
    hosts: ["host0:9300","host1:9300"]

...Thanks,
...Ken


(Shay Banon) #2

Currently, when a node gets disconnected from a cluster, it does not
rejoin the cluster automatically; it requires a restart in order to
rejoin. I am working on improving that.

For now, maybe just increase the default fault detection timeouts? See:
http://www.elasticsearch.com/docs/elasticsearch/modules/discovery/zen/#Fault_Detection
What message do you get in the log when it gets disconnected?
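
As a rough sketch of what raising those timeouts could look like in the
config file: the fragment below assumes the fault detection settings live
under discovery.zen.fd and are named ping_interval, ping_timeout and
ping_retries. The values are illustrative placeholders, not recommendations,
so check the names and defaults on the page linked above for your version.

# Assumed setting names; verify against the Fault Detection docs.
discovery.zen.fd:
    # how often a node gets pinged (placeholder value)
    ping_interval: 5s
    # how long to wait for a ping response (placeholder value)
    ping_timeout: 60s
    # how many failed pings before the node is treated as gone (placeholder value)
    ping_retries: 6

Longer timeouts only help a node ride out a short outage; a node that has
already been dropped from the cluster would still need a restart to rejoin.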

-shay.banon

On Fri, Nov 5, 2010 at 4:00 PM, Kenneth Loafman kenneth@loafman.com wrote:


