Master keeps forgetting nodes

I have 2 EC2 instances in an AWS account where it appears that the master
keeps forgetting about the slave node.

In the slave node logs (I removed the IPs and timestamps for simplicity; the
master is "Cordelia Frost" and the slave is "Chronos"):

[discovery.zen.fd] [Chronos] [master] pinging a master [Cordelia Frost] but we do not exists on it, act as if its master failure
[discovery.zen.fd] [Chronos] [master] stopping fault detection against master [Cordelia Frost], reason [master failure, do not exists on master, act as master failure]
[discovery.ec2] [Chronos] master_left [Cordelia Frost], reason [do not exists on master, act as master failure]
[discovery.ec2] [Chronos] master left (reason = do not exists on master, act as master failure), current nodes: {[Chronos]}
[cluster.service] [Chronos] removed {[Cordelia Frost]}, reason: zen-disco-master_failed ([Cordelia Frost])
[discovery.ec2] [Chronos] using dynamic discovery nodes
[discovery.ec2] [Chronos] using dynamic discovery nodes
[discovery.ec2] [Chronos] using dynamic discovery nodes
[discovery.ec2] [Chronos] filtered ping responses: (filter_client[true], filter_data[false])
    --> ping_response{node [Cordelia Frost], id[353], master [Cordelia Frost], hasJoinedOnce [true], cluster_name[cluster]}
[discovery.zen.publish] [Chronos] received cluster state version 232374
[discovery.zen.fd] [Chronos] [master] restarting fault detection against master [Cordelia Frost], reason [new cluster state received and we are monitoring the wrong master [null]]
[discovery.ec2] [Chronos] got first state from fresh master
[cluster.service] [Chronos] detected_master [Cordelia Frost], added {[Cordelia Frost]}, reason: zen-disco-receive(from master [Cordelia Frost])

"Chronos" then receives the cluster state and everything goes back to
normal.
This happens about on quite regular intervals (usually once per hour,
although some times it takes more time to happen). Any idea of what can be
causing this?

I have a ping timeout of 15s on discovery.ec2, so I think that ping latency
should not be the problem. I also do hourly snapshots with curator, in case
that's relevant.
Finally, I also have another Elasticsearch cluster with the same
configuration on a different AWS account (used for testing purposes), and
that problem has never occurred there. Can this be related to the AWS region?
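
In case it helps pin down the timing, here is a rough sketch of what I plan
to run on the slave to timestamp the failures and see whether they line up
with the hourly curator runs (the localhost:9200 endpoint, the 60-second
interval and the script name are just assumptions for my setup):

# poll_cluster.py - rough sketch to timestamp the "master left" windows.
# Assumes the node answers HTTP on localhost:9200; adjust as needed.
import json
import time
import urllib.request

NODE = "http://localhost:9200"   # the slave ("Chronos") in my case
INTERVAL = 60                    # seconds between checks

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        with urllib.request.urlopen(NODE + "/_cluster/health", timeout=10) as resp:
            health = json.loads(resp.read().decode("utf-8"))
        # With only 2 nodes, number_of_nodes dropping to 1 marks the failure window.
        print(stamp, "status=" + health["status"], "nodes=" + str(health["number_of_nodes"]))
    except Exception as exc:
        print(stamp, "request failed:", exc)
    time.sleep(INTERVAL)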

Slight update: the same problem also happens on another cluster with the
same configuration on another AWS account.
While this does not happen on my test account, that's probably because
those instances are regularly rebooted.

Are you running across AZs, or regions?

The next time this happens, can you check the _cat outputs? Also take a look
at "Hanging transport connection thread on EC2"
(https://github.com/elastic/elasticsearch/issues/10447) and see if it's
similar behaviour.

All machines are in the same region, but in different AZs.

When you say "check the _cat outputs", you mean making a call to
_cat/indices or _cat/shards when I know that the cluster is down, correct?
I'll try to do that, then.
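
Something like this is what I had in mind for grabbing a few _cat outputs
from both nodes at once when it happens, so I can compare their views of the
cluster (a quick sketch; the two host addresses and the script name are
placeholders for my actual instances):

# cat_check.py - quick sketch to dump a few _cat endpoints from both nodes
# during a failure window. The host addresses below are placeholders.
import urllib.request

NODES = {
    "Cordelia Frost (master)": "http://master-host:9200",
    "Chronos (slave)": "http://slave-host:9200",
}
ENDPOINTS = ["/_cat/master?v", "/_cat/nodes?v", "/_cat/pending_tasks?v"]

for name, base in NODES.items():
    print("=====", name, "=====")
    for path in ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=10) as resp:
                print(path)
                print(resp.read().decode("utf-8"))
        except Exception as exc:
            print(path, "failed:", exc)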

Both _cat/indices and _cat/shards appear to be working during the cluster
failure.
