Nodes randomly disconnected from the ES cluster


(Anil Karaka) #1

I grepped for "removed" in the master node's log, and these are the entries I see.

[2015-04-01 05:32:55,813][INFO ][cluster.service ] [ESBigNode3]
removed
{[ES30GBNode2][Yf8ODQh0TE2_0hQ35Y0M_w][ip-153-31-43-55][inet[/153.31.73.55:9300]],},
reason:
zen-disco-node_failed([ES30GBNode2][Yf8ODQh0TE2_0hQ35Y0M_w][ip-153-31-73-55][inet[/153.31.73.55:9300]]),
reason transport disconnected
[2015-04-01 05:33:02,048][INFO ][cluster.service ] [ESBigNode3]
removed
{[ES30GBNode1][0CRaC261RXy8JfGc1XNLZA][ip-153-31-76-111][inet[/153.31.76.111:9300]],},
reason:
zen-disco-node_failed([ES30GBNode1][0CRaC261RXy8JfGc1XNLZA][ip-153-31-36-101][inet[/153.31.36.101:9300]]),
reason transport disconnected
[2015-04-01 05:33:09,702][INFO ][cluster.service ] [ESBigNode3]
removed
{[ESBigNode5][PaNaDPwfSM-jUpGa8HQJmQ][esnode5][inet[/153.31.70.128:9300]],},
reason:
zen-disco-node_failed([ESBigNode5][PaNaDPwfSM-jUpGa8HQJmQ][esnode5][inet[/153.31.70.128:9300]]),
reason transport disconnected
[2015-04-01 05:33:13,964][INFO ][cluster.service ] [ESBigNode3]
removed
{[ESBigNode1][ihJU17ToQVit9BxNzQjhnQ][esnode1][inet[/153.31.75.190:9300]],},
reason:
zen-disco-node_failed([ESBigNode1][ihJU17ToQVit9BxNzQjhnQ][esnode1][inet[/153.31.35.190:9300]]),
reason transport disconnected

And on the data node, this is what a node leaving the cluster looks like
in its log files.

[2015-01-22 20:49:56,860][WARN ][discovery.ec2 ] [ESBigNode1]
master left (reason = do not exists on master, act as master failure),
current nodes:
{[ESBigNode2][zVdCNza9Qk-v-Usu66jcvw][ip-153-31-73-29][inet[/153.31.73.29:9300]],[ESBigNode4][-8pj8n2sS5GB4XTIE0zudQ][ip-153-31-74-230][inet[/153.31.74.230:9300]],[ESBigNode1][nU6bkV-SSb6rvLHsth9AQg][ip-153-31-75-190][inet[/153.31.75.190:9300]],}

That is 4 nodes leaving the 7-node cluster at the same time, and the cluster
stays in a red state for a few minutes, not just yellow.
Although 4 nodes leaving the cluster at once is rare, single nodes leave the
cluster very often.
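
To put a number on "very often", the "removed" entries can be tallied per node instead of eyeballing grep output. The following is only a rough sketch: it assumes each entry sits on a single line in the real log file (the excerpts above are wrapped by the mail client), and the regular expression and the function name `count_removals` are illustrative, not part of Elasticsearch.

```python
import re
from collections import Counter

# Matches one single-line "removed" entry written by cluster.service.
# Assumption: the on-disk log keeps each entry on one line.
REMOVED = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[INFO\s*\]\[cluster\.service\s*\]\s*"
    r"\[(?P<master>[^\]]+)\]\s*removed\s*\{\[(?P<node>[^\]]+)\].*?"
    r"reason:\s*(?P<reason>zen-disco-[a-z_]+)"
)


def count_removals(lines):
    """Count removal events per (node name, reason) pair."""
    counts = Counter()
    for line in lines:
        match = REMOVED.search(line)
        if match:
            counts[(match.group("node"), match.group("reason"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_removals.py /path/to/master-node.log
    with open(sys.argv[1], encoding="utf-8") as handle:
        for (node, reason), n in count_removals(handle).most_common():
            print(f"{node}\t{reason}\t{n}")
```

Sorting by count makes it easier to see whether the same few nodes keep dropping out or whether the disconnects are spread evenly across the cluster.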

As discussed in this thread,
https://groups.google.com/forum/#!msg/elasticsearch/ixoAF9Yur0E/CgX4Hbk1ynYJ
I will change discovery.zen.ping.timeout to 10 seconds. What else can I do?

There is also an older thread from 2012 that suggests changing the OS
settings for IPv4 TCP keepalive. Do I have to change those settings as well?
https://groups.google.com/forum/#!msg/elasticsearch/c9JmaiVfBb0/9XZM6ZJpoBwJ
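
Before changing anything, it may help to see what the keepalive values currently are on each node. This is a minimal sketch assuming a Linux host, where the three kernel parameters are exposed under /proc/sys/net/ipv4; the helper names are made up for illustration. The values only apply to sockets that have SO_KEEPALIVE enabled.

```python
from pathlib import Path

# Standard Linux kernel parameters for TCP keepalive.
KEEPALIVE_KEYS = (
    "tcp_keepalive_time",    # idle seconds before the first probe
    "tcp_keepalive_intvl",   # seconds between probes
    "tcp_keepalive_probes",  # unanswered probes before the peer is declared dead
)


def read_keepalive(proc_dir="/proc/sys/net/ipv4"):
    """Read the current keepalive settings as integers."""
    base = Path(proc_dir)
    return {key: int((base / key).read_text().strip()) for key in KEEPALIVE_KEYS}


def seconds_until_dead_peer_detected(settings):
    """Rough upper bound on how long an idle, broken connection goes unnoticed."""
    return (
        settings["tcp_keepalive_time"]
        + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"]
    )


if __name__ == "__main__":
    current = read_keepalive()
    print(current)
    print("approx. detection time (s):", seconds_until_dead_peer_detected(current))
```

With the usual Linux defaults (7200, 75, 9) this works out to 7875 seconds, which is why threads like the one linked above suggest lowering them when something between the nodes silently drops idle connections.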

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/727d0b5f-1dbf-4ce6-ab11-067b20513c76%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Anil Karaka) #2

I also run on Amazon with the AWS cloud plugin and discover my nodes based on
the security group.

Should I switch to unicast discovery instead?



(Anil Karaka) #3

[2015-04-01 07:23:09,550][INFO ][cluster.service ] [ESBigNode5]
removed
{[ESBigNode3][X_IyUwkrQe-ae15VVKltDw][esnode3][inet[/153.31.73.30:9300]],},
reason: zen-disco-master_failed
([ESBigNode3][X_IyUwkrQe-ae15VVKltDw][esnode3][inet[/153.31.73.30:9300]])
[2015-04-01 07:24:20,456][INFO ][cluster.service ] [ESBigNode5]
detected_master
[ESBigNode3][X_IyUwkrQe-ae15VVKltDw][esnode3][inet[/153.31.73.30:9300]],
added
{[ESBigNode3][X_IyUwkrQe-ae15VVKltDw][esnode3][inet[/153.31.73.30:9300]],},
reason: zen-disco-receive(from master
[[ESBigNode3][X_IyUwkrQe-ae15VVKltDw][esnode3][inet[/153.31.73.30:9300]]])

This is how a node leaves and rejoins the cluster.
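
The gap between the "removed" and "added" entries above is how long the node was out of the cluster. A small sketch for computing it from the bracketed timestamps follows; the function names are hypothetical, and it assumes the timestamp is always the first bracketed field of an entry.

```python
from datetime import datetime


def parse_ts(entry):
    """Parse the leading [YYYY-MM-DD HH:MM:SS,mmm] field of a log entry."""
    raw = entry[1:entry.index("]")]
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(removed_entry, added_entry):
    """Seconds elapsed between a 'removed' entry and the matching 'added' entry."""
    return (parse_ts(added_entry) - parse_ts(removed_entry)).total_seconds()


if __name__ == "__main__":
    removed = "[2015-04-01 07:23:09,550][INFO ][cluster.service ] [ESBigNode5] removed ..."
    added = "[2015-04-01 07:24:20,456][INFO ][cluster.service ] [ESBigNode5] detected_master ..."
    print(seconds_between(removed, added))
```

For the two entries in this post the gap is roughly 71 seconds.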



(Tomer Levy) #4

We're experiencing a similar issue with one of our clusters on EC2. It was
running 1.4.4, and it still happens after upgrading to 1.5.0. We see "master
left" messages at random, and the nodes reconnect after a couple of
minutes. We have 4 data nodes and 3 master nodes (and a few client nodes).

master left (reason = do not exists on master, act as master failure)

Any thoughts?



(Mark Walkom) #5

When you see this, can you check whether _cat/indices, _cat/shards and
_cat/nodes return a response?
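
One way to run that check quickly during an incident is a small script that hits all three endpoints and reports which ones answer. This is just a sketch: the base URL `http://localhost:9200` and the 10-second timeout are assumptions, and the function names are illustrative.

```python
import urllib.request

CAT_ENDPOINTS = ("_cat/indices", "_cat/shards", "_cat/nodes")


def cat_urls(base_url="http://localhost:9200"):
    """Build the three _cat URLs, with ?v so the output includes column headers."""
    return [f"{base_url.rstrip('/')}/{endpoint}?v" for endpoint in CAT_ENDPOINTS]


def check_cat_endpoints(base_url="http://localhost:9200", timeout=10):
    """Return the HTTP status per URL, or the error text if the call failed."""
    results = {}
    for url in cat_urls(base_url):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                results[url] = response.status
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            results[url] = repr(exc)
    return results


if __name__ == "__main__":
    for url, outcome in check_cat_endpoints().items():
        print(url, "->", outcome)
```

Running it against both the master and one of the dropped nodes shows whether the nodes are still answering HTTP while they are out of the cluster.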



(Tomer Levy) #6

The link below seems like a good direction for solving the problem:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811

Tomer Levy
CEO, Co-Founder, Logz.io
p:+972-544235023 | e:tomer@logz.io | w:on.logz.io/1C2UlMi | a:
+1-617-314-3318
http://il.linkedin.com/pub/tomer-levy/1/950/360
http://twitter.com/tomerlevy



ES nodes disconnects intermittently from the cluster
#7

I think I'm facing the same issue - random disconnects, elastic cluster on EC2. Did you ever manage to solve this issue?


(Mike Salmon) #8

Anyone here manage to work out what this issue is? Having the same problem at the moment.