Very regular disconnect and recovery - every 2 hours

Hi,

I have two independent clusters running across more or less the same
machines. They're split across a pretty high-bandwidth, relatively
low-latency VPN link. One cluster is running v1.0.1 and seems to stay up
all the time. The other cluster is currently running 1.4.4 (and was running
1.4.2 before that) and seems to disconnect like clockwork every two hours.
The disconnect of the nodes on one side of the link is brief; they rejoin
and recovery proceeds as normal. Any ideas what might cause this? Could
it be data related? The newer cluster has more indexes & shards than the
old one, but the master-eligible nodes (3 of them, with minimum master
count set to 2) don't seem particularly stressed. Any thoughts on what,
specifically, to look for, or whether any particular setting or code change
might make the cluster more susceptible to disconnect when there's a
minor/brief network connectivity blip?

(and yes, I know multi-site isn't a recommended configuration - there are
other challenges for us with the tribe node approach too, though :( )

Thanks in advance for any ideas or insight.

N

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b00b8bda-9238-47e8-b0f2-3d4d6751b3c2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

It's not the VPN reconnecting is it?

On 31 March 2015 at 01:32, Neil Andrassy neil.andrassy@thefilter.com
wrote:


It's probably something like that, but it only seems to be a problem with
the more up-to-date version of ES. I'm keen to work out whether there's a
configuration option I can tweak in 1.4.4 to make ES more robust in this
scenario, or whether there's an issue around recovering dropped TCP
connections between nodes in more recent versions.

On Tuesday, 31 March 2015 03:33:18 UTC+1, Mark Walkom wrote:


You can try winding out the timeouts; see
http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#fault-detection

On 31 March 2015 at 16:57, Neil Andrassy neil.andrassy@thefilter.com
wrote:


Both clusters have the following settings, so if that's related, I think
there must be another contributing factor...

"discovery.zen.fd.ping_interval" : "1s",
"discovery.zen.fd.ping_timeout" : "60s",
"discovery.zen.fd.ping_retries" : "3",

On 31 March 2015 at 07:32, Mark Walkom markwalkom@gmail.com wrote:


--
Neil Andrassy | CTO | The Filter
phone | +44 (0)1225 588 004
skype | andrassynp


Further to this, the cluster that's failing has far more shards than the
one that stays up. We have a number of daily date-stamped indexes, each
retaining 90 days of data, amounting to 7500+ shards.

Looking through GitHub, it looks like some underlying changes might have
impacted the scalability of clusters with larger numbers of shards
(#9683 "[STORE] Cache fileLength for fully written files", #9212 "ES 1.4.2
random node disconnect" and #9709 "[STORE] Add simple cache for StoreStats"
seem related). It looks like there are some improvements in 1.5.0
(particularly #9709). We're in the process of testing the upgrade (although
there's a regression that's holding us up slightly - exists filters don't
work on indexes created prior to 1.3.0; #10268 "Mappings: Fix _field_names
to be disabled on pre 1.3.0 indexes" fixes this in 1.5.1 when it arrives).
We have also attempted to reduce the number of shards where possible as a
short-term workaround. We also found that CPU usage is VERY high on the
nodes that stayed in the cluster during recovery - we've dropped concurrent
shard recoveries to 1 (from the default of 3) to ease this situation.
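For anyone wanting to try the same throttling, it can be applied at runtime
via the cluster settings API - a sketch only, assuming a node reachable on
localhost:9200; both setting names exist in ES 1.x, but check the exact
defaults for your version:

```shell
# Throttle shard recoveries cluster-wide without a restart. "transient"
# settings are lost on a full cluster restart; use "persistent" to keep them.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 1,
    "indices.recovery.concurrent_streams": 1
  }
}'

# Verify what the cluster is actually using:
curl -XGET 'http://localhost:9200/_cluster/settings?pretty'
```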

We're also considering weekly or even monthly rolling indexes to reduce
shard count but that's a bigger code change in our application.

Watch this space...


Okay, looks like we had a couple of issues going on here, both of which I
think we have resolved:

  1. ES 1.4.4 was hammering CPU on the retained nodes during recovery, made
    worse by the fact that we'd dropped multiple nodes rather than just one.
    Reducing concurrent recoveries helped somewhat, but upgrading ES to 1.5.0
    has returned CPU usage during recovery to a normal level.

  2. The regular drop seems to be related to our VPN killing inactive TCP
    connections (as suspected). It seems like termination of any one of the
    many inter-node TCP connections results in all the other open and active
    connections also being dropped (including ping and cluster state). Maybe
    this behaviour has changed since the 1.0.* versions? Altering the OS TCP
    keep-alive time down from 2 hours to 5 minutes has resolved the issue and
    our cluster now remains up and stable. Be warned, though, the keep-alive
    settings are at the OS (network interface) level and, on Windows at least,
    required a full server reboot.
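For anyone hitting the same thing, the OS-level change described in point 2
looks roughly like this (a sketch: the registry path and sysctl name are the
standard ones, the 5-minute value is just what worked for us, and the
Windows change needed a full reboot in our case):

```shell
# Windows: set the TCP keep-alive idle time to 5 minutes (value is in
# milliseconds; the Windows default is 7200000, i.e. 2 hours). Reboot after.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v KeepAliveTime /t REG_DWORD /d 300000 /f

# Linux equivalent: start sending keep-alive probes after 5 minutes idle.
sysctl -w net.ipv4.tcp_keepalive_time=300
# Persist across reboots:
echo 'net.ipv4.tcp_keepalive_time = 300' >> /etc/sysctl.conf
```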

Hope this is useful to others, and maybe it can even go back into the pot
for 2.0 resiliency (if it hasn't already) - better handling of
dropped/terminated but re-establishable connections?

Thanks all,

N


I doubt this can be a task for 2.0 resiliency when the effect can be traced
to a Windows TCP/IP socket default behaviour together with a possibly
aggressive VPN implementation.

ES already uses TCP keep-alive by default; see
http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#tcp-settings

What may be worth examining is why a connection stays open but no longer
sends keep-alive packets, even though the ES default configuration tells
another story. Network status dumps of open TCP connections could
be helpful.
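A sketch of such a dump, assuming the default transport port of 9300
(adjust if yours differs):

```shell
# Linux: list established connections on the transport port, including the
# per-socket timer state ("-o" shows e.g. "timer:(keepalive,117min,0)").
ss -tnop | grep ':9300'

# Windows: dump connections on the transport port (no timer info, but it
# shows which inter-node connections are actually open, and to which PID).
netstat -ano | findstr ":9300"
```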

Jörg

On Fri, Apr 10, 2015 at 12:35 PM, Neil Andrassy neil.andrassy@thefilter.com
wrote:
