I have two independent clusters running across more or less the same machines. They're split across a pretty high-bandwidth and relatively low-latency VPN link. One cluster is running v1.0.1 and seems to stay up all the time. The other cluster is currently running 1.4.4 (and was running 1.4.2 before that) and seems to disconnect like clockwork every two hours. The disconnect of the nodes on one side of the link is brief; they rejoin and recovery proceeds as normal. Any ideas what might cause this? Could it be data related? The newer cluster has more indexes & shards than the old one, but the co-ordinators (three of them, with minimum master count set to 2) don't seem particularly stressed. Any thoughts on what, specifically, to look for, or on whether any particular setting or code change might make the cluster more susceptible to disconnect when there's a minor/brief network connectivity blip?
(And yes, I know multi-site isn't a recommended configuration - there are other challenges for us with the tribe node approach too, though.)
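For what it's worth, the knobs I had in mind are the zen fault-detection settings, which control how quickly a node is dropped after missed pings. A purely illustrative sketch of the sort of thing in elasticsearch.yml (the values below are made up for the example, not what we actually run):

discovery.zen.minimum_master_nodes: 2
# defaults are ping_interval 1s, ping_timeout 30s, ping_retries 3; larger values
# tolerate a brief blip at the cost of slower detection of genuinely dead nodes
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6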
It's probably something like that, but it only seems to be a problem with the more up-to-date version of ES. I'm keen to work out whether there's a configuration option I can tweak in 1.4.4 to make ES more robust in this scenario, or whether there's an issue around recovering dropped TCP connections between nodes in more recent versions.
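One thing I'm assuming (and would love someone to confirm) is that ES already asks the OS for TCP keep-alives on its transport connections - as far as I can tell the relevant elasticsearch.yml setting defaults to on:

# network.tcp.keep_alive only requests SO_KEEPALIVE; how often probes are
# actually sent is governed by the operating system, not by ES
network.tcp.keep_alive: true

So if something on the path is silently dropping idle connections, the OS-level keep-alive interval (two hours by default on most systems) may simply be too long to help.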
Further to this, the cluster that's failing has far more shards than the one that stays up. We have a number of daily date-stamped indexes, each retaining 90 days of data, amounting to 7,500+ shards.
Looking through GitHub, it looks like some underlying changes might have impacted the scalability of clusters with larger numbers of shards ([STORE] Cache fileLength for fully written files #9683, ES 1.4.2 random node disconnect #9212 and [STORE] Add simple cache for StoreStats #9709 seem related). It looks like there are some improvements in 1.5.0 (particularly #9709). We're in the process of testing the upgrade, although there's a regression that's holding us up slightly: exists filters don't work on indexes created prior to 1.3.0 (Mappings: Fix _field_names to be disabled on pre 1.3.0 indexes #10268 - fixed in 1.5.1 when it arrives). We have also attempted to reduce the number of shards where possible as a short-term workaround. We also found that CPU usage is VERY high on the nodes that stayed in the cluster during recovery, so we've dropped concurrent shard recoveries to 1 (from the default of 3) to ease this situation (see the sketch below).
We're also considering weekly or even monthly rolling indexes to reduce
shard count but that's a bigger code change in our application.
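For reference, the concurrent-recovery throttling mentioned above was just a dynamic cluster settings change. I'm quoting from memory, so treat the exact key as an assumption (it may have been indices.recovery.concurrent_streams rather than node_concurrent_recoveries in our case), but it was something along these lines:

# assumption: exact setting name reconstructed from memory, verify against your version
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 1
  }
}'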
Okay, looks like we had a couple of issues going on here, both of which I think we have resolved:
1. ES 1.4.4 was hammering CPU on the retained nodes during recovery, made worse by the fact that we'd dropped multiple nodes rather than just one. Reducing concurrent recoveries helped somewhat, but upgrading ES to 1.5.0 has returned CPU usage during recovery to a normal level.
2. The regular drop seems to be related to our VPN killing inactive TCP connections (as suspected). It seems like termination of any one of the many inter-node TCP connections results in all the other open and active connections also being dropped (including ping and cluster state). Maybe this behaviour has changed since the 1.0.* versions? Altering the OS TCP keep-alive time down from 2 hours to 5 minutes has resolved the issue and our cluster now remains up and stable. Be warned, though: the keep-alive settings are at the OS (network interface) level and, on Windows at least, required a full server reboot.
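For anyone else on Windows, the value we changed was the standard TCP KeepAliveTime registry entry (milliseconds; the default of 7200000 is the 2-hour interval, 300000 is 5 minutes). Roughly the following, then a reboot - please verify against your own environment rather than taking my word for it:

:: KeepAliveTime is in milliseconds; 300000 = 5 minutes (default 7200000 = 2 hours)
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v KeepAliveTime /t REG_DWORD /d 300000 /f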
Hope this is useful to others and maybe can even go back into the pot for
2.0 resiliency (if it hasn't already) - better handling of
dropped/terminated but re-establishable connections?
I doubt this can be a task for 2.0 resiliency when the effect can be traced to a Windows TCP/IP socket default behavior together with a possibly aggressive VPN implementation.
What may be worth examining is why a connection stays open but no longer sends keep-alive packets, even though the ES default configuration tells another story. Network status dumps of open TCP connections could be helpful.
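For example, on the Windows nodes something along these lines would show whether the inter-node connections are still in ESTABLISHED state around the time of the drop, and which process owns them:

:: assumes the default transport port (9300); adjust if yours differs
netstat -ano | findstr :9300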