Temporary Network Lag Failover

Benjamin_Vetter · September 17, 2013, 7:19am

In my hypothetical setup (3 nodes, 5 shards, 1 replica): if there is some
temporary network lag (let's say 30 min or so) between 2 nodes, where the
node still answers pings etc., but only slowly/delayed,
the cluster's response times for search and other operations will go up
more or less arbitrarily ... is this correct?

I'd obviously like to keep at least search as fast as possible, so that
search response times aren't harmed by the laggy box for the full 30 min
of lagginess, but how?

The only two ways I can imagine are a) to remove the laggy box from the
cluster by setting a much lower discovery.zen.fd.ping_timeout,
but that does not seem to be a good idea ... or b) to temporarily exclude
the box from search by setting ?preference=_shards:1,2,3,...
such that shards located on the laggy box aren't listed. The boxes could
e.g. be monitored/pinged externally (not in ES) and the results written to
redis or similar.
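
A minimal sketch of option b), assuming the latency measurements and the shard-to-node mapping are collected outside of ES. The names `laggy_nodes`, `shards_preference`, the node names and the 200 ms threshold are all hypothetical, and the `_shards:` prefix is assumed to be the documented form of the shards preference. Note that dropping shards this way returns partial results.

```python
from typing import Dict, Iterable, Set


def laggy_nodes(latency_ms: Dict[str, float], threshold_ms: float) -> Set[str]:
    """Return the nodes whose externally measured latency exceeds the threshold."""
    return {node for node, ms in latency_ms.items() if ms > threshold_ms}


def shards_preference(shard_copies: Dict[int, Iterable[str]], laggy: Set[str]) -> str:
    """Build a preference value listing only shards with no copy on a laggy node."""
    usable = sorted(
        shard for shard, nodes in shard_copies.items()
        if not laggy.intersection(nodes)
    )
    if not usable:
        raise ValueError("every shard has a copy on a laggy node")
    return "_shards:" + ",".join(str(shard) for shard in usable)


# Hypothetical measurements, e.g. read back from redis by the search client.
latency = {"node1": 3.0, "node2": 4.5, "node3": 850.0}
# Hypothetical layout: 5 shards, each with a primary and one replica.
copies = {
    0: ["node1", "node2"],
    1: ["node2", "node3"],
    2: ["node1", "node3"],
    3: ["node1", "node2"],
    4: ["node2", "node3"],
}
preference = shards_preference(copies, laggy_nodes(latency, threshold_ms=200.0))
```

With 3 nodes and 1 replica, most shards have a copy on any given box, so only a few shards survive the filter in this example.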

Is this a reasonable idea or do you know of something better?

To implement such a thing it would be really helpful to have an
_only_nodes preference where you can list multiple node ids
or better: IP addresses of the boxes you want to search on exclusively
(feature request?!).
I think it would generally be a good idea to allow IP addresses to be used
in addition to node ids for the _only_node and _prefer_node preferences.

Thank you very much for ES and your help

-- benjamin

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · September 17, 2013, 8:48am

Network lags are external challenges, and it is quite a bad idea to build a
distributed, scalable, reliable and fast search system on a network where
such conditions are known beforehand.

The internal assumption of ES architecture is a local network of nodes
where the latency is low. There are no precautions for operating in a
degraded environment or in a high latency environment - everything relies
on the presence of nodes that hold the necessary shards. If nodes go away,
a recovery is started to replace the presence of missing shards, to ensure
completeness of results and fast responses.

Because network lags are unpredictable, the affected nodes are
unpredictable. Things like _only_node parameters, which would enforce
degraded or empty result sets instead of accepting the shard replacement
strategy of the replica mechanism, would be very confusing to use. It would
also mean establishing a principle of brokenness, which annoys the user.

Instead of playing tricks, it is always a better option to put the network
state into a good condition so the cluster can operate in a healthy green
state.

Jörg


Benjamin_Vetter · September 17, 2013, 9:34am

Thanks for your answer. However, I can't fully agree. It's not that easy to
keep the network in good condition 100% of the time.
If you rent servers at e.g. Hetzner, AWS or any other hosting
company you will definitely see temporary network issues and lags,
because there is e.g. some sort of attack or other issue going on
in the network segment one of your servers sits in, and you can't do
anything about it.
The more servers, the more risk (unless you have so many servers that only
a small percentage of your users will be affected - but that's a lot of
servers).
Instead of ignoring the risk, I want to reduce its possible harm.

I wouldn't say I want to play tricks. Instead, I want to add some sort of
QoS layer regarding search response time on top of ES.
Isn't that reasonable?


polyfractal · September 17, 2013, 1:21pm

If your cluster topology has some sort of structure (e.g. two racks, or two
different rooms in a data-center, etc) you can use Allocation Awareness to
help move the load over to the section of your cluster that is unaffected.
Search is preferentially served from boxes with the same tag as the
coordinating node.

If you don't have any known structure (just a bag of nodes), it's very hard
to do this sort of operation. Allocation awareness ensures that you have a
copy of data in each section of your cluster...but if you don't know about
those structures a priori it isn't possible to load balance. These
settings are dynamic, so you could always add the tags once you notice a
lag problem, but now you are adding shard relocation stress on top of an
already laggy node. Depending on your cluster topology and replica
distribution, you could disable shard allocations too, which would prevent
moving shards around.
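
As a rough sketch, the two dynamic settings mentioned above could be sent as transient cluster-settings updates. The helper `cluster_settings_request` and the attribute name `rack_id` are hypothetical, and the name of the setting that disables allocation is an assumption that depends on the ES version. The snippet only builds the requests; it does not send them.

```python
import json
from typing import Any, Dict, Tuple


def cluster_settings_request(settings: Dict[str, Any]) -> Tuple[str, str, str]:
    """Describe a transient cluster-settings update as (method, path, JSON body)."""
    body = {"transient": settings}
    return "PUT", "/_cluster/settings", json.dumps(body, sort_keys=True)


# "rack_id" is a hypothetical attribute name; each node also needs a
# matching tag in its own configuration for awareness to have any effect.
enable_awareness = cluster_settings_request(
    {"cluster.routing.allocation.awareness.attributes": "rack_id"}
)

# Assumption: this is the 0.90-era setting name; later releases use
# cluster.routing.allocation.enable instead.
disable_allocation = cluster_settings_request(
    {"cluster.routing.allocation.disable_allocation": True}
)
```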

If you fully replicate your data such that each node has a copy of all
shards, you can service requests with ?preference=_local, which will use
local shard data where available. Fully replicating your data may be too
large of a cost, however, and not something you can do.
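
For illustration, a search URL with the preference parameter could be built like this. The helper `search_url`, the host and the index name are hypothetical; only the `preference=_local` query parameter comes from the discussion above.

```python
from urllib.parse import urlencode


def search_url(base_url: str, index: str, preference: str) -> str:
    """Build a _search URL that carries the given preference value."""
    query = urlencode({"preference": preference})
    return f"{base_url.rstrip('/')}/{index}/_search?{query}"


# Hypothetical host and index name.
url = search_url("http://localhost:9200", "myindex", "_local")
```

The request would be sent to the node that should serve it from its own shard copies, so the client still has to pick a healthy node to talk to.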

Your suggestion of an _only_nodes setting seems dangerous: search results
will be degraded (which is tolerable) but indexing requests will start to
fail when they need the primary shards held on that node. If this is the
behavior you want, simply shutting down the node would be easier/more
robust.

-Zach

