Multicast ping vs port 9300

Hi,
I'm having a problem where the ES cluster sometimes thinks it has lost one of its nodes even though the node is still there.
As part of my investigation, I'm monitoring multicast ping on 244.2.4.4:54328.
I see pings when a node starts, but after that I don't see periodic pings.
My guess is that once a node is detected, port 9300 is used for a heartbeat or similar mechanism to detect the existence of the node, but that is just a guess.

Can anybody give me an outline of how zen discovery and fault detection actually work together?
Or if you could point me in the direction of any guide to the different states/phases of zen discovery, it would be greatly appreciated.

Hey,
Did you find the answer to your question? I'm curious about this too.

Hi!

Since I had to dive into that code at some point to answer the same questions, here's what I remember:

Firstly, discovery and fault detection are really two different things: the fact that you used Zen Discovery to figure out which nodes exist doesn't affect what happens during the fault detection phase.

After the initial round of discovery and master election, the master will ping each node using the configured values for ping interval, retries, and ping timeout. The defaults are a 1 second interval, 3 retries, and a 30 second timeout, respectively.
That code is in NodesFaultDetection.java.
Each node in the cluster knows which other nodes exist and which node is supposed to be the master. Each node therefore keeps track of the master and pings it, using the same config values.
This code is in MasterFaultDetection.java.
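
If you want to see or change those values, they live in elasticsearch.yml. A minimal sketch with the defaults I mentioned, assuming the 0.90-era setting names for zen fault detection (double-check them against NodesFaultDetection.java for your version):

    # fault detection pings between the master and the other nodes
    discovery.zen.fd.ping_interval: 1s
    discovery.zen.fd.ping_timeout: 30s
    discovery.zen.fd.ping_retries: 3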

I actually added log statements to the code to figure out what's going on.
The most surprising thing was that only the master pings each node in the cluster (except itself of course). The other nodes only ping the master and rely on published cluster state from the master (or the newly elected master in case of a partition).
In case you are wondering how master election works, it's really simple: each node assigns itself a random identifier at startup, and each node keeps a list of all known identifiers in the cluster. That list is sorted lexically, and the first live node in the list becomes the new master. Since all nodes use the same list (it's part of the cluster state) and the same sort order, this works.
However, in split-brain situations this leads to problems, because two nodes might each rightfully think they are the new master.

That is especially bad if you use unicast discovery, because I've found that Elasticsearch makes no attempt to reconnect to nodes it deems dead after a partition, and if you haven't configured discovery.zen.minimum_master_nodes correctly (especially if you left it at the default of 1), each node will happily form its own cluster.

I might be wrong about that last part for higher values of minimum_master_nodes; the code does look like it attempts to reconnect once it has completely lost its connections.
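
For reference, the usual guard against that is to set minimum_master_nodes to a majority of your master-eligible nodes, i.e. (N / 2) + 1. A sketch for a three-node cluster (illustrative values, not taken from your setup):

    # elasticsearch.yml on each node of a 3-node cluster:
    # a majority of 3 master-eligible nodes is (3 / 2) + 1 = 2
    discovery.zen.minimum_master_nodes: 2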

However, and this is the problematic part we've seen before, if your split brains and ping timeouts happen because nodes run into long garbage collection pauses (without actually hitting OOM errors), nodes may see each other only intermittently, and a node can end up being considered part of two halves of the cluster at once. This leads to really funky situations where the cluster is non-functional, making no progress whatsoever, yet appears partially "online".
So far I've only seen that with extreme garbage collection problems, not with actual network-level partitions. But those are rare anyway and cause all kinds of other problems, so it might have been masked.

That's what I remember from it; I'm not sure what problem you are actually trying to solve here.

best,
-k

Hi Kay,

Oops, replied to you directly by mistake. I've added this to the main
thread now.

I didn't know about the low, medium and high connections. That's awesome.
I'll tweak that around to figure out the sweet spot.

I ran some tests and garbage collection does seem to be the problem. Quite
strange since my heap is consumed quite slowly. And GC is uniformly high;
shouldn't there be spikes only when the heap is being cleared out? Thread
pool counts seem fine to me. I don't think that's the issue.

I optimized a few queries, added a couple of nodes and closed a few really small indices, and things are a bit better now, though not ideal. I see your point regarding the 30s ping timeout, but I'm okay with that delay, at least for the moment. My bigger concern is making sure that nodes communicate with each other and don't return "invalid internal transport message format" errors. Thankfully, those have ceased to occur, though I can't for the life of me figure out why they occurred in the first place.

Thanks a lot! This is very useful to me.

Best,
Govind

On Fri, Mar 15, 2013 at 2:36 PM, Kay Röpke kroepke@gmail.com wrote:

On Mar 14, 2013, at 10:12 PM, Govind Chandrasekhar govind@semantics3.com wrote:

Hey,

Thanks for that. Really insightful. A lot of my understanding of
Elasticsearch is fragmented and from the forum/code-base, so it's good to
get a consolidated explanation of the system.

So I'm assuming these regular pings happen over port 9300? Do they
"compete" with all the other traffic that runs between these nodes? The
problem I'm facing is that nodes get disconnected every once in a while due
to ping timeouts. I suspect that this might be due to heavy traffic on 9300
during merges / relocation / recovery / GC. I wonder if there's a way to
route these pings over a different port so that delayed ping responses
aren't misinterpreted as faults.

Yes, everything between the nodes happens on that port. I suppose one could write a different transport plugin that uses a different mechanism for inter-node communication, but I doubt this is actually the problem.

It's hard to tell if you have too much traffic between the nodes, but don't be misled by the fact that it listens on one port only. The Netty transport actually opens more than one connection between the nodes, and each node has different thread pools to handle different kinds of messages.
See NettyTransport.java for the connection logic (specifically the NodeChannels static class), and TransportRequestOptions.java for the three different priority connection types (LOW, MED, HIGH).

The number of connections per priority type is configurable:
http://goo.gl/swrzX
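
From memory, the relevant keys look something like the ones below; treat the exact names and defaults as an assumption of mine and verify them against the NodeChannels code before relying on them:

    # channels opened per remote node, by priority type
    # (setting names recalled from the 0.90-era NettyTransport; verify before use)
    transport.connections_per_node.low: 2
    transport.connections_per_node.med: 6
    transport.connections_per_node.high: 1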

Unfortunately none of that seems to be documented; it's "only" in the code.
I'm not sure if there's any kind of monitoring around the utilisation of the connection types, but I have never seen any.

AFAICS there are a couple of things that might influence your problem, if you've ruled out garbage collection as the cause: either your servers have saturated the worker thread pools so that messages are queueing up, or your network is actually slow at some point, which I doubt.

If it's the latter, have a look at the various net.tcp tuning parameters in Elasticsearch. If it's the former, have a look at the worker thread pool counts in the NettyTransport code.

If you set the ping interval to 30s and keep the default of 3 retries, be aware that it might take up to 90s to actually detect a real failure. That may or may not be acceptable in your case, but usually when something goes wrong you want the system to react quicker than 1.5 minutes. YMMV.
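
Concretely, assuming those are the discovery.zen.fd.* settings you raised (illustrative values, not your actual config), the tradeoff looks like this:

    # tolerating slow nodes at the cost of slower failure detection
    discovery.zen.fd.ping_interval: 30s
    discovery.zen.fd.ping_retries: 3
    # worst case to declare a node dead: roughly 3 retries x 30s = ~90s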

As to the *invalid internal transport message format* error, I've only ever seen that when a node in the cluster is running an incompatible version of Elasticsearch.
Sometimes even a different minor version can lead to an incompatible message format.
Why there is no version check for each and every message is beyond me, especially since it would be easy to do and could be really efficient, too. The error message, at least, is not really helpful.
So double-check your installed versions.

Note that the version problem also affects your clients if you are using
the node/transport client mode of the Java client. JSON over HTTP is
unaffected of course.

best,
-k

btw, did you send the message to just me on purpose or by accident? Maybe it would be helpful to have it on the list. If it was by accident, please feel free to send it to the list again.

A bit more background: initially, the log messages were as follows:
    [2013-03-13 21:41:07,724][WARN ][transport.netty ] [elasticsearch-server-8] exception caught on transport layer [[id: 0xc318e992, /10.159.57.146:47178 => /10.13.11.8:9300]], closing connection
    java.io.StreamCorruptedException: invalid internal transport message format
        at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:27)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)

and
    [2013-03-13 21:52:59,528][DEBUG][transport.netty ] [elasticsearch-server-8] disconnected from [[elasticsearch-server-spot-d13631627551][pgwPEMS3ToePDEbOv3bnLQ][inet[/10.159.57.146:9300]]{rack_id=rack_four, aws_availability_zone=us-east-1d, max_local_storage_nodes=1, master=false}]
    [2013-03-13 21:52:59,551][DEBUG][action.admin.cluster.node.info] [elasticsearch-server-8] failed to execute on node [pgwPEMS3ToePDEbOv3bnLQ]
    org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-server-spot-d13631627551][inet[/10.159.57.146:9300]][cluster/nodes/info/n] disconnected

but these subsided after I turned off multicast (which was enabled by default but isn't supported on AWS) and increased my ping interval to 30s (I don't understand the correlation between zen multicast discovery and the "invalid internal transport" messages, though). Now it's just the ping timeouts that are causing disconnections.
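
For anyone else hitting this on EC2, the settings involved are roughly the following (the host list is of course a made-up example):

    # disable multicast discovery (not supported on AWS) and list the nodes explicitly
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3"]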

Thanks :)
