Oops, replied to you directly by mistake. I've added this to the main thread.
I didn't know about the low, medium and high connections. That's awesome.
I'll tweak that around to figure out the sweet spot.
I ran some tests and garbage collection does seem to be the problem. That's
strange, since my heap fills quite slowly. Also, GC load is uniformly high;
shouldn't there be spikes only when the heap is being cleared out? The thread
pool counts seem fine to me, so I don't think that's the issue.
I optimized a few queries, added a couple of nodes, and closed a few really
small indices, and things are a bit better now, though not ideal. I see your
point regarding the 30s ping timeout, but I'm okay with that delay, at least
for the moment. My bigger concern is making sure that nodes communicate
with each other and don't return "invalid internal transport format"
messages. Thankfully, those have ceased to occur, though I can't for the life
of me figure out why they occurred in the first place.
Thanks a lot! This is very useful to me.
On Fri, Mar 15, 2013 at 2:36 PM, Kay Röpke firstname.lastname@example.org wrote:
On Mar 14, 2013, at 10:12 PM, Govind Chandrasekhar email@example.com wrote:
Thanks for that. Really insightful. A lot of my understanding of
ElasticSearch is fragmented and from the forum/code-base, so it's good to
get a consolidated explanation of the system.
So I'm assuming these regular pings happen over port 9300? Do they
"compete" with all the other traffic that runs between these nodes? The
problem I'm facing is that nodes get disconnected every once in a while due
to ping timeouts. I suspect that this might be due to heavy traffic on 9300
during merges / relocation / recovery / GC. I wonder if there's a way to
route these pings over a different port so that delayed ping responses
aren't misinterpreted as faults.
Yes, everything between the nodes happens on that port. I suppose one
could use a different transport plugin with a different mechanism for the
inter-node communication. But I doubt this is actually a problem.
It's hard to tell if you have too much traffic between the nodes, but
don't be misled by the fact that it listens on one port only. The Netty
transport actually opens more than one connection between the nodes, and
each node has different thread pools to handle different kinds of messages.
See NettyTransport.java for the connection logic (specifically
NodeChannels static class), TransportRequestOptions.java for the three
different priority connection types (LOW, MED, HIGH).
The number of connections per priority type is configurable.
Unfortunately none of that seems to be documented; it's "only" in the code.
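For reference, the setting keys I remember from NettyTransport look roughly like this; the defaults below are from memory, so double-check them against the code for your version:

```yaml
# elasticsearch.yml -- connections per priority type (defaults from memory)
transport.connections_per_node.low: 2    # low-priority / bulk traffic
transport.connections_per_node.med: 6    # regular requests
transport.connections_per_node.high: 1   # high-priority messages such as pings
```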
I'm not sure if there's any kind of monitoring around the utilisation of
the connection types, but I have never seen it.
AFAICS there are a couple of things that might influence your problem,
assuming you've ruled out garbage collection as the cause.
Either your servers have saturated the worker thread pools so that
messages are queueing up.
Or your network is actually slow at some point, which I doubt.
If it's the latter, have a look at the various net.tcp tuning parameters in
ElasticSearch. If it's the former, have a look at the worker thread pool
counts in the NettyTransport code.
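To give you an idea of the TCP-level knobs I mean, here's a sketch; the setting names are from the ElasticSearch network settings, but the values are purely illustrative, so treat them as starting points to tune rather than recommendations:

```yaml
# elasticsearch.yml -- TCP tuning sketch; values are illustrative only
network.tcp.no_delay: true              # disable Nagle's algorithm
network.tcp.keep_alive: true            # detect dead peers at the TCP level
network.tcp.send_buffer_size: 128kb     # illustrative buffer sizes
network.tcp.receive_buffer_size: 128kb
```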
If you set the ping interval to 30s, and have the default of 3 retries
set, be aware that it might take up to 90s to actually detect a real
failure. That may or may not be acceptable in your case. But usually if
something goes wrong you want to have the system react quicker than 1.5
minutes. But YMMV.
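To make that arithmetic concrete, the zen fault-detection settings look roughly like this (setting names as I remember them from the fault-detection code; verify them for your version):

```yaml
# elasticsearch.yml -- zen fault detection sketch
discovery.zen.fd.ping_interval: 30s   # your increased interval
discovery.zen.fd.ping_timeout: 30s    # how long to wait for each ping
discovery.zen.fd.ping_retries: 3      # the default
# worst case before a node is declared dead:
# ping_retries * ping_timeout = 3 * 30s = 90s
```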
As to the "invalid internal transport message format" error, I've only ever
seen that when a node in the cluster is running an incompatible version of
ElasticSearch.
Sometimes this means that even a different minor version can lead to an
incompatible message format.
Why there is no version check for each and every message is beyond me,
especially since it's so easy to do, and it could be really efficient, too.
The error message at least is not really helpful.
Therefore double-check your installed versions.
Note that the version problem also affects your clients if you are using
the node/transport client mode of the Java client. JSON over HTTP is
unaffected of course.
btw, did you send the message to just me on purpose or by accident? maybe
it would be helpful to have it on the list.
if it was by accident, please feel free to send it to the list again.
A bit more background: initially, the log messages were as follows:

[2013-03-13 21:41:07,724][WARN ][transport.netty]
[elasticsearch-server-8] exception caught on transport layer [[id:
0xc318e992, /10.159.57.146:47178 => /10.13.11.8:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message

[2013-03-13 21:52:59,528][DEBUG][transport.netty]
[elasticsearch-server-8] disconnected from
aws_availability_zone=us-east-1d, max_local_storage_nodes=1, master=false}]
[elasticsearch-server-8] failed to execute on node [pgwPEMS3ToePDEbOv3bnLQ]
but these subsided after I turned off multicast (which was enabled by
default but isn't supported on AWS) and increased my ping interval to 30s
(I don't understand the correlation between zen multicast discovery and the
"invalid internal transport" messages, though). Now it's just the ping
timeouts that are causing disconnections.
On Wednesday, March 13, 2013 5:55:28 PM UTC-7, Kay Röpke wrote:
Since I had to dive into that code at some point to answer the same
questions, here's what I remember:
Firstly, discovery and fault detection are really two different things,
meaning that the fact that you used Zen Discovery to figure out which nodes
exist doesn't affect what's happening during the fault detection phase.
After the initial round of discovery and master election, the master will
ping each node using the configured values of ping interval, retries, and
ping timeout. Those are 1sec, 3 retries, and 30 seconds respectively.
That code is in NodesFaultDetection.java.
Each node in the cluster knows which other nodes exist in the cluster, and
which node is supposed to be the master. Thus each node keeps track of the
master and pings it with the same config values.
This code is in MasterFaultDetection.java.
I actually added log statements to the code to figure out what's going on.
The most surprising thing was that only the master pings each node in the
cluster (except itself of course). The other nodes only ping the master
and rely on published cluster state from the master (or the newly elected
master in case of a partition).
In case you are wondering how master election works, it's really simple.
Each node has assigned itself a random identifier at startup, and each node
keeps a list of all known identifiers in the cluster. That list is sorted
lexically, and the first alive node listed is the new master. Since all
nodes use the same list (it's part of the cluster state) and the same sort
order this works.
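The election rule is simple enough to sketch in a few lines of Java. Note that the class and method names here are illustrative, not the actual ElasticSearch code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of the election rule described above: sort the
// known node ids lexically and pick the first one that is still alive.
public class ElectMaster {
    static String electMaster(List<String> knownNodeIds, Set<String> aliveNodeIds) {
        TreeSet<String> sorted = new TreeSet<>(knownNodeIds); // lexical order
        for (String id : sorted) {
            if (aliveNodeIds.contains(id)) {
                return id; // first alive id in sort order wins
            }
        }
        return null; // no live nodes known
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("node-c", "node-a", "node-b");
        // node-a is dead, so node-b (next in lexical order) becomes master
        Set<String> alive = new HashSet<>(Arrays.asList("node-b", "node-c"));
        String master = electMaster(known, alive);
        if (!"node-b".equals(master)) throw new AssertionError(master);
        System.out.println("elected master: " + master);
    }
}
```

Since every node sorts the same list (it's part of the cluster state), they all independently arrive at the same answer without any extra coordination.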
However, in split brain situations, this leads to problems, because two
nodes might rightfully think they are the new master.
That is especially bad in case you use unicast discovery, because I've
found that ElasticSearch will make no attempt to reconnect to nodes it
deems to be dead after a partition if you haven't configured
discovery.zen.minimum_master_nodes correctly (especially if you left it at
the default of 1, because then each node would be happy to form its own
cluster).
I might be wrong about the last part for higher values of
minimum_master_nodes; the code does look like it attempts to reconnect
if it has completely lost the connections.
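For what it's worth, the usual guard against that split-brain scenario is requiring a majority quorum of master-eligible nodes; with three master-eligible nodes that works out to:

```yaml
# elasticsearch.yml -- require a majority of master-eligible nodes
# (number_of_master_eligible_nodes / 2) + 1, here for 3 nodes
discovery.zen.minimum_master_nodes: 2
```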
However, and this is the problematic part we've seen before, if your split
brains and ping timeouts happen because nodes run into long garbage
collection pauses without actually hitting OOM errors, nodes might
intermittently see each other and end up being considered part of two
sides of the cluster at once. This leads to really funky situations where
the cluster can be non-functional, making no progress whatsoever, yet still
appear partially "online".
So far I've only seen that with extreme garbage collection problems, and
not with actual network level partitions. But those are rare anyway and
cause all kinds of other problems, so it might've been masked.
That's what I remember from it, not sure what problem you are actually
trying to solve here.
Did you find the answer to your question? I'm curious about this too.
On Monday, October 8, 2012 12:54:44 PM UTC-7, arta wrote:
I'm having problems where sometimes the ES cluster thinks it has lost one of
its nodes while it is still there.
As a part of the investigation, I'm monitoring multicast ping on
I see pings when a node starts, but after that I don't see periodical pings.
My guess is that once a node is detected, port 9300 is used for a heartbeat or
similar mechanism to detect the existence of the node, but it is just a guess.
Can anybody give me an outline of how zen discovery and fault detection work?
Or if you could point me in the direction of any guide to the different
states/phases of zen discovery it would be greatly appreciated.
Sent from the ElasticSearch Users mailing list archive at Nabble.com.
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to firstname.lastname@example.org.
For more options, visit https://groups.google.com/groups/opt_out.