How request flow through cluster

I've done extensive research over the last 18 month on elasticsearch and am
currently in the middle of a 2 year old project that has yet to launch but
is set to launch in May. From the initial investigation I discovered that
there are several types of nodes in a cluster (master, client, data) and
these can be combined to build nodes that meet the needs of your cluster.
Since my initial investigation occurred over a year ago and this project
keeps evolving at an incredible rate I have noticed that my understanding
of these concepts was either inaccurate from the onset or the concepts have
changed enough that either way I need to regain some clarification.

Here are my original understanding of these:

Master: A node that receives all request for a cluster as well as managed
the state of the cluster
Client: A node that makes request of the cluster and it's indices
Data: A node that stores index data

I expected then that my master nodes would be configured and available for
my clients to access in order to query the data nodes.
I expected then that my client nodes would only need to know about my
master nodes
I expected then that my data nodes would only need to know about my master
nodes

I gathered most of this information (possibly inaccurately) from the online
documentation as well as online video explanations like these:

http://www.elasticsearch.org/videos/2010/02/08/es-distributed-diagram.html
http://www.elasticsearch.org/videos/2012/06/05/scaling-massive-elasticsearch-clusters.html

Now I have gotten more information and it has made me even more confused.

It seems that depending on your chosen discovery method your knowledge of
other nodes is managed differently

  1. Multicast - only need to know about the multicast network (the rest is
    discovered)
  2. Unicast - need to know about every node (not really any discovery)

It also seems that a master node designation is only for determining which
node will keep the metadata and state information about the cluster and
it's indices and has no bearing on where client nodes should access the
cluster

It also seems that a client node can access the cluster via any other node
in the cluster (so if your cluster has 5 nodes and a client comes online it
could access the cluster through any of the 5 servers, maybe even all 5 of
them at some point)

I'm looking at an optimal configuration that offers the most options for
vertical and horizontal scaling as well as performance coming from my
client nodes. I'm currently using the transport client on a unicast
cluster, where my clients are only aware of the master nodes and none
other. I will have 100 million documents or more across a dozen or more
indices

It has been mentioned that using the NodeBuilder to connect directly to the
cluster would be more performant. As well, setting up multicast instead of
unicast to offer complete ignorance in the configuration of a node.

I'm just confused and am looking for clear documentation on how all this
works so I can make a clear decision as to how to configure my cluster. If
documentation doesn't exist and someone can just clear it up in this post
that would be great as well.

Thanks
Wes

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,

Am 12.03.13 19:39, schrieb Wes Plunk:

[...]

Now I have gotten more information and it has made me even more
confused.

It seems that depending on your chosen discovery method your knowledge
of other nodes is managed differently

  1. Multicast - only need to know about the multicast network (the rest
    is discovered)
  2. Unicast - need to know about every node (not really any discovery)

It also seems that a master node designation is only for determining
which node will keep the metadata and state information about the
cluster and it's indices and has no bearing on where client nodes
should access the cluster

True.

It also seems that a client node can access the cluster via any other
node in the cluster (so if your cluster has 5 nodes and a client comes
online it could access the cluster through any of the 5 servers, maybe
even all 5 of them at some point)

Correct.

I'm looking at an optimal configuration that offers the most options
for vertical and horizontal scaling as well as performance coming from
my client nodes. I'm currently using the transport client on a unicast
cluster, where my clients are only aware of the master nodes and none
other. I will have 100 million documents or more across a dozen or
more indices

It has been mentioned that using the NodeBuilder to connect directly
to the cluster would be more performant.

No matter if it's a node client or a transport client, both must wait
for the completion of an index/search operation that is distributed to
nodes in the whole cluster. With the node client, you can execute client
commands on the same JVM as a data node. With the transport client, you
can connect from remote, which might be more performant because you can
separate the extra load in your application away from a data node.

As well, setting up multicast instead of unicast to offer complete
ignorance in the configuration of a node.

Multicast is preferable in the sense that you don't need to worry about
how to add nodes to the cluster. It eases the administration. Just set
up a private network and you can add your ES servers there.

I'm just confused and am looking for clear documentation on how all
this works so I can make a clear decision as to how to configure my
cluster. If documentation doesn't exist and someone can just clear it
up in this post that would be great as well.

I had also to experiment a little bit in the beginning and I delved into
the code. The code is what really shines! With background knowledge of
distributed search engine architecture from commercial products and
their shortcomings I found really quick what the real improvements of ES
are: elegant code and for managing distributed JVMs.

The most fascinating thing in ES is that every node is equal by default.
I love this symmetry. You just start nodes and they build a homogeneous
cluster - and each node is holding a full copy of the cluster state.
There is no hierarchy such as master/slave described in literature.
Master/slave bottlenecks on data transport level are simply not
existant. In ES, the "master" is just a "leader" node, that is
responsible to write the state modifications first before it propagates
to the other nodes, with the consequence the leader role can switch to
another node in almost no time. Hence, a "leader" can be taken down
without any precautions, and the cluster continues to run.

With using the cluster state, Index/seach operations are immediately
distributed from the client node to the nodes that hold the involved
shards, i.e. nothing is routed through a single node. Also, unrelated
nodes are not hit by unnecessary load. And that's why ES scales so well :slight_smile:

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I also love the concept of homogeneous nodes in Elasticsearch. Anyone who
has tried to setup up shard MongoDB (or even MySQL) can attest to this
great benefit. But I wonder how much resiliency is lost without having a
coordinator such as ZooKeeper? There still are lingering split brain issues
with Elasticsearch (
minimum_master_nodes does not prevent split-brain if splits are intersecting · Issue #2488 · elastic/elasticsearch · GitHub,
split brain condition after second network disconnect - even with minimum_master_nodes set · Issue #2117 · elastic/elasticsearch · GitHub) that are caused
precisely because there is no authoritative source for master election.
Difficult problem for sure, but Elasticsearch works like a dream 99.9% of
the time. :slight_smile:

I would love to see thorough documentation and use-cases for using separate
client/data nodes in large clusters.

--
Ivan

On Tue, Mar 12, 2013 at 1:17 PM, Jörg Prante joergprante@gmail.com wrote:

The most fascinating thing in ES is that every node is equal by default. I
love this symmetry. You just start nodes and they build a homogeneous
cluster - and each node is holding a full copy of the cluster state. There
is no hierarchy such as master/slave described in literature. Master/slave
bottlenecks on data transport level are simply not existant. In ES, the
"master" is just a "leader" node, that is responsible to write the state
modifications first before it propagates to the other nodes, with the
consequence the leader role can switch to another node in almost no time.
Hence, a "leader" can be taken down without any precautions, and the
cluster continues to run.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I'm pretty sure that at some point in near future we will see more
leader election algorithms in ES. Looking at
Leader election - Wikipedia there are some options.

Jörg

Am 12.03.13 23:15, schrieb Ivan Brusic:

I also love the concept of homogeneous nodes in Elasticsearch. Anyone
who has tried to setup up shard MongoDB (or even MySQL) can attest to
this great benefit. But I wonder how much resiliency is lost without
having a coordinator such as ZooKeeper? There still are lingering
split brain issues with Elasticsearch
(minimum_master_nodes does not prevent split-brain if splits are intersecting · Issue #2488 · elastic/elasticsearch · GitHub,
split brain condition after second network disconnect - even with minimum_master_nodes set · Issue #2117 · elastic/elasticsearch · GitHub) that are
caused precisely because there is no authoritative source for master
election. Difficult problem for sure, but Elasticsearch works like a
dream 99.9% of the time. :slight_smile:

I would love to see thorough documentation and use-cases for using
separate client/data nodes in large clusters.

--
Ivan

On Tue, Mar 12, 2013 at 1:17 PM, Jörg Prante <joergprante@gmail.com
mailto:joergprante@gmail.com> wrote:

The most fascinating thing in ES is that every node is equal by
default. I love this symmetry. You just start nodes and they build
a homogeneous cluster - and each node is holding a full copy of
the cluster state. There is no hierarchy such as master/slave
described in literature. Master/slave bottlenecks on data
transport level are simply not existant. In ES, the "master" is
just a "leader" node, that is responsible to write the state
modifications first before it propagates to the other nodes, with
the consequence the leader role can switch to another node in
almost no time. Hence, a "leader" can be taken down without any
precautions, and the cluster continues to run.

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.