I've done extensive research on Elasticsearch over the last 18 months and am
currently in the middle of a two-year-old project that has yet to launch but
is set to launch in May. From the initial investigation I discovered that
there are several types of nodes in a cluster (master, client, data) and
these can be combined to build nodes that meet the needs of your cluster.
My initial investigation was over a year ago, and the project keeps evolving
at an incredible rate. Either my understanding of these concepts was
inaccurate from the outset or the concepts have changed since then; either
way, I need some clarification.
Here is my original understanding of each:
- Master: a node that receives all requests for a cluster and manages the
state of the cluster
- Client: a node that makes requests of the cluster and its indices
- Data: a node that stores index data
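For reference, my understanding is that these roles are set per node through
the node.master and node.data booleans in elasticsearch.yml. The stanzas below
are only a sketch of how I read those two settings. Each one would go in a
different node's config file, and the role labels in the comments are my own.

```yaml
# Sketch only: each document (separated by ---) stands for one node's elasticsearch.yml.
# Master-eligible node that holds no index data
node.master: true
node.data: false
---
# Data node that is never elected master
node.master: false
node.data: true
---
# Client-style node: neither master-eligible nor holding data
node.master: false
node.data: false
```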
Based on that, I expected that:
- my master nodes would be configured and available for my clients to
access in order to query the data nodes
- my client nodes would only need to know about my master nodes
- my data nodes would only need to know about my master nodes
I gathered most of this information (possibly inaccurately) from the online
documentation as well as online video explanations like these:
http://www.elasticsearch.org/videos/2010/02/08/es-distributed-diagram.html
http://www.elasticsearch.org/videos/2012/06/05/scaling-massive-elasticsearch-clusters.html
Since then I have gathered more information, and it has left me even more
confused. It seems that, depending on your chosen discovery method, your
knowledge of other nodes is managed differently:
- Multicast: only need to know about the multicast network (the rest is
discovered)
- Unicast: need to know about every node (not really any discovery)
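In config terms, I believe the difference comes down to the zen discovery
settings below. This is a sketch for one node's elasticsearch.yml. The cluster
name and host names are made up for illustration, and whether the unicast list
has to name every node is exactly the part I am unsure about.

```yaml
# Sketch only: cluster name and hosts are placeholders, not my real setup.
cluster.name: my-cluster

# Option A: multicast (enabled by default, as far as I can tell)
# discovery.zen.ping.multicast.enabled: true

# Option B: unicast, with multicast turned off and hosts listed explicitly
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2:9300"]
```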
It also seems that a master node designation only determines which node
keeps the metadata and state information about the cluster and its
indices, and has no bearing on where client nodes should access the
cluster.
It also seems that a client node can access the cluster via any other node
in the cluster (so if your cluster has 5 nodes and a client comes online, it
could access the cluster through any of the 5 servers, maybe even all 5 of
them at some point).
I'm looking for a configuration that offers the most options for vertical
and horizontal scaling, as well as good performance from my client nodes.
I'm currently using the transport client on a unicast cluster, where my
clients are aware of the master nodes and no others. I will have 100
million documents or more across a dozen or more indices.
It has been mentioned that using the NodeBuilder to connect directly to the
cluster would be more performant. It has also been suggested that I set up
multicast instead of unicast, so that a node's configuration needs no
knowledge of the other nodes.
I'm just confused and am looking for clear documentation on how all this
works so I can make an informed decision about how to configure my cluster.
If documentation doesn't exist and someone can just clear it up in this
post, that would be great as well.
Thanks
Wes