A node in a cluster can be configured to serve exclusively as a data,
master, or "client" node. If a node's role is decided a priori, I suspect one
should tweak its hardware (more RAM, less CPU, etc.) to fit. What would be
the (relative) recommendations per role?
data nodes (not master eligible, not serving requests):
  - lots of RAM for faceting and sorting
  - lots of CPU for indexing, analyzing and querying
master nodes (no data, not serving requests):
  - seems it's not doing much... lazy master nodes!
client nodes (not master eligible, no data):
  - lots of CPU for aggregating and serving lots of parallel requests
  - is RAM still important here (is it affected by faceting and sorting?)
If there are several master-only nodes, are they all idle except for one at
any given time?
Does anyone have experience in deploying a cluster with "load-balancing"
clients for serving requests?
Yes, I'm aware that all nodes are equivalent by default (and they elect a
master node themselves), but by changing the default settings, you can make
a node not master eligible (node.master=false or node.client=true) and you
can decide whether a node has data (node.data).
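To make the combinations concrete, here is a rough sketch that maps each role to the node.master and node.data settings mentioned above. The role names and the render_settings helper are made up for illustration, and the "client" role is written as the explicit pair of settings rather than node.client=true, on the assumption that the two are equivalent in the ES versions discussed here.

```python
# Hypothetical mapping from a role name to the node.* settings discussed
# in this thread. The role names are illustrative, not ES terminology.
ROLE_SETTINGS = {
    "default": {"node.master": True, "node.data": True},    # all nodes equivalent
    "data": {"node.master": False, "node.data": True},      # holds data, never master
    "master": {"node.master": True, "node.data": False},    # master eligible, no data
    "client": {"node.master": False, "node.data": False},   # neither: routes requests only
}


def render_settings(role):
    """Return the settings for a role as 'key: value' lines."""
    settings = ROLE_SETTINGS[role]
    return "\n".join(
        f"{key}: {str(value).lower()}" for key, value in sorted(settings.items())
    )


for role in ("data", "master", "client"):
    print(f"# {role} node")
    print(render_settings(role))
```

The printed lines are meant to be pasted into each node's configuration file; check the setting names against the documentation for the ES version in use.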
Using node.data=false and node.client=true, you're effectively creating a
node that will not serve indices directly, but will redirect requests to
"data nodes" and aggregate the results. I'm wondering if there are any
advantages in creating such nodes. For example, does this change the
requirements on hardware (requires less RAM, no disk access, etc.)? If so,
one can create a cluster topology and scale data nodes and client nodes
independently.
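The "redirect and aggregate" step can be pictured with a small sketch. This is not how ES implements it; it only assumes that each data node returns its own hits already sorted by descending score, and that the client node merges those lists into a global top N. The function name, the (score, doc_id) tuples, and the sample data are all hypothetical.

```python
import heapq
from itertools import islice


def merge_top_hits(per_node_hits, size):
    """Merge per-node hit lists (each sorted by descending score) into a global top `size`."""
    merged = heapq.merge(*per_node_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))


# Hypothetical (score, doc_id) results returned by three data nodes.
node_a = [(9.1, "a1"), (4.2, "a2")]
node_b = [(7.5, "b1"), (6.0, "b2")]
node_c = [(8.3, "c1")]

print(merge_top_hits([node_a, node_b, node_c], size=3))
```

The merge itself is cheap compared to the per-node query work, which is worth keeping in mind when sizing hardware for a node that only does this step.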
Maybe this only introduces additional complexity, but for cloud-based
solutions such as EC2, it may be interesting to have different types of
nodes to have greater flexibility for choosing instance types.
Why do you think that adding a separate no-data node would be
beneficial? What would be the advantage over directing the queries
directly to a data node? The data node needs to process the query
regardless.

On Sat, Jan 7, 2012 at 14:48, Shay Banon kimchy@gmail.com wrote:
Having just "load balancing nodes" (non master, non data) will not help
that much. I have seen cases where such a node was run locally alongside the
relevant client code for HTTP access, since the client connected over loopback
and ES had better network handling for remote access to nodes. "Just" data
nodes still serve requests, even if they are coming from client nodes.
Dedicated master nodes can come in handy in certain situations. For very
large clusters they can help (e.g. 200 data nodes with 3 "eligible" master
nodes).

I was wondering if putting several client-only nodes "in front" of the data
nodes would be beneficial by offloading some work from data nodes. These
nodes would act like load-balancing nodes and maybe would require a different
type of hardware (less RAM, more CPU, no disk access, for example).
The data nodes still have to process the query, but they wouldn't have to
aggregate the results.
In your example, how does having 3 "master eligible" nodes help in a 200
node cluster? Is this for avoiding split brain situations?
Thanks,
Philippe
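
On the split-brain question, the arithmetic usually behind it can be sketched as follows. This assumes a master may only be elected by a strict majority of the master-eligible nodes; the thread does not say how ES actually runs its election, so treat the rule and the function names as assumptions for illustration.

```python
def majority(master_eligible):
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible // 2 + 1


def can_elect(partition_size, master_eligible):
    """True if a network partition holds enough master-eligible nodes to elect a master."""
    return partition_size >= majority(master_eligible)


# With 3 master-eligible nodes, a 2/1 network split leaves only one side
# able to elect a master, so two masters cannot coexist under this rule.
for reachable in (1, 2, 3):
    print(reachable, can_elect(reachable, master_eligible=3))
```

Under that assumption, an odd number of master-eligible nodes such as 3 tolerates the loss of one node while still ruling out two competing masters.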