I have the exact same question. I've got a cluster with a lot of data nodes
plus two nodes that act as master + client nodes (no data).
For now I'm using those two nodes for both master (shard/cluster
management) tasks and client tasks (query handling).
I've seen a big performance gain when querying the client nodes, compared
to querying my very busy data nodes directly.
But I'd still like to get your view on the hardware requirements of the
master/client nodes. Is RAM important for serving the query results, or are
most RAM-heavy tasks performed by the data nodes? And similarly, is CPU
important on the client nodes?
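For reference, the setup described above (master-eligible, no data) can be sketched as a small role classifier. This is only an illustration: `node.master` and `node.data` are the boolean settings from the Elasticsearch 1.x-era elasticsearch.yml (newer releases use `node.roles` instead), and the `classify_node` helper is hypothetical.

```python
def classify_node(settings):
    """Describe a node's role from its 1.x-style boolean settings.

    `settings` is a plain dict standing in for elasticsearch.yml;
    both keys default to True, as they do in Elasticsearch 1.x.
    """
    master = settings.get("node.master", True)
    data = settings.get("node.data", True)
    if master and data:
        return "master-eligible data node"
    if master:
        return "dedicated master-eligible node (also usable as a client)"
    if data:
        return "data-only node"
    return "client node (routes and merges requests only)"


# The two non-data nodes described in this post.
print(classify_node({"node.master": True, "node.data": False}))
```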
You should really use 3 master nodes if you have a lot of data nodes;
with only 2, losing either one leaves you without a quorum.
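To make the quorum point concrete, here is a small arithmetic sketch. It assumes the usual majority rule, (master-eligible nodes / 2) + 1, which is the value normally given to `discovery.zen.minimum_master_nodes` in Elasticsearch versions of this era; the function names are made up for illustration.

```python
def quorum(master_eligible):
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while a quorum remains."""
    return master_eligible - quorum(master_eligible)


for n in (2, 3):
    print(n, "masters: quorum", quorum(n), "- can lose", tolerated_failures(n))
```

With two masters the quorum is 2, so losing either node blocks master election; with three the quorum is still 2, so one node can fail.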
I've previously run master nodes with 2 vCPUs and 8GB RAM (4GB heap) alongside
40-odd data nodes, with sporadic querying, and had no issues at all. Ultimately
it depends on your use case, but if you are seeing gains with your current
setup, then it makes sense to increase the hardware capabilities of what
you have, compare this to the previous setup, and then make a call.
I was hoping to learn more about the requirements of the client nodes (for
querying only). What work do they actually perform? Do they simply query
the data nodes and merge the results, or is more heavyweight
in-memory aggregation and sorting done on those nodes, requiring RAM and CPU
power?
On 18 November 2014 23:11, Lasse Schou <lasseschou@gmail.com> wrote:
Hi,
There is an excellent question asked about two years ago that was never
properly answered: Redirecting to Google Groups
As kimchy answered, the real problem is that solitary nodes which hold no
shards and are not master do not help much.
The only motivations would be:
- moving HTTP connection management off the data nodes, e.g. when clients
  are slow and arrive in large numbers. The hardware requirements are low as
  long as the network bandwidth is OK and there are plenty of sockets/file
  descriptors available.
- running spare nodes instead of HTTP load balancing with e.g. nginx (nginx
  is better at doing this).
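Since file descriptors are the main resource mentioned above, a quick way to look at the limit a process runs under is sketched below. It uses Python's standard `resource` module (Unix only) and reports the limit of the Python process itself, not of a running Elasticsearch node; the 65536 threshold is just an illustrative assumption, not a value from this thread.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def enough_descriptors(wanted=65536):
    """True if the soft limit allows at least `wanted` descriptors."""
    soft, _hard = fd_limits()
    # RLIM_INFINITY means "no limit".
    return soft == resource.RLIM_INFINITY or soft >= wanted


soft, hard = fd_limits()
print("soft limit:", soft, "hard limit:", hard)
```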
When clients demand huge data sets from such nodes, there might be some
load on them from result aggregation, but that is not a real problem
compared with the heavy-duty nodes that hold the shards. The real load is
where the shards are.
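As a rough picture of the result aggregation a client node does, the sketch below merges per-shard top hits into one global top-N list. It is a simplification under stated assumptions: each shard is assumed to return its hits already sorted by descending score, and the `merge_top_hits` helper and the sample data are hypothetical, not Elasticsearch code.

```python
import heapq
from itertools import islice


def merge_top_hits(shard_results, size):
    """Merge per-shard hit lists (each sorted by descending score)
    and keep the `size` best hits overall."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))


# (score, doc_id) pairs as two data nodes might return them.
shard_a = [(9.1, "a1"), (4.2, "a2"), (1.0, "a3")]
shard_b = [(7.5, "b1"), (6.3, "b2")]

print(merge_top_hits([shard_a, shard_b], size=3))
```

In this simplified picture the merging node only has to hold a handful of hits per shard, which fits the point above that the real load stays on the nodes holding the shards.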