Question regarding cluster setup

Hey guys,
We are attempting to deploy our first cluster after lots and lots of testing. I have a couple of questions regarding a query node (master: false, data: false). Are there any requirements in terms of how powerful the query node has to be? In our case, we have three master/data nodes (1x master: true, data: true; 2x master: false, data: true) which are quite powerful (AWS r3.4xlarge), while the dedicated query node is just a regular XL machine. I wonder if this will cause any issues?

Secondly, what's the performance hit on the master/data nodes from queries? Will it have a significant impact on their performance? We want to keep the impact on indexing as small as possible. Although the three master/data nodes are only at about 50% of their max indexing capacity, I still want to make sure that continuously running queries against the query node won't have a significant impact on the cluster's indexing performance.

I recommend dedicated master nodes (data: false, http.enabled: false) to help manage cluster state. They don't need to be very powerful; you might even be able to get away with a t2.micro for master-only nodes. The dedicated query nodes should be sized as a function of the kinds of aggregations you perform. Data-only nodes will still participate in search in addition to indexing.
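For reference, here's a minimal sketch of the node-role settings involved (setting names as used on the 1.x/2.x line; adjust for your version, and the per-machine grouping is just illustrative):

    # elasticsearch.yml - dedicated master node: cluster state only, no data, no HTTP
    node.master: true
    node.data: false
    http.enabled: false

    # elasticsearch.yml - data-only node: indexes and searches, never becomes master
    node.master: false
    node.data: true

    # elasticsearch.yml - query ("client") node: routes and coordinates requests only
    node.master: false
    node.data: false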

If you'll have time-based indices, I recommend having an index management plan (see Curator) to help with performance in the long term: "optimizing" older indices, closing or reallocating them to slower storage, and defining aliases to prevent unexpectedly querying all indices.
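As a rough illustration of what that boils down to at the API level (the index and alias names here are made up, and _optimize is the 1.x/2.x endpoint; it was later renamed _forcemerge):

    # Point an alias at only the indices you want queried by default
    curl -XPOST 'http://localhost:9200/_aliases' -d '{
      "actions": [
        { "add":    { "index": "logs-2015.06", "alias": "logs-current" } },
        { "remove": { "index": "logs-2015.05", "alias": "logs-current" } }
      ]
    }'

    # Close an old index you no longer need to search
    curl -XPOST 'http://localhost:9200/logs-2015.01/_close'

    # "Optimize" (merge down the segments of) an index that is no longer written to
    curl -XPOST 'http://localhost:9200/logs-2015.05/_optimize?max_num_segments=1'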

It's difficult to recommend cluster specs since it depends a lot on your application. Those are just some of the things I've found to help.

There are no requirements per se, but since the primary responsibility of these nodes is to route shard-level requests and coordinate responses, CPU and memory are the key resources here.

The r3.4xlarge machines are beasts. Have you considered just making all three of these nodes master-eligible (if you do, you absolutely must set discovery.zen.minimum_master_nodes to 2)? With your current setup, the lone master-eligible node is a single point of failure for your cluster.
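That would look something like this on each of the three data nodes (a sketch; minimum_master_nodes must be a majority of master-eligible nodes, here 2 of 3):

    node.master: true
    node.data: true
    discovery.zen.minimum_master_nodes: 2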

The two things that I'd be thinking about are whether or not you should make all the data nodes master-eligible (I'd lean towards "yes" without a very good reason for "no"), and whether or not you really need the complexity of a query node (you'll also see this referred to as a "client node"; confusion abounds). Additionally, you now have a single point of failure for your queries, unless you design your clients to fail over to the cluster if the query node dies. Do you see the complexity adding up now?

They do take some load off of the data nodes on queries, but the requests still ultimately have to land on the shards. The main responsibility that they take away is the coordination of responses (scattering the shard-level requests, gathering the shard-level responses, and sending the final response off to the client). Only you know your workload and whether or not the query load is going to be high enough that you should:

  1. add complexity by adding another node with another role
  2. put all of that responsibility onto a single machine

It depends on your workload. I'm sorry to say that there is only one person that can answer this question and that person is you.

Again, the queries still hit the shards so they still hit the data nodes. The only way to know is to test and measure.

Thanks for the responses, guys. I have run some queries already, but I'm not sure how to quantify the load on the data nodes. Is there a way to quantitatively measure the load? That way I can run the queries with and without the client node to see the difference.

You could start by monitoring CPU usage (both user and system), load averages, and the heap usage of the Elasticsearch process over time, with and without the query node. Of course, you'll also want to measure the impact on indexing throughput (documents per second) and query response times (some percentile, taking care not to fall victim to coordinated omission).
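The stock stats APIs give you most of this if you poll them over time and look at deltas rather than single samples, for example:

    # Per-node OS, process, JVM heap and indexing/search stats
    curl 'http://localhost:9200/_nodes/stats/os,process,jvm,indices?pretty'

    # Quick human-readable view of load and heap across the cluster
    curl 'http://localhost:9200/_cat/nodes?v'

    # Thread pool queues and rejections are a good signal of index/search pressure
    curl 'http://localhost:9200/_cat/thread_pool?v'

Indexing throughput can be derived from the change in indices.indexing.index_total between two samples divided by the sampling interval.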