According to this post,
Generally speaking, you'll receive the optimal performance by using the same number of shards as nodes.
I have three dedicated master nodes and one data node in my cluster.
So I have four nodes in my cluster.
Does this mean I should set
number_of_shards to 4?
Or is it safe to ignore the data node and set
number_of_shards to 3?
Another similar question is about how to determine the value of
The formula for determining the value to use is: N / 2 + 1, where N is the total number of nodes in your cluster.
Should the value be 3 since there are four nodes in my cluster?
Or is it safe to ignore the data node and set the value to 2?
With only one data node, there's no reason to have 3 dedicated master nodes. You should have that single data node be both a master and a data node.
The reasoning behind 3 dedicated masters is so that your cluster state is still guaranteed by quorum in the event that one is brought down. Hence the formula,
N / 2 + 1, which is only for master eligible nodes, not including data nodes. In that sense, the linked post is incorrect. That would only be true if every node in the cluster is both a master node and a data node. We do not recommend that approach, and only recommend 3 total master nodes, except in extreme cases.
If you plan on adding data nodes in the future, then keep your 3 dedicated nodes. But if you only have a single node, then there's no reason for multiple masters, as there's only one node's worth of state to keep.
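To make the arithmetic concrete, here is a minimal Python sketch of the N / 2 + 1 rule, counting only master-eligible nodes as described above. The function name `minimum_master_nodes` is made up for illustration, and the sketch assumes the division is integer division.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum size for N master-eligible nodes: N / 2 + 1 (integer division).

    Data-only nodes are not counted, per the answer above.
    """
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# 3 dedicated masters + 1 data-only node -> only 3 nodes count toward quorum.
for n in (1, 3):
    print(f"{n} master-eligible node(s) -> quorum of {minimum_master_nodes(n)}")
```

So for the cluster in the question (three dedicated masters, one data-only node) the value works out to 2, and for a single master-eligible node it is 1.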
The reason I created three dedicated master nodes is the following passage from this doc.
By default, every Elasticsearch node is configured to be a "master-eligible" data node, which means they store data (and perform resource-intensive operations) and have the potential to be elected as a master node. For a small cluster, this is usually fine; a large Elasticsearch cluster, however, should be configured with dedicated master nodes so that the master node's stability can't be compromised by intensive data node work.
I was afraid of what's going to happen when my app's traffic increases in the future so I decided to use dedicated master nodes as described from the start.
If I don't have to use dedicated master nodes, how should I handle the incoming sudden increase of data input?
Let's say I create three master-eligible data nodes as my new cluster and my app's searching data input soars at one point.
Do I need to add more master-eligible data nodes or just data nodes in this case?
First and foremost: Your master nodes should never be used as client nodes. They should not receive any indexing or search traffic whatsoever, under any circumstances. They should only ever be nodes that watch and manage the cluster state.
Three shall be the number of the counting, and the number of the counting shall be three. Four shalt thou not count, neither shalt thou count two, excepting that thou then proceed to three...
With your dedicated masters configured as such, they will have absolutely no effect on the ability to handle burst traffic.
Your data nodes, especially while you have only 1, should act as both client and data nodes. Perform all indexing operations and searching operations using its IP address. Your "burst" traffic is handled exclusively by your client and/or data nodes. If you cannot sustain the traffic you're receiving, you should add data nodes first, eventually adding dedicated client nodes, and potentially considering increasing shard count. Those decisions should be planned for in the beginning, if at all possible, as some changes are harder to make after things get busy and crowded in your cluster.
There is a lot of context not included in this admonition, but this is a good start.
So I guess having three dedicated master nodes and one data node when initiating a cluster isn't a bad idea if I haven't misread your answer. Please correct me if I'm wrong.
Just one more question back from the first question.
What should be the value of
Should it be 4 or 3 in a situation where I have three dedicated master nodes and one data node?
Nope, it isn't a bad idea. If you plan to grow, then this is a good start.
That's the difficulty with your situation. With no context or understanding of your current or anticipated data flow, to say nothing of your data retention plans, it is very hard to anticipate what you'll need here.
Are you planning on storing time series data (e.g. logs, metrics, etc.)? If so, how long do you plan to keep indices? How do you plan to roll-over data? Daily indices? Monthly?
I'm not planning on storing time series data for now.
What is your data model, then? What is a burst of activity going to look like for you?
My data model is a cluster in which there are three dedicated master nodes and one node initially that can expand by adding data nodes when the data capacity nears the limit.
My scenario for a burst of activity is the same: the situation where the data capacity nears the limit.
I'm sorry for not being clear. What kind of data? How is it being ingested? Why is it potentially coming in bursts? If it's not time-series data, why is ingestion not constant?
I have a web application that has a search engine (Elasticsearch).
So searching functionality is all I need from Elasticsearch.
Users of my web app sign up their names and the names are searchable using Elasticsearch.
I believe the potential burst (a sudden increase of data input) will come when my web app gets popular and new users flood in to sign up their names, etc.
I want to be prepared for that situation to come.
I'm sorry I don't understand your last question of "why is ingestion not constant?".
If this is the case, what kind of traffic are you preparing for? How many new users per hour (peak)? How many searches per hour/min/sec?
Out of the box, a single node will handle a lot of traffic, no need for multiple masters. You can add more masters and make API call changes to update things (minimum_master_nodes) on the fly when that day arrives. This is the beauty of Elasticsearch: you scale it to meet the needs you have. After hearing this, I'd start with a single node until you have enough traffic, e.g. 50% of the maximum load a single node can handle, before adding more nodes.
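As a rough sketch of what that "on the fly" change could look like, the Python below only builds the method, path, and JSON body for the cluster settings API; it does not send anything. It assumes a pre-7.x cluster where `discovery.zen.minimum_master_nodes` still exists, and the helper name `minimum_master_nodes_request` is hypothetical. Whether to use `persistent` or `transient` is also a choice made here for illustration.

```python
import json


def minimum_master_nodes_request(master_eligible: int):
    """Build (method, path, body) for updating the quorum setting.

    Assumption: the cluster is a version that still uses
    discovery.zen.minimum_master_nodes (it was removed in 7.0).
    """
    quorum = master_eligible // 2 + 1  # N / 2 + 1 over master-eligible nodes
    body = {"persistent": {"discovery.zen.minimum_master_nodes": quorum}}
    return "PUT", "/_cluster/settings", json.dumps(body)


method, path, body = minimum_master_nodes_request(3)
print(method, path)
print(body)
```

You would send that body with whatever HTTP client you already use, after the new master-eligible nodes have joined.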
So would you recommend I create only one master-eligible data node initially and as the traffic grows, I add more data nodes?
But every doc I've read on Elasticsearch says I should have at least three nodes to start.
Unless you're starting with at least 2 data nodes, there's no reason to start out with 3 masters. The reason is that there is no redundancy of data, so there's no need for redundancy of masters. You can't even spread the load without multiple data nodes, and there would be no replica shards, so your cluster would always be in the yellow state.
Three master nodes is the way to start, but only if you're building a full cluster, which at minimum is 3 master nodes plus at least 2 data nodes. As mentioned, a single data node is not able to form a fully redundant cluster because there's no place for replicas. If you aren't going to have at least 2 data nodes to start, there's no reason to have more than 1 node anyway.
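The replica argument can be put into a few lines of Python. This is a deliberately simplified model, not how Elasticsearch computes health: it assumes every index uses the same `number_of_replicas`, that all data nodes are up, and it relies only on the rule that a replica cannot live on the same node as its primary. The function name `expected_health` is made up for illustration.

```python
def expected_health(data_nodes: int, number_of_replicas: int) -> str:
    """Simplified guess at cluster health from node and replica counts."""
    if data_nodes < 1:
        return "red"  # nowhere to put even the primary shards
    # Each shard needs one node per copy: the primary plus every replica.
    copies_per_shard = number_of_replicas + 1
    return "green" if data_nodes >= copies_per_shard else "yellow"


print(expected_health(data_nodes=1, number_of_replicas=1))  # replica unassigned
print(expected_health(data_nodes=1, number_of_replicas=0))  # nothing to assign
print(expected_health(data_nodes=2, number_of_replicas=1))  # replica has a home
```

Under this model a single data node with one replica configured stays yellow, which is the point being made above.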
My cluster design: Please tell me if there is anything I should change.
Start with only one master-eligible data node and add more data nodes when needed.
number_of_shards to 1,
number_of_replicas to 1, and
discovery.zen.minimum_master_nodes to 1.
number_of_replicas set to 1 will still not make any replicas if you only have one data node. You can change the number of replicas for indices on the fly with an API call (which is also in Elasticsearch Curator), so in order to maintain a green cluster state, I'd set
number_of_replicas to 0 while you only have one data node.
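For illustration, here is a small Python sketch that builds the index settings request for changing the replica count. It only constructs the method, path, and JSON body and does not contact a cluster. The index name `users` and the helper `replica_settings_request` are hypothetical, and going to 1 replica once a second data node exists is an assumption, not something prescribed above.

```python
import json


def replica_settings_request(index: str, data_nodes: int):
    """Build (method, path, body) for the update index settings API."""
    # With one data node a replica can never be assigned, so ask for none.
    replicas = 0 if data_nodes < 2 else 1
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"/{index}/_settings", json.dumps(body)


print(replica_settings_request("users", data_nodes=1))
print(replica_settings_request("users", data_nodes=2))
```

When the second data node joins, sending the second request would raise the replica count without reindexing.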