Elasticsearch cluster setup suggestion required

Hi,
I have a cluster setup of 2 data nodes and 1 master node, i.e. only one master-eligible node.
I handle about 400GB of data.

All our search and indexing requests are sent to the master node.
This caused some RAM issues on the master while searching.

I need suggestions:

  1. Should we have one coordinating node and route all requests to it?
  2. Is one coordinating node a better choice, or two?
    2.1 To avoid split brain we want to have 3 master-eligible nodes. Which ones should they be?
  3. Is it a good idea to make a data node master-eligible?
  4. Is it a good idea to send search requests to the data nodes?
  5. While initializing the RestSearchClient, which hostnames should we specify for search?
    5.1 My guess was the 2 coordinating node hostnames.

In any cluster you generally want to have 3 master-eligible nodes (with minimum_master_nodes set to 2 to avoid split brain scenarios). This will allow the cluster to stay responsive even if it loses one node.
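On the pre-7.0 versions this thread is about (where minimum_master_nodes still exists), a dedicated master-eligible node is typically configured along these lines in elasticsearch.yml; this is a sketch for a 3-master setup, not a complete config:

```yaml
# elasticsearch.yml on a dedicated master node (pre-7.0 style settings)
node.master: true    # master-eligible
node.data: false     # holds no data
node.ingest: false   # runs no ingest pipelines

# quorum of master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```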

If you have a dedicated master node, which generally is less powerful than the other nodes, it should be left to just managing the cluster and not serve traffic.

Whether coordinating-only nodes help or not depends on the use case and workload.

If you have a single coordinating-only node that you send all requests to, this becomes a single point of failure, so 2 is generally better (if you need them at all).
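For reference, a coordinating-only node is just a node with every role switched off (again pre-7.0 style settings, sketched rather than a full config):

```yaml
# elasticsearch.yml on a coordinating-only node:
# not master-eligible, no data, no ingest -- it only routes and merges requests
node.master: false
node.data: false
node.ingest: false
```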

It is important to have 3 master-eligible nodes, so unless you have 3 dedicated master nodes this is generally a good idea.

Often this is fine, but as I mentioned earlier it depends on the use case.

Coordinating-only nodes if you have them in the cluster, otherwise data nodes.
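For question 5, a minimal sketch of pointing the Java high-level REST client at the two coordinating nodes (the hostnames are placeholders; the client round-robins across the hosts it is given):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class SearchClientFactory {

    // Placeholder hostnames for the two coordinating nodes.
    public static RestHighLevelClient create() {
        return new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("coord-node-1.example.com", 9200, "http"),
                        new HttpHost("coord-node-2.example.com", 9200, "http")));
    }
}
```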


Thanks @Christian_Dahlqvist. Appreciate the inputs.

With your inputs we have a plan:
1 Master node
2 Coordinating nodes (master-eligible)
2 Data nodes (one replica, RAID-0)
Avg doc size = 4KB
Total docs = 250 million
One index.
Currently 6 shards is optimal for our use case.
We use routing while indexing and searching.
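(For anyone reading along: routing on both the index and the search side with the Java high-level REST client looks roughly like this; the index name, routing key and field are placeholders.)

```java
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class RoutingExample {

    public static void buildRequests() {
        // Index with a routing value so the document lands on a single shard...
        IndexRequest indexRequest = new IndexRequest("my_index", "_doc", "doc-1")
                .routing("customer-42")
                .source("{\"field\":\"value\"}", XContentType.JSON);

        // ...and search with the same routing value so only that shard is queried.
        SearchRequest searchRequest = new SearchRequest("my_index")
                .routing("customer-42")
                .source(new SearchSourceBuilder()
                        .query(QueryBuilders.matchQuery("field", "value")));
    }
}
```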

A few more queries:

  1. Can we increase search throughput by adding more data nodes and more replicas?
  2. Would this infrastructure have an impact on indexing throughput?
  3. Does this setup look good to you?

Yes, that is generally the way to increase search throughput.
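For example, once extra data nodes have joined, the replica count can be raised dynamically; a sketch with the Java high-level REST client (index name and replica count are placeholders):

```java
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class ReplicaSettings {

    // Raise number_of_replicas after the new data nodes are in the cluster,
    // so the additional copies have somewhere to be allocated.
    public static void addReplicas(RestHighLevelClient client) throws Exception {
        UpdateSettingsRequest request = new UpdateSettingsRequest("my_index")
                .settings(Settings.builder().put("index.number_of_replicas", 2));
        client.indices().putSettings(request, RequestOptions.DEFAULT);
    }
}
```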

Not sure what you mean. Adding additional replicas will however have an impact on indexing load.

Assuming you have 3 master-eligible nodes it sounds reasonable, but I do not know as I am not familiar with your data, query load or requirements.


Thanks again. One last thing I was wondering about.

In the past we faced an out-of-memory kill from the kernel, which killed the Java process on the master node, and we had to allocate more RAM to the OS to continue operations. This was when we were directing requests to the master.

Query: why was the OS memory on the master node being used up, when the master does not perform Lucene operations to cache segments? All my data operations happen on the data nodes; the master node is just scattering and gathering docs.

Parsing and coordinating requests can consume a fair bit of heap as well. How much load do you serve and how much heap did you have assigned?

Our master node was an 8-core, 32-bit machine.
The allotted heap space was the recommended 50%, i.e. 16GB.

The space available to the OS was the remaining 16GB.
We were serving 3000 ES requests per second, that's about 50MB of docs per second.

The kernel's OOM killer killed our master node's Java process and the cluster went down.
Note: Java was not out of heap. The OS ran out of memory and killed Java to free up space.

Our solution:
We upgraded the existing machine to 16 cores and 64GB.
The allotted heap space is now 16GB and the OS gets the remaining 48GB to play with.
Now things are good.

Query again:
Does the master do any kind of Lucene-related operations? AFAIK it just does QUERY and FETCH.

If you are running on a machine with a 32-bit architecture I believe you must have been running with a much smaller heap. Or was that a typo?

I would recommend installing monitoring (if you do not already have it) so you can see how heap usage varies over time.
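Until proper monitoring is in place, heap usage can also be checked ad hoc from the nodes stats API; a sketch using the low-level Java REST client (the hostname is a placeholder):

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class HeapCheck {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("coord-node-1.example.com", 9200, "http")).build()) {
            // The JVM section of the nodes stats API reports heap used/max per node.
            Response response = client.performRequest(new Request("GET", "/_nodes/stats/jvm"));
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```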


OMG, blunder. My mistake. I meant to say 32GB there. We use all 64-bit architecture machines.
