Effect of cores on searching/indexing

I need some input on how searching and indexing are affected by the following, with respect to data nodes:

What is the difference between having 2 data nodes with 16 cores each
and 4 data nodes with 8 cores each? In both scenarios I have 32 cores in all.

Why this question:
I have 2 data nodes with 16 cores each. Because of the number of in-memory terms created, "memory_in_bytes" overshoots and trips the circuit breaker. My currently allocated heap size is 10 GB.

I want to make sure the circuit breaker does not trip by spreading the data stored on 2 nodes across 4 nodes, without affecting the overall searching/indexing throughput with respect to the number of cores available.
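
For anyone wanting to see which breaker is tripping and how close each one sits to its limit: the node stats API (`GET _nodes/stats/breaker`) reports an estimated size and a limit per breaker. Below is a minimal Python sketch, assuming the response has already been fetched and parsed into a dict. The `breakers_near_limit` helper, the 0.8 threshold, and the sample numbers are all invented for illustration.

```python
def breakers_near_limit(node_stats, threshold=0.8):
    """Return (node_name, breaker_name, usage_ratio) for breakers at or above threshold.

    node_stats is the parsed JSON body of GET _nodes/stats/breaker.
    """
    hits = []
    for node in node_stats.get("nodes", {}).values():
        for name, breaker in node.get("breakers", {}).items():
            limit = breaker.get("limit_size_in_bytes", 0)
            if limit <= 0:
                continue  # skip breakers without a positive limit
            ratio = breaker.get("estimated_size_in_bytes", 0) / limit
            if ratio >= threshold:
                hits.append((node.get("name"), name, round(ratio, 2)))
    # Highest usage first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Invented sample in the shape of the node stats response (not real data).
sample = {
    "nodes": {
        "abc123": {
            "name": "data-1",
            "breakers": {
                "parent": {
                    "limit_size_in_bytes": 7_000_000_000,
                    "estimated_size_in_bytes": 6_510_000_000,
                },
                "fielddata": {
                    "limit_size_in_bytes": 4_000_000_000,
                    "estimated_size_in_bytes": 1_000_000_000,
                },
            },
        }
    }
}

for node_name, breaker_name, ratio in breakers_near_limit(sample):
    print(f"{node_name}: {breaker_name} breaker at {ratio:.0%} of its limit")
```

Knowing which breaker is close to its limit (parent, fielddata, request, ...) helps narrow down what is actually consuming the heap.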

3 master, 2 data, 2 coord nodes.
650 indexes and 2 replicas.
Total data size is about 1 TB+.

Note: All disk space saving tips have been followed and the mappings are optimized.

Well, 4 nodes with 10G heap on each would be better than 2 nodes with 10G heap. Why is heap limited to 10G? How much RAM is available?
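
To put rough numbers on that comparison, here is a small Python sketch. It assumes each node keeps a 10G heap in both layouts, as described in this thread; the `Layout` class is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    data_nodes: int
    cores_per_node: int
    heap_gb_per_node: int

    @property
    def total_cores(self) -> int:
        return self.data_nodes * self.cores_per_node

    @property
    def total_heap_gb(self) -> int:
        return self.data_nodes * self.heap_gb_per_node


# Assumption: 10G heap per node in both cases.
two_big = Layout(data_nodes=2, cores_per_node=16, heap_gb_per_node=10)
four_small = Layout(data_nodes=4, cores_per_node=8, heap_gb_per_node=10)

for label, layout in (("2 x 16 cores", two_big), ("4 x 8 cores", four_small)):
    print(f"{label}: {layout.total_cores} cores, {layout.total_heap_gb}G heap in total")
```

The core count is the same in both layouts, but the 4-node layout has twice the total heap, and each node holds roughly half as much data.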

Available RAM is 64 GB.
We have incrementally tuned our nodes down to a 10 GB heap allocation and left the rest of the memory for the "off-heap Lucene" to work with.

The problem now is that the 10 GB heap is proving too small because of the increased data and terms.

The standard guidance is to give heap up to half of RAM, but to stay under the ~32G heap limit, ref

You'll have to decide whether the system's RAM needs allow you to allocate more of the existing RAM to heap or not. My guess, knowing nothing else about your systems, is that if 10G heap is causing pain now, increasing it would be better.
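
As a quick illustration of that rule of thumb, a Python sketch. The 31G ceiling is my own stand-in for "just under ~32G", and `suggested_heap_gb` is a hypothetical helper, not anything Elasticsearch provides.

```python
def suggested_heap_gb(ram_gb: int, ceiling_gb: int = 31) -> int:
    """Half of RAM, capped at a ceiling just under the ~32G heap limit.

    ceiling_gb=31 is an assumed safe value, not an official number.
    """
    return min(ram_gb // 2, ceiling_gb)


for ram in (16, 32, 64, 128):
    print(f"{ram}G RAM -> up to {suggested_heap_gb(ram)}G heap")
```

For a 64G machine this gives an upper bound of about 31G, so there is room to grow beyond 10G if the off-heap memory needs allow it.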

You might check the number of segments in use on the data nodes. If there are a lot of segments, running forcemerge on indices that are no longer being written to can help.
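
A rough Python sketch of that check. It only builds the REST paths and counts segments from an already parsed `GET /_cat/segments/<index>?format=json` response. The helper names, the `min_segments` cutoff, and the sample rows are invented for illustration; which indices are no longer being written to is something you have to supply yourself.

```python
from collections import Counter
from urllib.parse import quote


def cat_segments_path(index: str) -> str:
    """Path for listing segments of an index as JSON."""
    return f"/_cat/segments/{quote(index)}?format=json"


def forcemerge_path(index: str, max_num_segments: int = 1) -> str:
    """Path for a force merge request (to be sent with POST)."""
    return f"/{quote(index)}/_forcemerge?max_num_segments={max_num_segments}"


def segments_per_index(rows) -> Counter:
    """Count rows per index; _cat/segments returns one row per segment per shard copy."""
    return Counter(row["index"] for row in rows)


def merge_candidates(rows, read_only_indices, min_segments=10):
    """Indices no longer written to that have at least min_segments segments."""
    counts = segments_per_index(rows)
    return sorted(
        name for name in read_only_indices if counts.get(name, 0) >= min_segments
    )


# Invented sample rows, reduced to the one field used here.
rows = [{"index": "logs-2019.01"}] * 12 + [{"index": "logs-2019.02"}] * 3

for name in merge_candidates(rows, read_only_indices=["logs-2019.01", "logs-2019.02"]):
    print("POST", forcemerge_path(name))
```

Force merging is best kept to indices that really are done receiving writes, which is why the list of read-only indices is an explicit input here.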

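With only 2 data nodes, having 2 replicas (plus the primary) means 3 copies of each shard with only 2 nodes to hold them. Elasticsearch will not place two copies of the same shard on one node, so the extra replica stays unassigned. I'm not sure having more copies than data nodes is helpful.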
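
The copy count is easy to check with a small calculation. This Python sketch assumes the default allocation behaviour, where two copies of the same shard never share a node; `placement` is a made-up helper.

```python
def placement(data_nodes: int, replicas: int) -> dict:
    """How many copies of one shard can be assigned, at most one copy per node."""
    copies = 1 + replicas  # the primary plus its replicas
    assigned = min(copies, data_nodes)
    return {"copies": copies, "assigned": assigned, "unassigned": copies - assigned}


print("2 data nodes, 2 replicas:", placement(data_nodes=2, replicas=2))
print("4 data nodes, 2 replicas:", placement(data_nodes=4, replicas=2))
```

With 2 data nodes one copy per shard cannot be placed, while with 4 data nodes all 3 copies fit and one node is left over for each shard.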

Yup. Those suggestions were all already in place in my setup before this problem appeared.

I can increase RAM for now as a workaround. But if the data keeps growing, someone will eventually hit the 32 GB mark.
The question is what happens after that. People in the industry keep petabytes of data in their ES clusters. There should be some way to keep in-memory usage to a minimum and still reap the benefits of ES.

Network bandwidth, HDD bandwidth, and I/O limits all scale with more data nodes.
You also get more fault tolerance with more nodes, etc.
It's not just the CPU and RAM. But if you already have the 2 large data nodes and that is cheaper than 4 smaller nodes, you might as well start with the 2 larger nodes.
I would go with more than 3 nodes at a minimum if I were starting a new cluster. Just my preference. It gives you the option to set replicas to 2 without worrying about losing data when 1 node is down.
