I have a single large machine (500 GB RAM, 128 cores) on which I have indexed a corpus of around 1 TB. The documents vary in length, from very short (~100 characters) to very long (a single document that is > 100 MB in size).
Requests to the index arrive as part of a batch process, so I do not have strict latency requirements; I mostly want to maximize retrieval throughput (I rarely index new documents).
My index has 50 shards, and requests are made to the index over multiple connections. I'm finding that if I open too many connections, I get ScanError: Scroll request has only succeeded on 25 (+0 skiped) shards out of 50, even though the machine does not seem particularly loaded. If I open too few connections, the machine is mostly idle and throughput is low. The optimal number of connections seems to depend on the queries I am making.
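For context, the way I fan queries out over connections is essentially a bounded worker pool, roughly like the sketch below (I'm using the Python client; run_batch and fake_worker are illustrative names, and the stub stands in for the real per-connection scroll call):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(queries, worker, max_connections):
    """Run worker(query) for every query, with at most max_connections
    in flight at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(worker, queries))

# Stub worker for illustration only; the real one would use its own
# Elasticsearch client connection and consume a scroll for the query.
def fake_worker(query):
    return {"query": query, "hits": 0}

results = run_batch(["q1", "q2", "q3"], fake_worker, max_connections=2)
```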
My questions:
Why am I getting the ScanError reported above? Is there some way to see details about why certain shards did not succeed? Do I just need to increase the number of shards?
Do you have any other suggestions for optimizing throughput on a single-node cluster, or for ES clusters in general?
Is there some better way to choose the number of connections?
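One thing I've been considering instead of hand-tuning connection counts is sliced scroll, where each slice is an independent scroll over a disjoint subset of the index. A minimal sketch of building per-slice request bodies (build_sliced_bodies is my own illustrative helper, not part of the client API):

```python
def build_sliced_bodies(query, max_slices):
    """Return one request body per slice of a sliced scroll; each slice
    covers a disjoint subset of the index and can be scrolled in parallel."""
    return [
        {"slice": {"id": slice_id, "max": max_slices}, "query": query}
        for slice_id in range(max_slices)
    ]

bodies = build_sliced_bodies({"match_all": {}}, max_slices=4)
# Each body would be handed to its own worker/connection, e.g. via
# elasticsearch.helpers.scan with that body as the query.
```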
You'd be better off splitting this host up into a few nodes. You're already running the risk of losing data by being on a single host, and a few nodes (e.g. 3) would probably make better use of the overall resources.
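For what it's worth, a sketch of what one node's elasticsearch.yml could look like with three nodes sharing the host; each node needs its own ports and data/log paths. Names, paths, and ports here are illustrative, not a recommended layout:

```yaml
# elasticsearch.yml for node-1 of three nodes on the same host (sketch)
cluster.name: single-host-cluster
node.name: node-1
http.port: 9200            # node-2: 9201, node-3: 9202
transport.port: 9300       # node-2: 9301, node-3: 9302
path.data: /var/lib/elasticsearch/node-1
path.logs: /var/log/elasticsearch/node-1
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301", "127.0.0.1:9302"]
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
```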
Thanks! Why would spinning up additional nodes make better use of the computational resources? I would expect that there are configuration changes that would let a single node take advantage of all its resources.