Optimizing a single node ES instance for throughput

I have a single large (500GB ram and 128 cores) machine that I have indexed a corpus of around 1TB in size. The documents vary in length, from very short (~100 characters) to very long (a single documents that is > 100 MB in size).

Requests to the index arrive as part of a batch process, so I do not have strict requirements on latency, I mostly want to maximize retrieval throughput (I rarely index new documents).

My index has 50 shards, and requests are made to the index using multiple connections. I'm finding that if I have too many connections open, then I get ScanError: Scroll request has only succeeded on 25 (+0 skiped) shards out of 50 even though the machine does not seem particularly loaded. If I do not open enough connections, then the machine is mostly idle and throughput is low. I'm finding that the optimal number of connections depends on the queries I am making.

My questions:

  • Why am I getting the ScanError reported above? Is there some way to get some details as to what caused certain shards to not succeed? Do I just need to increase the number of shards?
  • Do you have any other suggestions when it comes to optimizing throughput for a single node cluster? Or even for ES clusters in general.
  • Is there some better way to choose the number of connections?

Welcome to our community! :smiley:

You'd be better off splitting this host up into a few nodes. You're already running risk of losing data due to being on a single host, but a few nodes (eg 3) would probably make better use of the overall resources.

2 Likes

Thanks! Why would spinning up additional nodes make better use of the computational resources? I would expect that there is some configurations I can change to allow a single node to take advantage of all it's resources.

Look for articles like this: Running Multiple Elasticsearch 6.x Instances on a Single Server

Java heap is limited to about 32G for best performance because it's Java, so multiple instances let you have more heap.

Anyone running in a cloud resource is highly likely to be sharing a host.

Because the max recommended ES heap size is 32 GB. See this - Use more than 32GB for heap - #2 by spinscale. So in effect, you aren't able to fully utilise the full RAM.

So you can either set up multiple instances or you can set-up multiple VM (nodes). cost-wise it would actually work out lesser for you.

Also have a look at Managing and troubleshooting Elasticsearch memory | Elastic Blog

That's not 100% accurate. Because the OS will cache commonly used files - by Elasticsearch - in any free memory.

1 Like

Ah, thanks for clarifying, Mark :+1: