I have a single large machine (500 GB RAM, 128 cores) on which I have indexed a corpus of around 1 TB. The documents vary in length, from very short (~100 characters) to very long (a single document that is > 100 MB).
Requests to the index arrive as part of a batch process, so I do not have strict latency requirements; I mostly want to maximize retrieval throughput (I rarely index new documents).
My index has 50 shards, and requests are made to the index over multiple connections. I'm finding that if I have too many connections open, then I get:

ScanError: Scroll request has only succeeded on 25 (+0 skipped) shards out of 50

even though the machine does not seem particularly loaded. If I do not open enough connections, the machine is mostly idle and throughput is low. I'm also finding that the optimal number of connections depends on the queries I am making.
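At the moment I pick the connection count with a crude offline sweep, roughly like this (a simplified sketch, not my real client code; `fetch_batch` is a stand-in for one scroll round-trip):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(fetch_batch, n_workers, n_batches=100):
    """Docs/sec with n_workers concurrent connections; fetch_batch() stands in
    for one scroll round-trip and returns the number of docs it retrieved."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        counts = list(pool.map(lambda _: fetch_batch(), range(n_batches)))
    return sum(counts) / (time.perf_counter() - start)

def pick_workers(fetch_batch, candidates=(2, 4, 8, 16, 32)):
    """Increase the worker count until throughput stops improving by ~10%."""
    best_n, best_tp = candidates[0], 0.0
    for n in candidates:
        tp = measure_throughput(fetch_batch, n)
        if tp < best_tp * 1.10:  # diminishing returns -> stop
            break
        best_n, best_tp = n, tp
    return best_n
```

The knee of that curve moves with the query mix, which is why I keep having to re-run the sweep.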
- Why am I getting the ScanError reported above? Is there some way to get details about what caused certain shards to fail? Do I just need to increase the number of shards?
- Do you have any other suggestions for optimizing throughput on a single-node cluster, or for ES clusters in general?
- Is there some better way to choose the number of connections?
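On the first question, the only per-shard detail I've found so far is the `_shards.failures` section of the raw scroll response (the scan helper raises ScanError before I can inspect it, unless I pass `raise_on_error=False`). A small helper like the following surfaces the reported reasons, though it doesn't explain the underlying cause; the sample response below is abridged and illustrative, not a real capture:

```python
def summarize_shard_failures(resp):
    """Return one human-readable line per failed shard in a search/scroll response."""
    failures = resp.get("_shards", {}).get("failures", [])
    return [
        f"shard {f.get('shard')}: {f.get('reason', {}).get('type')} - "
        f"{f.get('reason', {}).get('reason')}"
        for f in failures
    ]

# Abridged, illustrative shape of a partial-failure response:
sample = {
    "_shards": {
        "total": 50,
        "successful": 25,
        "skipped": 0,
        "failed": 25,
        "failures": [
            {
                "shard": 7,
                "index": "corpus",
                "reason": {
                    "type": "es_rejected_execution_exception",
                    "reason": "rejected execution ... queue capacity = 1000",
                },
            }
        ],
    }
}

for line in summarize_shard_failures(sample):
    print(line)
```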
Welcome to our community!
You'd be better off splitting this host up into a few nodes. You're already running the risk of losing data by being on a single host, and a few nodes (e.g. 3) would probably make better use of the overall resources.
Thanks! Why would spinning up additional nodes make better use of the computational resources? I would expect that there are some configuration changes that would allow a single node to take advantage of all its resources.
Look for articles like this: Running Multiple Elasticsearch 6.x Instances on a Single Server
The Java heap is limited to about 32 GB for best performance (above that the JVM can no longer use compressed object pointers), so multiple instances let you have more total heap.
Anyone running in the cloud is highly likely to be sharing a host anyway.
Because the max recommended ES heap size is 32 GB - see Use more than 32GB for heap - #2 by spinscale. So in effect, you aren't able to fully utilise all of the RAM.
So you can either set up multiple instances or set up multiple VMs (nodes). Cost-wise, it would actually work out cheaper for you.
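For the multiple-instances route, each instance mostly just needs its own node name, data path, and ports. A sketch of the two config files (setting names here are for recent ES versions and may differ on 6.x; paths and ports are illustrative, so check the docs for your version):

```yaml
# Instance A: /etc/elasticsearch-a/elasticsearch.yml
cluster.name: corpus
node.name: node-a
path.data: /data/es-a
http.port: 9200
transport.port: 9300
# Cap the heap per instance below ~32 GB in each instance's jvm.options:
#   -Xms30g
#   -Xmx30g

# Instance B: /etc/elasticsearch-b/elasticsearch.yml
cluster.name: corpus
node.name: node-b
path.data: /data/es-b
http.port: 9201
transport.port: 9301
discovery.seed_hosts: ["127.0.0.1:9300"]
# (On the first start of a brand-new cluster, 7.x+ also wants
#  cluster.initial_master_nodes set.)
```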
Also have a look at Managing and troubleshooting Elasticsearch memory | Elastic Blog
That's not 100% accurate: the OS will use any free memory to cache files that Elasticsearch reads frequently, so the RAM above the heap isn't wasted.
Ah, thanks for clarifying, Mark!