I have a single large machine (500 GB RAM, 128 cores) on which I have indexed a corpus of around 1 TB. The documents vary in length, from very short (~100 characters) to very long (a single document that is > 100 MB in size).
Requests to the index arrive as part of a batch process, so I do not have strict latency requirements; I mostly want to maximize retrieval throughput (I rarely index new documents).
My index has 50 shards, and requests are made to the index over multiple connections. I'm finding that if I open too many connections, I get ScanError: Scroll request has only succeeded on 25 (+0 skiped) shards out of 50, even though the machine does not seem particularly loaded. If I open too few connections, the machine is mostly idle and throughput is low. The optimal number of connections seems to depend on the queries I am making.
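For context, the way I fan queries out over connections is essentially a bounded worker pool, roughly like the sketch below (I'm using the Python client; run_batch and fake_worker are illustrative names, and the stub stands in for the real per-connection scroll call):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(queries, worker, max_connections):
    """Run worker(query) for every query, with at most max_connections
    in flight at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(worker, queries))

# Stub worker for illustration only; the real one would use its own
# Elasticsearch client connection and consume a scroll for the query.
def fake_worker(query):
    return {"query": query, "hits": 0}

results = run_batch(["q1", "q2", "q3"], fake_worker, max_connections=2)
```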
My questions:
Why am I getting the ScanError reported above? Is there some way to see details about why certain shards did not succeed? Do I just need to increase the number of shards?
Do you have any other suggestions for optimizing throughput on a single-node cluster, or for ES clusters in general?
Is there some better way to choose the number of connections?
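One thing I've been considering instead of hand-tuning connection counts is sliced scroll, where each slice is an independent scroll over a disjoint subset of the index. A minimal sketch of building per-slice request bodies (build_sliced_bodies is my own illustrative helper, not part of the client API):

```python
def build_sliced_bodies(query, max_slices):
    """Return one request body per slice of a sliced scroll; each slice
    covers a disjoint subset of the index and can be scrolled in parallel."""
    return [
        {"slice": {"id": slice_id, "max": max_slices}, "query": query}
        for slice_id in range(max_slices)
    ]

bodies = build_sliced_bodies({"match_all": {}}, max_slices=4)
# Each body would be handed to its own worker/connection, e.g. via
# elasticsearch.helpers.scan with that body as the query.
```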
You'd be better off splitting this host up into a few nodes. You're already running the risk of losing data by being on a single host, and a few nodes (e.g. 3) would probably make better use of the overall resources.
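For what it's worth, a sketch of what one node's elasticsearch.yml could look like with three nodes sharing the host; each node needs its own ports and data/log paths. Names, paths, and ports here are illustrative, not a recommended layout:

```yaml
# elasticsearch.yml for node-1 of three nodes on the same host (sketch)
cluster.name: single-host-cluster
node.name: node-1
http.port: 9200            # node-2: 9201, node-3: 9202
transport.port: 9300       # node-2: 9301, node-3: 9302
path.data: /var/lib/elasticsearch/node-1
path.logs: /var/log/elasticsearch/node-1
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301", "127.0.0.1:9302"]
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
```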
Thanks! Why would spinning up additional nodes make better use of the computational resources? I would expect that there are configuration changes that would let a single node take advantage of all its resources.