I have an Elasticsearch instance running in Docker as a rootless user. I am also a beginner with Elasticsearch. My data is about 20GB of medical documents, and I have 80GB of RAM available. Here is how I configure my Elasticsearch.
My index crashes after indexing 12-15GB of data. Please let me know if there is a more efficient way to configure my index given these constraints. I have tried increasing the heap size through a custom jvm.options file and set it to 35GB, but that did not help.
A common rule of thumb is to aim for shard sizes between 10GB and 50GB, so you might want to start with 1 or 2 primary shards and adjust based on the performance you see.
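For example, assuming the Python client and a placeholder index name, creating the index with a single primary shard would look roughly like this (a sketch, not your exact setup):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Sketch only: "medical-docs" is a placeholder index name.
# One primary shard keeps ~20GB of data well inside the recommended shard size;
# zero replicas because a single-node Docker setup cannot allocate them anyway.
es.indices.create(
    index="medical-docs",
    body={
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
        }
    },
)
```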
As Christian has said, a heap size of 35GB might not be optimal: heaps above roughly 31GB lose compressed object pointers, so it is usually better to stay below that threshold.
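On the official Docker image, the heap can also be set through the ES_JAVA_OPTS environment variable instead of a custom jvm.options file; something like the following (values are illustrative; keep -Xms and -Xmx equal and below ~31GB):

```bash
docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms30g -Xmx30g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.6.0
```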
Since I could find a good guide for it, I selected an older version, 7.6.0.
I am using the Helpers bulk API for indexing.
My data is in pandas DataFrames of about 780MB each; there are 5 such DataFrames, which I pass to the API.
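Roughly, the indexing code looks like this (the index name, column handling, and the list of DataFrames are placeholders, not my exact code):

```python
import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def df_to_actions(df: pd.DataFrame, index_name: str):
    # One bulk "index" action per DataFrame row; all columns go into _source.
    for _, row in df.iterrows():
        yield {"_index": index_name, "_source": row.to_dict()}

def index_dataframes(dataframes, index_name="medical-docs"):
    # dataframes: the five ~780MB pandas DataFrames mentioned above.
    for df in dataframes:
        helpers.bulk(es, df_to_actions(df, index_name))
```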
I read in the documentation that a good measure is 20 shards per GB of heap. I planned to use 30GB of heap, and thus the 600 shards.
I will try indexing with a reduced number of shards and a smaller heap size, and I'll post any errors I encounter here. I think I might be interpreting the error incorrectly. Your help is massively appreciated!
It sounds like your bulk requests may be very large. If that is the case, I would recommend shrinking them. A good size to aim for is 5MB to 10MB per request.
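If you are using the Python helpers, you can cap both the document count and the byte size of each bulk request; a sketch, assuming an actions generator like the one shown earlier in the thread:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def bulk_in_small_chunks(actions):
    # helpers.bulk splits the action stream into separate requests;
    # max_chunk_bytes keeps each request at or under ~10MB.
    return helpers.bulk(
        es,
        actions,
        chunk_size=500,                     # max documents per request
        max_chunk_bytes=10 * 1024 * 1024,   # max bytes per request (~10MB)
        request_timeout=120,                # give large requests time to complete
    )
```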
You have misread this. The recommendation is to have an average shard size between 20GB and 50GB. This is what you should primarily aim for. The shard-count-to-heap-size ratio is a maximum, not a recommended value. It came about because I saw a lot of clusters where users created far too many small shards and ended up with problems as a result. That does not apply to your use case, so you should likely go with 1 or 2 primary shards.