Indexing crashing due to large dataset

I have an Elasticsearch instance running in Docker as a rootless user, and I am a beginner with Elasticsearch. My dataset is 20GB of medical documents, and I have 80GB of RAM available. Here is how I configure my index:

Settings = {
    "settings": {
        "number_of_shards": 600,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "kstem"]
                }
            },
            "filter": {
                "kstem": {
                    "type": "kstem"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "ArticleTitle": {
                "type": "text",
                "analyzer": "custom_analyzer"
            },
            "AbstractText": {
                "type": "text",
                "analyzer": "custom_analyzer"
            },
            "PMID": {
                "type": "keyword",
                "index": "false"
            }
        }
    }
}

My index crashes after indexing 12-15GB of data. Please let me know if there is a more efficient way to configure the index given these constraints. I have tried increasing the heap size to 35GB via a custom jvm.options file, but that did not help.
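For reference, the heap override I used looks roughly like this (how the file gets into the container depends on the Docker setup, so treat it as illustrative):

# jvm.options override (illustrative): fixed 35GB heap
-Xms35g
-Xmx35g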

Which version of Elasticsearch are you using?

Why have you set the number of primary shards to 600 if your data size is only 20GB?

How are you indexing data? What bulk size are you using? What is the average raw document size?

You should not set the heap beyond around 30GB as you want to benefit from compressed pointers.

Hi,

A common rule of thumb is to aim for shard sizes between 10GB and 50GB, so you might want to start with 1 or 2 shards and adjust based on your performance.
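For example, only the shard-related settings would need to change, something along these lines (a sketch; keep the rest of your settings and mappings as they are):

Settings = {
    "settings": {
        "number_of_shards": 1,   # one primary shard is plenty for ~20GB of data
        "number_of_replicas": 0
        # keep the "analysis" section as before
    }
    # keep the "mappings" section as before
}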

As Christian said, a 35GB heap might not be optimal.

Regards

Since I could find a good guide for it, I selected an older version, 7.6.0.
I am using the Helpers bulk API for indexing.
My data is in pandas data frames of about 780MB each. There are 5 such data frames, and I pass each one to the API (roughly as sketched below).
I read in the documentation that a good measure is 20 shards per GB of heap. I planned to use 30GB of heap, hence the 600 shards.
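The indexing code is roughly the following (a simplified sketch; the index name, connection details, and data frame variable are placeholders, and the column names follow the mapping above):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder connection

def df_to_actions(df, index_name="medical_articles"):  # placeholder index name
    # Turn each data frame row into one bulk action.
    for row in df.itertuples(index=False):
        yield {
            "_index": index_name,
            "_source": {
                "ArticleTitle": row.ArticleTitle,
                "AbstractText": row.AbstractText,
                "PMID": row.PMID,
            },
        }

for df in data_frames:  # the 5 data frames of ~780MB each
    helpers.bulk(es, df_to_actions(df))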

I will try indexing with a reduced number of shards and a smaller heap, and I'll post any errors I encounter here. I think I might be misreading the error. Thanks a lot for your help!

It sounds like your bulk requests may be very large. If that is the case, I would recommend shrinking them; a good size to aim for is 5MB to 10MB per request.
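With the Python bulk helpers you can cap the request size directly, for example (values illustrative, reusing the placeholder generator from the post above):

# Keep each bulk request small: at most 500 docs or ~10MB, whichever limit is hit first.
helpers.bulk(
    es,
    df_to_actions(df),
    chunk_size=500,
    max_chunk_bytes=10 * 1024 * 1024,
)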

You have misread this. The recommendation is to have an average shard size between 20GB and 50GB; that is what you should primarily aim for. The shard count per GB of heap is a maximum, not a recommended value. It came about because I saw a lot of clusters where users created far too many small shards and ended up with problems because of that. It does not apply to your use case, so you should likely go with 1 or 2 primary shards.

Thank you so much. I just indexed all my data successfully! It took a while to get back to this because I was occupied with other things.

I hope you changed the primary shard count to something more reasonable.
