Indexing crashing due to large dataset

I have an Elasticsearch instance running in Docker as a rootless user, and I am a beginner with Elasticsearch. My dataset is 20GB of medical documents, and I have 80GB of RAM available. Here is how I configure my index:

Settings = {
    "settings": {
        "number_of_shards": 600,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "kstem"]
                }
            },
            "filter": {
                "kstem": {
                    "type": "kstem"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "ArticleTitle": {
                "type": "text",
                "analyzer": "custom_analyzer"
            },
            "AbstractText": {
                "type": "text",
                "analyzer": "custom_analyzer"
            },
            "PMID": {
                "type": "keyword",
                "index": "false"
            }
        }
    }
}

My index crashes after indexing 12-15GB of data. Please let me know if there is a more efficient way to configure the index given these constraints. I have tried increasing the heap size to 35GB via a custom jvm.options file, but that did not help.
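For reference, the heap override I used looks roughly like this (how the file gets into the container depends on the Docker setup, so treat it as illustrative):

# jvm.options override (illustrative): fixed 35GB heap
-Xms35g
-Xmx35g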

Which version of Elasticsearch are you using?

Why have you set the number of primary shards to 600 if your data size is only 20GB?

How are you indexing data? What bulk size are you using? What is the average raw document size?

You should not set the heap beyond around 30GB as you want to benefit from compressed pointers.

Hi,

A common rule of thumb is to aim for shard sizes between 10GB and 50GB, so you might want to start with 1 or 2 shards and adjust based on your performance.
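For example, only the shard-related settings would need to change, something along these lines (a sketch; keep the rest of your settings and mappings as they are):

Settings = {
    "settings": {
        "number_of_shards": 1,   # one primary shard is plenty for ~20GB of data
        "number_of_replicas": 0
        # keep the "analysis" section as before
    }
    # keep the "mappings" section as before
}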

As Christian said, a 35GB heap might not be optimal.

Regards

Since I could find a good guide for it, I selected an older version, 7.6.0.
I am using the Helpers bulk API for indexing.
My data is in pandas data frames of about 780MB each. There are 5 such data frames, and I pass each one to the API (roughly as sketched below).
I read in the documentation that a good measure is 20 shards per GB of heap. I planned to use 30GB of heap, hence the 600 shards.
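The indexing code is roughly the following (a simplified sketch; the index name, connection details, and data frame variable are placeholders, and the column names follow the mapping above):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder connection

def df_to_actions(df, index_name="medical_articles"):  # placeholder index name
    # Turn each data frame row into one bulk action.
    for row in df.itertuples(index=False):
        yield {
            "_index": index_name,
            "_source": {
                "ArticleTitle": row.ArticleTitle,
                "AbstractText": row.AbstractText,
                "PMID": row.PMID,
            },
        }

for df in data_frames:  # the 5 data frames of ~780MB each
    helpers.bulk(es, df_to_actions(df))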

I will try indexing with a reduced number of shards and a smaller heap, and I'll post any errors I encounter here. I think I might be misreading the error. Thanks a lot for your help!

It sounds like your bulk requests may be very large. If that is the case, I would recommend shrinking them; a good size to aim for is 5MB to 10MB per request.
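With the Python bulk helpers you can cap the request size directly, for example (values illustrative, reusing the placeholder generator from the post above):

# Keep each bulk request small: at most 500 docs or ~10MB, whichever limit is hit first.
helpers.bulk(
    es,
    df_to_actions(df),
    chunk_size=500,
    max_chunk_bytes=10 * 1024 * 1024,
)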

You have misread this. The recommendation is to have an average shard size between 20GB and 50GB; that is what you should primarily aim for. The shard count per GB of heap is a maximum, not a recommended value. It came about because I saw a lot of clusters where users created far too many small shards and ended up with problems because of that. It does not apply to your use case, so you should likely go with 1 or 2 primary shards.

Thank you so much. I just indexed all my data successfully! It took a while to get back to this because I was occupied with other things.

I hope you changed the primary shard count to something more reasonable.
