Does large index size cause high Java heap usage?

Hi all,

We use Elasticsearch 1.6.0 and run two data nodes on two servers, each with 128 GB RAM and a 24-core CPU. The ES Java heap size is set to 30 GB and the index is configured with 5 shards and 1 replica.
Unlike common log files, our documents are fairly complex, with an average size of about 500 KB each. After bulk indexing around 12 million documents, the index size on disk is about 5 TB per node, and the ES servers then become unstable: the replica node sometimes loses its connection to the primary. Another problem is that even after an old-generation GC, Java heap usage stays around 20 GB.

Monitoring with Marvel, I found that the value under "Index Statistics -> memory -> LUCENE MEMORY" keeps increasing; it is now about 30 GB. Is that related to the Java heap usage? What is LUCENE MEMORY for? And can anybody suggest how to mitigate the memory issue?
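For reference, this is roughly how I pull the same numbers outside Marvel. It is only a minimal sketch (Python with the requests library, node assumed at localhost:9200; field names are from the 1.x nodes stats response):

    # Rough sketch: pull Lucene segment memory and JVM heap from the nodes
    # stats API. Assumes a node at localhost:9200 and the requests library.
    import requests

    stats = requests.get("http://localhost:9200/_nodes/stats/indices,jvm").json()

    for node_id, node in stats["nodes"].items():
        seg_mem = node["indices"]["segments"]["memory_in_bytes"]
        heap = node["jvm"]["mem"]["heap_used_in_bytes"]
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print("%s: segments %.1f GB, heap %.1f GB (%d%%)"
              % (node["name"], seg_mem / 2.0**30, heap / 2.0**30, heap_pct))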

You should add more nodes. Scale horizontally.

Why do you want to put everything in a single index with 5 shards? If I read correctly, you have shards of ~1 TB each. The general guideline is to have neither too many indices/shards at the same time nor too few. Lucene memory is the memory used by Lucene (which Elasticsearch is built on) for segment dictionaries, buffers, etc.

@mosiddi thanks for your reply. You are right, each shard is now about 1 TB. So from your point of view it is better to split the documents into several parts and create an index for each part, e.g. divide 10 million docs into 2 x 5 million and create an index for each 5 million. Is that right?
And would it help to increase the shard count from 5 to 10? What about adding one more node but keeping the shard count at 5?

You want to keep shards smaller than 40 GB. So do both: increase the shard count and add more nodes.
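To see how far off you are, something like this rough sketch (Python with requests against localhost:9200; _cat/shards is available in 1.x) lists the per-shard store sizes:

    # Rough sketch: list per-shard store sizes with the _cat/shards API
    # to compare against the ~40 GB guideline.
    import requests

    resp = requests.get(
        "http://localhost:9200/_cat/shards",
        params={"h": "index,shard,prirep,store", "bytes": "b"},
    )

    for line in resp.text.strip().splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # unassigned shards report no store size
        index, shard, prirep, store = parts
        print("%s shard %s (%s): %.1f GB"
              % (index, shard, prirep, int(store) / 2.0**30))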

Maintain a reasonable shard size. If you have X docs in 1 index (5 shards) and 2X docs in 1 index (10 shards), you are keeping the docs-per-shard balance, and that is fine. Remember that if your search goes to all shards, 5 vs. 10 shards will matter: with 10 shards you get 10 sub-queries which, although they run in parallel, take more time once you are resource constrained.

Now Java heap usage is above 22 GB out of the 30 GB assigned, and we have had to stop indexing. Is there any quick way to reduce heap usage? I tried to disable the _all field, but that does not work because _all can only be configured when the index is created.
I also found that ES does not support changing the number of primary shards after an index is created. We already have 6 TB of data, so rebuilding the index is a huge effort. Will adding a node not help, since the shard count cannot be changed?
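If I understand correctly, the only option is to create a replacement index with _all disabled and the new shard count set up front, roughly like this sketch (index name docs_v2 and type name doc are just placeholders; in 1.6 the existing data would then have to be copied over client-side with scan/scroll plus bulk)?

    # Rough sketch: _all and the primary shard count are fixed at index
    # creation, so the replacement index is created with both set up front.
    import json
    import requests

    body = {
        "settings": {
            "number_of_shards": 10,      # cannot be changed later
            "number_of_replicas": 1
        },
        "mappings": {
            "doc": {
                "_all": {"enabled": False}   # stop building the catch-all field
            }
        }
    }

    resp = requests.put("http://localhost:9200/docs_v2", data=json.dumps(body))
    print(resp.json())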

If you can work without _all, then you can change your queries to search on specific fields so ES doesn't use _all. Some examples are at https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-all-field.html. Remember that the field data cache will already have been built up, so the heap usage will not drop, but it will be kept under control. There are ways to rebuild the cache, but whether that makes sense depends on your scenario, since rebuilding the cache is costly.
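For example, instead of letting the query fall back to _all, target the fields directly. A rough sketch in Python with requests (index name docs and field name title are placeholders):

    # Rough sketch: query a named field explicitly instead of relying on _all.
    import json
    import requests

    query = {
        "query": {
            "match": {
                "title": "example search terms"   # search one concrete field
            }
        }
    }

    resp = requests.post("http://localhost:9200/docs/_search", data=json.dumps(query))
    print(resp.json()["hits"]["total"])

    # If the field data cache needs to be dropped (it is rebuilt lazily, which
    # is costly), the clear-cache API can be used, e.g.
    # POST http://localhost:9200/docs/_cache/clear?fielddata=true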

Thanks for the answer. We are still in the bulk indexing stage; no queries are running on the cluster yet.

Thanks. Did you read this blog: https://www.elastic.co/blog/found-understanding-memory-pressure-indicator?q=lucene memory#green-is-good ? It's a good read on when to worry vs. when it's OK.

Good blog. Our ES is already above 75%.

One orthogonal question: since you don't allow queries during indexing, did you make sure the refresh interval is set to -1? [https://www.elastic.co/guide/en/elasticsearch/reference/1.6/indices-update-settings.html#bulk]
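If not, it can be changed on the fly with the update index settings API. A rough sketch (placeholder index name docs, node assumed at localhost:9200):

    # Rough sketch: switch refresh off for the bulk load and restore it
    # afterwards via the update index settings API.
    import json
    import requests

    SETTINGS_URL = "http://localhost:9200/docs/_settings"

    def set_refresh(interval):
        body = {"index": {"refresh_interval": interval}}
        return requests.put(SETTINGS_URL, data=json.dumps(body)).json()

    set_refresh("-1")    # no refreshes while bulk indexing
    # ... run the bulk load ...
    set_refresh("20s")   # restore a normal interval when done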

I just set this to 20s in elasticsearch.yml:
index.refresh_interval: 20s