Does large index size cause high Java heap usage?

Hi all,

We use Elasticsearch 1.6.0 and run two data nodes on two servers, each with 128 GB RAM and a 24-core CPU. The ES Java heap size is set to 30 GB and the index is configured with 5 shards and 1 replica.
Unlike common log files, our documents are fairly complex, with an average size of about 500 KB each. After bulk indexing around 12 million documents, the index size on disk is about 5 TB per node, and the ES servers then become unstable: the replica node sometimes loses its connection to the primary. Another problem is that even after an old-generation GC, Java heap usage stays around 20 GB.

Monitoring with Marvel, I found that the value under "Index Statistics -> memory -> LUCENE MEMORY" keeps increasing; it is now about 30 GB. Is that related to the Java heap usage? What is LUCENE MEMORY for? And can anybody suggest how to mitigate the memory issue?
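For reference, this is roughly how I pull the same numbers outside Marvel. It is only a minimal sketch (Python with the requests library, node assumed at localhost:9200; field names are from the 1.x nodes stats response):

    # Rough sketch: pull Lucene segment memory and JVM heap from the nodes
    # stats API. Assumes a node at localhost:9200 and the requests library.
    import requests

    stats = requests.get("http://localhost:9200/_nodes/stats/indices,jvm").json()

    for node_id, node in stats["nodes"].items():
        seg_mem = node["indices"]["segments"]["memory_in_bytes"]
        heap = node["jvm"]["mem"]["heap_used_in_bytes"]
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print("%s: segments %.1f GB, heap %.1f GB (%d%%)"
              % (node["name"], seg_mem / 2.0**30, heap / 2.0**30, heap_pct))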

You should add more nodes. Scale horizontally.

Why do you want to put everything in a single index with 5 shards? If I read correctly, you have shards of ~1 TB each. The general guideline is to have neither too many indices/shards at the same time nor too few. Lucene memory is the memory used by Lucene (which Elasticsearch is built on) for segment dictionaries, buffers, etc.

@mosiddi thanks for your reply. You are right, each shard is now about 1 TB. So from your point of view it is better to split the documents into several parts and create an index for each part, e.g. divide 10 million docs into 2 x 5 million and create an index for each 5 million. Is that right?
And would it help to increase the shard count from 5 to 10? What about adding one more node but keeping the shard count at 5?

You want to keep shards smaller than 40 GB. So do both: increase the shard count and add more nodes.
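To see how far off you are, something like this rough sketch (Python with requests against localhost:9200; _cat/shards is available in 1.x) lists the per-shard store sizes:

    # Rough sketch: list per-shard store sizes with the _cat/shards API
    # to compare against the ~40 GB guideline.
    import requests

    resp = requests.get(
        "http://localhost:9200/_cat/shards",
        params={"h": "index,shard,prirep,store", "bytes": "b"},
    )

    for line in resp.text.strip().splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # unassigned shards report no store size
        index, shard, prirep, store = parts
        print("%s shard %s (%s): %.1f GB"
              % (index, shard, prirep, int(store) / 2.0**30))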

Maintain a reasonable shard size. If you have X docs in 1 index (5 shards) and 2X docs in 1 index (10 shards), you are keeping the docs-per-shard balance, and that is fine. Remember that if your search goes to all shards, 5 vs. 10 shards will matter: with 10 shards you get 10 sub-queries which, although they run in parallel, take more time once you are resource constrained.

Now Java heap usage is above 22 GB out of the 30 GB assigned, and we have had to stop indexing. Is there any quick way to reduce heap usage? I tried to disable the _all field, but that does not work because _all can only be configured when the index is created.
I also found that ES does not support changing the number of primary shards after an index is created. We already have 6 TB of data, so rebuilding the index is a huge effort. Will adding a node not help, since the shard count cannot be changed?
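If I understand correctly, the only option is to create a replacement index with _all disabled and the new shard count set up front, roughly like this sketch (index name docs_v2 and type name doc are just placeholders; in 1.6 the existing data would then have to be copied over client-side with scan/scroll plus bulk)?

    # Rough sketch: _all and the primary shard count are fixed at index
    # creation, so the replacement index is created with both set up front.
    import json
    import requests

    body = {
        "settings": {
            "number_of_shards": 10,      # cannot be changed later
            "number_of_replicas": 1
        },
        "mappings": {
            "doc": {
                "_all": {"enabled": False}   # stop building the catch-all field
            }
        }
    }

    resp = requests.put("http://localhost:9200/docs_v2", data=json.dumps(body))
    print(resp.json())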

If you can work without _all, then you can change your queries to search on specific fields so ES doesn't use _all. Some examples are at https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-all-field.html. Remember that the field data cache will already have been built up, so the heap usage will not drop, but it will be kept under control. There are ways to rebuild the cache, but whether that makes sense depends on your scenario, since rebuilding the cache is costly.
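For example, instead of letting the query fall back to _all, target the fields directly. A rough sketch in Python with requests (index name docs and field name title are placeholders):

    # Rough sketch: query a named field explicitly instead of relying on _all.
    import json
    import requests

    query = {
        "query": {
            "match": {
                "title": "example search terms"   # search one concrete field
            }
        }
    }

    resp = requests.post("http://localhost:9200/docs/_search", data=json.dumps(query))
    print(resp.json()["hits"]["total"])

    # If the field data cache needs to be dropped (it is rebuilt lazily, which
    # is costly), the clear-cache API can be used, e.g.
    # POST http://localhost:9200/docs/_cache/clear?fielddata=true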

Thanks for the answer. We are still in the bulk indexing stage; no queries are running on the cluster yet.

Thanks. Did you read this blog: https://www.elastic.co/blog/found-understanding-memory-pressure-indicator?q=lucene memory#green-is-good ? It's a good read on when to worry vs. when it's OK.

Good blog. Our ES is already above 75%.

One orthogonal question: since you don't allow queries during indexing, did you make sure the refresh interval is set to -1? [https://www.elastic.co/guide/en/elasticsearch/reference/1.6/indices-update-settings.html#bulk]
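If not, it can be changed on the fly with the update index settings API. A rough sketch (placeholder index name docs, node assumed at localhost:9200):

    # Rough sketch: switch refresh off for the bulk load and restore it
    # afterwards via the update index settings API.
    import json
    import requests

    SETTINGS_URL = "http://localhost:9200/docs/_settings"

    def set_refresh(interval):
        body = {"index": {"refresh_interval": interval}}
        return requests.put(SETTINGS_URL, data=json.dumps(body)).json()

    set_refresh("-1")    # no refreshes while bulk indexing
    # ... run the bulk load ...
    set_refresh("20s")   # restore a normal interval when done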

I just set this to 20s in elasticsearch.yml:
index.refresh_interval: 20s