JVM heap issues while indexing large data

(Siddartha Reddy) #1


I have been trying to index a large dataset (about 13,000 fields and thousands of documents, each around 450 KB) from Spark through ES-Hadoop. The indexing goes well for a small number of documents (hundreds) but fails with thousands. The logs show warnings like 'jvm spent 700ms in last 1s', then the nodes run into 'jvm heap out of memory', the cluster goes down, and the Spark job fails with it.
I have an ES cluster with 8 nodes, 32 GB of memory each (heap set to less than 50% of available), 256 GB overall, and enough disk space.
Used these settings:

"number_of_replicas": 0,
"number_of_shards": 8,
"refresh_interval": -1,
bootstrap.memory_lock: true,
indices.memory.index_buffer_size: 50%
default bulk doc size, entries, threads and queue.
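
For reference, a sketch of how settings like these are typically applied in 5.x (the index name `myindex` is a placeholder; note that `bootstrap.memory_lock` and `indices.memory.index_buffer_size` are node-level settings that belong in `elasticsearch.yml`, not in the index creation request):

```
# elasticsearch.yml (per node, requires a restart)
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 50%
```

```
# index-level settings, applied at creation time
PUT myindex
{
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}
```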

What am I doing wrong here?
Any help is appreciated.

(Mark Walkom) #2

Are you using bulk?
What version are you on?

(Siddartha Reddy) #3

I am using rdd.saveJsonToEs, which I suppose uses bulk under the hood.
I am on ES, Kibana, and X-Pack 5.5, with Hadoop connectors es-spark-2.11-5.2.1.jar and es-5.2.1.jar.

Would the connector jar version mismatch be a problem? (Assuming this has little to do with heap space.)
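
For context, a minimal sketch of how saveJsonToEs is typically wired up (the cluster address and index/type names are placeholders; in general the es-spark artifact version should match the cluster version, so 5.5 jars against a 5.5 cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark
// saveJsonToEs is also available as an RDD method via:
// import org.elasticsearch.spark._

val conf = new SparkConf()
  .setAppName("es-indexing")
  .set("es.nodes", "es-host:9200") // placeholder cluster address
val sc = new SparkContext(conf)

// jsonRdd holds one pre-serialized JSON string per document
val jsonRdd = sc.textFile("hdfs:///data/docs.json") // placeholder input path
EsSpark.saveJsonToEs(jsonRdd, "myindex/mytype")     // placeholder index/type
```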

(Siddartha Reddy) #4

Hi Mark,
Do you have any suggestions for me on this?

(Mark Walkom) #5

I don't know es-hadoop, sorry; you may want to edit the thread and move it to the Hadoop section.

(Santosh Rachuri) #6

Hi, we are also using Spark to load data into Elasticsearch via the ES-Hadoop connectors. Tell me which versions of ES and Spark you are using, and I can describe how we do it; it might help you.

(James Baiera) #8

@Siddartha_Reddy1 In which processes are you experiencing the heap issues? If it's in the Spark workers, have you been able to measure the heap usage in your worker tasks at all using any profiling tools? Another option you might want to look into is lowering the batch sizes for the connector by setting the es.batch.size.bytes and es.batch.size.entries configurations.
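
A sketch of what lowering those connector settings might look like in the Spark configuration (the values here are illustrative, not recommendations; the defaults are 1mb and 1000 respectively, per bulk request per task):

```
es.batch.size.bytes = 500kb    # default is 1mb
es.batch.size.entries = 100    # default is 1000
```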

(Siddartha Reddy) #9

I am using ES 5.5, es-spark 5.5.

(James Baiera) #10

@Siddartha_Reddy1 Are you running into heap issues on the Elasticsearch side or on the Spark side?

(Siddartha Reddy) #11

Hi James, I was experiencing heap issues on the ES nodes (the ES processes). However, the issue is now resolved after providing a mapping for all the columns (most of them are set to 'index': 'no' and 'type': 'keyword').
Now the heap usage does not even grow above 50%, but the new issue is that the indexing speed is terribly low: just 9 docs/sec, with each document around 450 KB.
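
For reference, a sketch of what such a non-indexed keyword mapping looks like (index, type, and field names are placeholders; note that in 5.x the mapping parameter is `"index": false` rather than the pre-5.0 `"index": "no"`):

```
PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "some_field":    { "type": "keyword", "index": false },
        "another_field": { "type": "keyword", "index": false }
      }
    }
  }
}
```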

(Siddartha Reddy) #12

I have the following settings:

"indices.store.throttle.type" : "none"
"number_of_replicas": 0,
"number_of_shards": 14,
"refresh_interval": -1,
"mapping.total_fields.limit": "15000",
"merge.scheduler.max_thread_count" : 1,
"translog.flush_threshold_size": "2gb",
"translog.durability": "async",
"translog.sync_interval": "10s"

(James Baiera) #13

@Siddartha_Reddy1 in that case you may need to increase the batch settings mentioned above. Have you done any measurements on the ingestion process? (rate of batch flushes from clients to server, length of request time, etc). In this case it's best to do some experimentation and profiling to determine where the bottleneck in the pipeline lies.
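
As a rough illustration of why the batch settings matter here: with the connector defaults (es.batch.size.bytes of 1mb, es.batch.size.entries of 1000), the byte cap is hit long before the entry cap for 450 KB documents, so each bulk request carries only a couple of documents. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope: how many 450 KB docs fit in one bulk request?
DOC_SIZE_KB = 450
BATCH_BYTES_KB = 1024   # es.batch.size.bytes default: 1mb
BATCH_ENTRIES = 1000    # es.batch.size.entries default

docs_per_batch = min(BATCH_BYTES_KB // DOC_SIZE_KB, BATCH_ENTRIES)
print(docs_per_batch)  # 2 -> the byte cap, not the entry cap, governs

# A hypothetical 10mb batch allows far more docs per bulk request
docs_per_big_batch = min(10 * 1024 // DOC_SIZE_KB, BATCH_ENTRIES)
print(docs_per_big_batch)  # 22
```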

(Siddartha Reddy) #14

I haven't done any profiling yet, but I will try increasing the batch size, make some measurements, and let you know.

(Siddartha Reddy) #15

@james.baiera I have run some indexing with default batch sizes and 100 executors; the request time for indexing averages 200 sec and the throttling time is 0 s. There are also a lot of instances where the request time is N/A, meaning no indexing request was received; the indexing rate there was at its worst.

Then I tried 50 executors with default batch sizes; it is better than before, but still at about 9 docs/sec, with a 150 sec request time and no throttling.

(Christian Dahlqvist) #16

That is a massive number of fields. Given the size of the documents, I assume each document only holds a subset. How did you end up with so many fields? (You may want to read this blog post and potentially revisit your mapping strategy.)

Have you supplied a mapping that contains all fields or are you relying on Elasticsearch to perform dynamic mapping? If you are relying on dynamic mapping, Elasticsearch will need to update and distribute the cluster state whenever it finds a new set of fields, and with that number of fields that will get slower and slower as the size and complexity of the cluster state grows.
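
One way to make sure no dynamic mappings get created is to disable dynamic mapping explicitly (a sketch; index, type, and field names are placeholders — `"strict"` rejects documents with unmapped fields, while `false` silently ignores them):

```
PUT myindex
{
  "mappings": {
    "mytype": {
      "dynamic": "strict",
      "properties": {
        "known_field": { "type": "keyword" }
      }
    }
  }
}
```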

(Siddartha Reddy) #17

@Christian_Dahlqvist yes, I understand that is a huge set of fields and that the JVM ran out of memory because of it.
Even though I have provided a mapping for all the fields (not leaving it up to ES to do dynamic mapping), the indexing is quite slow.
So I have split the 13,000 columns into 13 types. But now the issue is how to combine the results: is there a way in ES to include fields from all the types in a single request?
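
For what it's worth, in 5.x a single search request can target several types of the same index by listing them comma-separated in the URL (index, type, and field names below are placeholders). The hits are still returned per matching document, though, so documents split across types are not merged into one result automatically:

```
GET myindex/type1,type2,type3/_search
{
  "query": { "term": { "some_field": "some_value" } }
}
```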

(system) #18

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.