I have been trying to index a large dataset (about 13,000 fields and thousands of documents, each around 450 KB) from Spark through ES-Hadoop. The indexing goes well for a small number of documents (hundreds) but fails with thousands. The logs show 'jvm spent 700ms in last 1s' and then the nodes run into 'jvm heap out of memory', the cluster goes down, and with it the Spark job.
I have an ES cluster with 8 nodes, 32 GB of heap each (less than 50% of the available memory), 256 GB in total, and enough disk space.
Used these settings:
I am using rdd.saveJsonToEs, which I suppose uses the bulk API.
I am using ES, Kibana, and X-Pack 5.5, and the Hadoop connector jars es-spark-2.11-5.2.1.jar and es-5.2.1.jar.
Would the connector jar version mismatch be a problem? (I assume this has little to do with heap space.)
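For context, the write path looks roughly like this (a minimal sketch; the endpoint, input path, and index/type names are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveJsonToEs to RDD[String]

val conf = new SparkConf()
  .setAppName("es-bulk-indexing")
  .set("es.nodes", "es-node-1:9200")      // placeholder ES endpoint

val sc = new SparkContext(conf)

// Each element of this RDD is one JSON document (~450 KB in my case)
val jsonDocs = sc.textFile("hdfs:///data/docs/part-*")

// saveJsonToEs sends the documents to Elasticsearch through the bulk API
jsonDocs.saveJsonToEs("myindex/mytype")   // placeholder index/type
```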
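If it does matter, I suppose I would switch the dependency so the connector matches the 5.5 cluster, roughly like this (a sketch; the artifact coordinates are assumed from the usual ES-Hadoop naming for Spark 2.x / Scala 2.11):

```scala
// build.sbt (sketch): align the connector version with the 5.5 cluster
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.5.0"
```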
Hi, we are also using Spark to load data into Elasticsearch through the ES-Hadoop connectors. Could you tell me which versions of ES and Spark you are using? Then I can describe how we are doing it; it might help you.
@Siddartha_Reddy1 In which processes are you experiencing the heap issues? If it's in the Spark workers, have you been able to measure the heap usage in your worker tasks at all using any profiling tools? Another option you might want to look into is lowering the batch sizes for the connector by setting the es.batch.size.bytes and es.batch.size.entries configurations.
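For example, something along these lines (a sketch; the values are only illustrative, and jsonDocs stands for whatever RDD you are saving):

```scala
import org.elasticsearch.spark._

// Smaller bulk batches put less pressure on the heap at the cost of more requests.
// Defaults are 1mb / 1000 entries per task; the values below are only illustrative.
val writeCfg = Map(
  "es.batch.size.bytes"   -> "512kb",
  "es.batch.size.entries" -> "50"
)

jsonDocs.saveJsonToEs("myindex/mytype", writeCfg)   // placeholder resource
```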
Hi James, I was experiencing the heap issues on the ES nodes (the ES processes). However, the issue is now resolved after providing a mapping for all the columns (most of them are set to 'index': 'no' and 'type': 'keyword').
Now the heap usage does not even grow above 50%. But the new issue is that the indexing speed is terribly low (just 9 docs/sec), and each document is 450 KB.
@Siddartha_Reddy1 In that case you may need to increase the batch settings mentioned above. Have you done any measurements on the ingestion process (rate of batch flushes from clients to the server, length of request time, etc.)? It's best to do some experimentation and profiling to determine where the bottleneck in the pipeline lies.
@james.baiera I have run some indexing with the default batch sizes and 100 executors; the average indexing request time is around 200 s and the throttling time is 0 s. There are also many instances where the request time shows N/A, meaning no indexing requests were received at all. The indexing rate was at its poorest.
I then tried 50 executors with the default batch sizes. It is better than before, but still only about 9 docs/s, with a ~150 s request time and no throttling.
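For the next run I am planning to combine fewer concurrent writers with larger bulk batches, roughly like this (a sketch; the coalesce count and batch values are just guesses to experiment with, not recommendations):

```scala
import org.elasticsearch.spark._

// Fewer partitions -> fewer concurrent bulk writers hitting the cluster;
// larger batch limits -> each bulk request carries more of the ~450 KB documents.
val tunedCfg = Map(
  "es.batch.size.bytes"   -> "10mb",   // illustrative; default is 1mb
  "es.batch.size.entries" -> "500"     // illustrative; default is 1000
)

jsonDocs
  .coalesce(50)                        // roughly match the 50-executor run
  .saveJsonToEs("myindex/mytype", tunedCfg)
```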
That is a massive number of fields. Given the size of the documents I assume each document only holds a subset. How did you end up with so many fields? (You may want to read this blog post and potentially revisit your mapping strategy)
Have you supplied a mapping that contains all fields or are you relying on Elasticsearch to perform dynamic mapping? If you are relying on dynamic mapping, Elasticsearch will need to update and distribute the cluster state whenever it finds a new set of fields, and with that number of fields that will get slower and slower as the size and complexity of the cluster state grows.
@Christian_Dahlqvist Yes, I understand that is a huge set of fields and that the JVM ran out of memory because of it.
Even though I have provided a mapping for all the fields (not leaving it up to ES to create a dynamic mapping), the indexing is still quite slow.
So I have split the 13,000 columns into 13 types. But now the issue is how to combine the results: is there a way in ES to include fields from all the types in a single request?
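Concretely, what I am hoping for is a single read that spans the split types, something like this (a sketch; the index and type names are placeholders, and I'm assuming the connector accepts a comma-separated type list for reads):

```scala
import org.elasticsearch.spark._

// Pull documents from several of the split types back in one read.
// Assumes ES-Hadoop's comma-separated multi-type resource syntax works for reads.
val combined = sc.esRDD("myindex/type1,type2,type3")

// Each element is (documentId, fieldMap); I would still need to group by id
// on the Spark side to stitch the 13 partial documents back into one row.
combined.take(5).foreach(println)
```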