Spark + Elasticsearch write performance issue

Seeing a low number of writes to Elasticsearch using Spark (Java).

Here are the configurations:

Using i3.xlarge machines for the ES cluster:

4 instances, each with 4 vCPUs.
Refresh interval set to -1, replicas set to 0, and the other basic settings recommended for faster indexing.

Spark:

2-node EMR cluster with

2 Core instances

  • 8 vCPU, 16 GiB memory, EBS only storage
  • EBS Storage:1000 GiB

1 Master node

  • 1 vCPU, 3.8 GiB memory, 410 GB SSD storage

The ES index has 16 shards defined in the mapping.

Running the job with the following configuration:

executor-memory - 8g

and

es.batch.size.bytes - 6MB
es.batch.size.entries - 10000
es.batch.write.refresh - false

With this configuration I try to load 1 million documents (each about 1,300 bytes in size), and the load runs at roughly 500 docs per second per ES node.
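For context, the observed numbers work out as follows (a back-of-the-envelope estimate from the figures above, not from any log):

```java
public class ThroughputEstimate {
    public static void main(String[] args) {
        int docsPerSecPerNode = 500;   // observed rate per ES node
        int nodes = 4;                 // ES data nodes
        long totalDocs = 1_000_000L;   // documents to load
        int docBytes = 1_300;          // approximate document size

        long docsPerSec = (long) docsPerSecPerNode * nodes;              // 2,000 docs/sec total
        long seconds = totalDocs / docsPerSec;                           // time for the full load
        double mbPerSec = docsPerSec * (double) docBytes / 1024 / 1024;  // raw document bytes per second

        System.out.printf("%d s, %.1f MB/s%n", seconds, mbPerSec); // prints "500 s, 2.5 MB/s"
    }
}
```

At roughly 2.5 MB/s of raw document bytes, low traffic on the cluster's network graph is expected.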

And in the Spark log I see for each task:
-1116 bytes result sent to driver

Spark code under the main method:

SparkConf conf = new SparkConf().setAppName("SparkES Application");
conf.set("es.batch.concurrent.request", "4");
conf.set("es.batch.write.refresh", "false");
conf.set("spark.kryoserializer.buffer", "24");

JavaSparkContext jsc = new JavaSparkContext(conf);

//end of main

public static void loadSNindex(JavaSparkContext jsc) {
    JavaRDD<String> javaRDD = jsc.textFile("S3 PATH");
    JavaEsSpark.saveJsonToEs(javaRDD, "Index name");
}

Am I getting one connection from the driver instead of one per executor? How do I validate that?

Also, when I look at the In-Network graph on the ES cluster it is very low, and I can see that EMR is not sending much data over the network. Is there a way to tell Spark to send the right amount of data to make writes faster?


Is there any other config I am missing to tweak? 500 docs per second per ES instance seems low. Can someone please point out what I am missing in these settings to improve my ES write performance?

Thanks in advance

Have you done any profiling of your Spark tasks to see whether they're waiting on Elasticsearch more than processing, or the other way around? Tools like VisualVM will usually report simple metrics on CPU time spent in different parts of the code.

Additionally, your batch size in bytes seems a little low considering your average document size of 1,300 bytes. A batch of 10,000 docs at about 1,300 bytes per doc would be closer to 12 MB in size (10,000 × 1,300 / 1024 / 1024 ≈ 12.4 MB). It's possible that your batches are flushing a little too early if your goal is 10,000-document batch requests.
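The arithmetic above can be checked directly (a standalone sketch; the values come from the question, not from the connector):

```java
public class BatchSizeCheck {
    public static void main(String[] args) {
        // 10,000 docs per batch at ~1,300 bytes per doc
        long batchBytes = 10_000L * 1_300L;          // 13,000,000 bytes
        double batchMb = batchBytes / 1024.0 / 1024.0;
        System.out.printf("%.1f MB%n", batchMb);     // prints "12.4 MB"
        // es.batch.size.bytes = 6MB, so the byte limit trips long before
        // the 10,000-entry limit is reached.
    }
}
```

In other words, with both limits set, whichever threshold is hit first triggers the flush, and here the 6 MB byte limit wins every time.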

Do note that es.batch.concurrent.request is not a configuration that the connector uses.

Your RDD partitions will be initially based on what is returned from the S3 data path used. If you are seeing low throughput, perhaps you might want to increase the parallelism by repartitioning your RDD.
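As a rough starting point for the repartition count, a common heuristic is total executor cores times a small oversubscription factor. The factor of 2 below is an illustrative assumption to tune, not a connector recommendation:

```java
public class PartitionSizing {
    // Heuristic: partitions = core nodes × vCPUs per node × oversubscription factor.
    static int suggestedPartitions(int coreNodes, int vcpusPerNode, int factor) {
        return coreNodes * vcpusPerNode * factor;
    }

    public static void main(String[] args) {
        // 2 EMR core nodes × 8 vCPUs each, factor of 2
        int partitions = suggestedPartitions(2, 8, 2);
        System.out.println(partitions); // prints "32"
        // Then call javaRDD.repartition(partitions) before JavaEsSpark.saveJsonToEs(...)
    }
}
```

More partitions mean more concurrent tasks, and therefore more concurrent bulk requests spread across the 16 target shards.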


and in the spark log am seeing each task
-1116 bytes result sent to driver

Could you link your log file contents here? This seems like an error with logging.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.