I am seeing a low number of writes to Elasticsearch using Spark (Java).
Here is the configuration:
Elasticsearch cluster:
- 4 i3.xlarge instances, 4 vCPUs each
- refresh_interval set to -1, number_of_replicas set to 0, and the other basic settings recommended for faster indexing
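For reference, this is roughly how those index settings are applied (a minimal sketch using the Elasticsearch low-level REST client's Request API; the host and index name are placeholders):

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.RestClient;

    // Disable refresh and replication for the duration of the bulk load;
    // both should be restored to normal values once indexing finishes.
    try (RestClient client = RestClient.builder(
            new HttpHost("es-host", 9200, "http")).build()) {
        Request request = new Request("PUT", "/my-index/_settings");
        request.setJsonEntity(
            "{\"index\":{\"refresh_interval\":\"-1\",\"number_of_replicas\":0}}");
        client.performRequest(request);
    }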
Spark: EMR cluster with
2 core instances
- 8 vCPUs, 16 GiB memory, EBS-only storage
- EBS storage: 1000 GiB
1 master node
- 1 vCPU, 3.8 GiB memory, 410 GB SSD storage
The ES index is defined with 16 shards.
The job runs with the following configuration:
executor-memory = 8g
spark.executor.instances = 2
spark.executor.cores = 4
and on the elasticsearch-hadoop side:
es.batch.size.bytes = 6MB
es.batch.size.entries = 10000
es.batch.write.refresh = false
(As I understand it, the es.batch.* limits apply per task instance, so total bulk concurrency is bounded by the number of concurrently running tasks.)
With this configuration, I try to load 1 million documents (each ~1300 bytes in size), and the load runs at about 500 docs/sec per ES node.
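To put numbers on that: 500 docs/sec × 4 data nodes ≈ 2,000 docs/sec in total, and at ~1300 bytes per document that is only about 2.6 MB/sec of data, so the full 1 million documents (~1.3 GB) take roughly 8-9 minutes. That seems far below what 6 MB / 10,000-entry bulk requests should sustain.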
In the Spark log, for each task I see:

    1116 bytes result sent to driver
Spark code (under the main method):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

    SparkConf conf = new SparkConf().setAppName("SparkES Application");
    conf.set("es.nodes", "");  // node list redacted
    conf.set("es.batch.size.bytes", "6mb");
    conf.set("es.batch.size.entries", "10000");
    conf.set("es.batch.concurrent.request", "4");
    conf.set("es.batch.write.refresh", "false");
    conf.set("spark.kryoserializer.buffer", "24");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    loadSNindex(jsc);
    // end of main

    public static void loadSNindex(JavaSparkContext jsc) {
        JavaRDD<String> javaRDD = jsc.textFile("S3 PATH");
        JavaEsSpark.saveJsonToEs(javaRDD, "Index name");
    }
Am I getting a single connection from the driver instead of one per executor? How do I validate that? (See the sketch below.)
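From what I understand of elasticsearch-hadoop, the write connections are opened by the tasks running on the executors, not by the driver, so one way to sanity-check the write parallelism is the partition count of the input RDD. A hedged sketch (the repartition value of 8 simply matches my 2 executors × 4 cores):

    // Sketch of the same loader with a parallelism check; getNumPartitions()
    // and repartition() are standard Spark APIs.
    public static void loadSNindex(JavaSparkContext jsc) {
        JavaRDD<String> javaRDD = jsc.textFile("S3 PATH");
        // Each partition is written by its own task, and each task keeps its
        // own bulk writer to ES, so the partition count bounds write concurrency.
        System.out.println("input partitions = " + javaRDD.getNumPartitions());
        JavaEsSpark.saveJsonToEs(javaRDD.repartition(8), "Index name");
    }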
Also, when I look at the network-in graph on the ES cluster, throughput is very low, so EMR is clearly not pushing much data over the network. Is there a way to tell Spark to send the right amount of data to make the writes faster?
Or is there some other config I am missing to tweak? 500 docs/sec per ES instance seems low. Can someone please point out what I am missing in these settings to improve my ES write performance?
Thanks in advance.