Spark Connector performance issue [thread contd]


I have hit a problem where writing from Spark to Elasticsearch is very slow. Making the initial connection alone takes around 15 minutes, during which both Spark and Elasticsearch sit idle.
There is another thread highlighting the same issue, but it was closed without a solution.

This is how I am writing from Spark to ES:
    (vgDF.write.format("org.elasticsearch.spark.sql")
        .mode("append")
        .option("es.resource", "demoindex/type1")
        .option("es.nodes", "*ES IP*")
        .save())

Spark specifications:

    Spark 2.1.0
    6 executors, each with 3 CPUs and 10 GB RAM
    running on 3 GCE nodes

Elasticsearch specifications:

    8 CPUs, 30 GB RAM, single node
    Elasticsearch: 6.2.2
    ES-Hadoop: 6.2.2

Even after this 15-minute period, the ingestion rate is quite slow. In total, it took around 45 minutes to write only 961 rows from Spark to ES.

For reference, Spark reads the data from Cassandra, processes the results (this step is fast, taking around 1-2 minutes), and then writes to Elasticsearch.

Any help would be greatly appreciated.


You could try taking a look at the network response times by using tools like tcpdump. This will give a better idea of where the hangup is occurring, either on the Hadoop end, the Elasticsearch end, or the network in between them.
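Before reaching for tcpdump, a quicker first check is to time how long a plain TCP connection to the Elasticsearch node takes from one of the Spark machines. The following is only a rough sketch, not a replacement for a packet capture: the function names are made up for illustration, and port 9200 is assumed because it is the Elasticsearch default HTTP port.

```python
import socket
import time


def time_tcp_connect(host, port=9200, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections and name-resolution failures.
        return None


def compare_addresses(addresses, port=9200, attempts=3):
    """Map each address to its lowest connect time, or None if unreachable."""
    results = {}
    for host in addresses:
        samples = [time_tcp_connect(host, port) for _ in range(attempts)]
        reachable = [s for s in samples if s is not None]
        results[host] = min(reachable) if reachable else None
    return results
```

Running compare_addresses with every address you have for the Elasticsearch node, from a Spark worker, shows whether one of them is slow or unreachable at the network level before any Spark job is involved.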

I asked the same question on Stack Overflow too, and one person there suggested switching from the public IP (of the ELK instance) to its private IP when ingesting data from Spark into ES.

This solved both the slow initial connection and the slow writes, reducing the overall ingestion time from around 15-20 minutes to only 12-15 seconds!
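For anyone applying the same fix, here is a minimal sketch of the change, assuming PySpark and the same index as in the original post. The helper name es_write_options and the address 10.128.0.5 are hypothetical placeholders; es.nodes and es.resource are the same connector settings used in the question. The helper simply refuses an address that is not in a private range, so a public IP cannot slip back in.

```python
import ipaddress


def es_write_options(es_host_ip, resource):
    """Build ES-Hadoop writer options, rejecting non-private addresses."""
    # ip_address raises ValueError if the string is not a valid IP address.
    address = ipaddress.ip_address(es_host_ip)
    if not address.is_private:
        raise ValueError(f"{es_host_ip} is not a private address")
    return {
        "es.nodes": es_host_ip,   # private IP of the Elasticsearch node
        "es.resource": resource,  # index/type, as in the original write call
    }


# Hypothetical usage with the DataFrame from the question:
# opts = es_write_options("10.128.0.5", "demoindex/type1")
# vgDF.write.format("org.elasticsearch.spark.sql") \
#     .mode("append").options(**opts).save()
```

This only helps when the Spark nodes and the Elasticsearch node can reach each other over the private network, as they could in my setup.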

I hope this saves other people some time as well.
