Spark Connector performance issues with startup time

I am currently experiencing performance issues where the Elasticsearch Spark connector is stuck for tens of minutes warming up / starting, with no tasks or jobs allocated. See the screenshot. Any idea what is causing the tens of minutes of warm-up / startup time?

I have leveraged both PySpark and the Spark shell (Scala) and noticed the same issue.

Here's how I perform the query:

# Read from all indices matching 'argus*' via the ES-Hadoop Spark SQL source
argusDF = (sqlContext.read.format('org.elasticsearch.spark.sql')
    .option('es.resource', 'argus*')
    .option('es.nodes', esCluster)
    .option('es.port', '9200')
    .option('pushdown', 'true')
    .option('es.scroll.keepalive', '2m')
    .option('es.http.timeout', '2m')
    .option('es.input.max.docs.per.partition', '10000')
    .option('es.scroll.size', '5000')
    .option('double.filtering', 'true')
    .option('es.index.read.missing.as.empty', 'true')
    .option('es.read.field.empty.as.null', 'true')
    .load())
argusTable = 'argus'
argusDF.registerTempTable(argusTable)

sqlContext.sql('select ' + fieldsToQueryCSV + ' from ' + argusTable + ' where ' +
'src_ip= "'+ ipAddress + '" OR ' +
'dest_ip= "'+ ipAddress +'"')

@tranan - that's really peculiar. Are you able to get onto the Spark executor and collect a thread dump from the process, using something like jstack, to see what it is stuck doing?
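For reference, a minimal sketch of grabbing such a dump from an executor host (it assumes the JDK's jstack is on the PATH and that you have already located the executor JVM's PID, for example with jps -lm; the PID and output path below are placeholders):

import subprocess

def dump_executor_threads(pid, out_path):
    # jstack ships with the JDK; -l adds lock / synchronizer details.
    with open(out_path, 'w') as out:
        subprocess.check_call(['jstack', '-l', str(pid)], stdout=out)

# Replace the placeholder PID with the executor JVM's PID from jps -lm.
dump_executor_threads(12345, '/tmp/executor-threads.txt')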

Attached is a screenshot from the executor tab. Note that there are a number of executors and cores that are live but no tasks.

Second, I clicked the thread dump link on one of the live executors; see the thread dump screenshot below. It looks similar for all of the live executors.

Please note this is running on YARN. I have also reproduced this issue on another cluster running Spark Standalone, pointing to the same Elasticsearch cluster. I also want to note that the regular REST and Java APIs work perfectly.

It seems the cluster is totally idle, waiting for work to be assigned. Would you be able to pull a thread dump from your Spark driver to see what it might be stuck doing that is keeping it from assigning work?

Below is the thread dump of the driver... I suspect the cluster is trying to get a file handle on all of the shards in the cluster and is blocking until it can. There are roughly 3300 shards in the cluster, about 30 TB of data. I also noticed that if I change the HTTP timeout to less than 1m, I get a timeout exception.
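As a sanity check on that shard count, here is a quick sketch of counting the shards the argus* pattern resolves to (it assumes the requests package is available and that esCluster is a single HTTP-reachable node on port 9200):

import requests

# _cat/shards prints one line per shard copy; keep only primaries to count shards.
resp = requests.get('http://' + esCluster + ':9200/_cat/shards/argus*?h=index,shard,prirep', timeout=120)
primaries = [line for line in resp.text.splitlines() if line.strip().endswith('p')]
print('Primary shards matched by argus*:', len(primaries))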

I'm not sure I see any ES-Hadoop related classes in those thread dumps. ES-Hadoop does create at least one Spark partition for each shard in the indices that it wants to read from. It's possible that Spark is having a hard time handling that many partitions during its startup phase.
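One quick way to test that theory, as a sketch reusing the argusDF from the snippet above: ask Spark how many input partitions the connector created. Note that computing this triggers the same shard / slice lookups the driver performs at startup, so it may itself take a while against ~3300 shards.

# Number of tasks the driver will have to schedule for the read.
num_parts = argusDF.rdd.getNumPartitions()
print('Input partitions created by the connector:', num_parts)

If that number is in the thousands (with es.input.max.docs.per.partition set to 10000, large shards may be split into several partitions each, pushing it even higher), narrowing es.resource from argus* to specific indices or raising es.input.max.docs.per.partition should bring it down.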
