Throttling indexing to Elasticsearch in Spark

Is there a way to throttle the number of tasks that index into Elasticsearch without also throttling the number of tasks used for the compute stages (map, flatMap) in Spark?

We're using the native ES-Hadoop plugin.


@shakdoesspark The number of tasks used to index into ES is based on the number of shards in ES.

So I have 5 shards on my index, 0 replicas, and I'm using the RDD.saveToEsWithMeta method, but I'm seeing 256 tasks being created. I have 8 nodes in my Spark cluster, and I've set --executor-cores to 4.

I'm seeing that it's running 32 tasks (8 nodes * 4 cores per executor).

Not true: The number of tasks used to READ from ES is based on the number of shards it is reading from. You can write to ES using any number of tasks; there's no way for the library to control this, as it is a user setting in both Hadoop and Spark. We just suggest using a number of partitions equal to the number of shards being written to as a starting point, and tuning from there.

@shakdoesspark I would advise using the RDD.repartition(x: Int) method to shrink the number of splits or to modify the original number of splits on the RDD to be a lower number.
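
For example, a minimal sketch (assuming an existing SparkContext sc; the index name, document shape, and the target of 5 partitions are placeholders matching the 5-shard example above):

import org.elasticsearch.spark._

// Dummy input: (document id, document) pairs for saveToEsWithMeta
val docs = sc.parallelize(1 to 1000).map(i => (i, Map("title" -> s"doc $i")))

// Compute stages keep their original parallelism...
val enriched = docs.mapValues(doc => doc + ("indexed" -> "true"))

// ...and only the write stage is shrunk, here to one task per target shard
enriched.repartition(5).saveToEsWithMeta("my-index")

Because repartition inserts a shuffle boundary, the map work upstream still runs at full parallelism; only the stage that actually writes to ES is reduced to 5 tasks.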

From my experience the answer is no. I have been saving my processed data into HDFS or S3, then having a separate job read it and push it to ES.
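
Roughly this pattern, as a sketch only (the paths, index name, and esConfig are placeholders, and processedDf stands for whatever the compute job produces):

// Job 1: heavy compute at full parallelism, persisted to cheap storage
processedDf.write.mode("overwrite").parquet("s3://my-bucket/processed/")

// Job 2: a separate, smaller application that only does the indexing
import org.elasticsearch.spark.sql._
val toIndex = spark.read.parquet("s3://my-bucket/processed/")
toIndex.repartition(5).saveToEs("my-index", esConfig)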

I've also been using Zeppelin notebooks, so setting --num-executors and cores is a little more difficult. Instead I've been setting them directly on the config before calling saveToEs:

spark.conf.set("spark.dynamicAllocation.enabled","false")
spark.conf.set("spark.executor.instances", "8") // --num-executors
spark.conf.set("spark.executor.cores", "4");
df.saveToEs(esConfig)

However, when I visit the Executors tab in the Spark History UI, the summary never matches what I set.

I find it difficult to get visibility into the tasks that are actually running. I have enabled the extra logging, but those logs are created on each task node and I haven't tried to use them to gauge task usage yet.
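
For what it's worth, that mismatch is expected as far as I can tell: spark.executor.instances, spark.executor.cores, and spark.dynamicAllocation.enabled are only read when the application starts, so setting them with spark.conf.set on a running session has no effect. A sketch of setting them up front instead (the app name and values are just examples):

import org.apache.spark.sql.SparkSession

// Standalone job: fix executor sizing before the SparkContext exists
val spark = SparkSession.builder()
  .appName("es-indexing")
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "8") // --num-executors
  .config("spark.executor.cores", "4")     // --executor-cores
  .getOrCreate()

// In Zeppelin the session is created for you, so these keys belong in the
// Spark interpreter settings (or spark-submit flags), not in spark.conf.set.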

RDD.repartition works great!
@jspooner, you should give that a try!

You're better off using df.coalesce(...) than df.repartition(...) for Spark efficiency.

After coalesce, why would you need to repartition?

No, I mean use coalesce instead of repartition.
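
For reference, a minimal sketch of what that looks like (the index name and the target of 5 partitions are placeholders):

import org.elasticsearch.spark.sql._

val df = spark.range(0, 1000).selectExpr("id", "concat('doc ', id) AS title")

// coalesce(5) merges existing partitions without a full shuffle,
// whereas repartition(5) shuffles everything into exactly 5 partitions.
df.coalesce(5).saveToEs("my-index")

One thing to watch: because coalesce avoids the shuffle, Spark may also run the upstream transformations in the same reduced stage, which works against the original goal of throttling only the write; repartition keeps the compute at full parallelism at the cost of a shuffle.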

Ah ok, thanks for the clarification.