Specify 'preference' query param in spark

mgreene · November 16, 2016, 9:11pm

Is there a way to specify the preference query param when initializing an RDD via JavaEsSpark.esRDD(sparkContext, "indexName/docType", esQuery)?

james.baiera · November 18, 2016, 5:20pm

At the moment there is no real way to do this. ES-Hadoop makes heavy use of the preference parameter as it is, so I'm interested in hearing your use case to see if it's something we can support better.

mgreene · November 18, 2016, 9:18pm

Hey James,

In my tests where I was monitoring network throughput on 15 data nodes, when the RDD starts streaming data, it appears to rely on a single coordinator node for the entire query to funnel data back into the Spark RDD. This proves to be a bottleneck for even modestly sized Spark clusters on modern hardware.

I ended up writing my own custom RDD that discovers all of the shards for a given index and sends the query to each shard independently, with the preference parameter set. I found this approach to be about 30-40% faster in streaming throughput for my job, which pulled about 400 million documents totaling about 10GB.

I'm using ES 2.4.
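
The custom RDD itself was not posted, so the following is only a minimal sketch of the per-shard idea, not the actual implementation. It assumes the shard count is already known (it could be discovered with the _search_shards API) and that each shard is targeted with the _shards:<id> preference value. The class name ShardPreferenceSketch, the method perShardSearchPaths, and the shard count of 3 are hypothetical, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ShardPreferenceSketch {

    // Builds one search path per shard, each pinned to a single shard
    // via preference=_shards:<id>. The shard count is an assumed input;
    // a real job would discover it from the cluster first.
    static List<String> perShardSearchPaths(String indexAndType, int shardCount) {
        List<String> paths = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            paths.add("/" + indexAndType + "/_search?preference=_shards:" + shard);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Hypothetical index with 3 shards; each path would back one RDD partition.
        for (String path : perShardSearchPaths("indexName/docType", 3)) {
            System.out.println(path);
        }
    }
}
```

Each path could then be issued from a separate Spark partition, so that the data for each shard is pulled through its own request rather than through a single coordinating node.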

system · December 16, 2016, 9:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
