Specify 'preference' query param in spark

(Mgreene) #1

I was wondering if there was a way to specify the preference query param when initializing an RDD via JavaEsSpark.esRDD(sparkContext, "indexName/docType", esQuery);?

(James Baiera) #2

At the moment there is no real way to do this. ES-Hadoop makes heavy use of the preference parameter as it is, so I'm interested in hearing your use case to see if it's something we can support better.

(Mgreene) #3

Hey James,

In my tests where I was monitoring network throughput on 15 data nodes, when the RDD starts streaming data, it appears to rely on a single coordinator node for the entire query to funnel data back into the Spark RDD. This proves to be a bottleneck for even modestly sized Spark clusters on modern hardware.

I ended up writing my own custom RDD that discovers all of the shards for a given index and launches the query independently to each shard with the preference parameter set. I found this approach to be about 30-40% faster in streaming throughput for my job which pulled about 400 million documents totaling about 10GB.

I'm using ES 2.4.

(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.