I was wondering if there is a way to specify the preference query param when initializing an RDD via JavaEsSpark.esRDD(sparkContext, "indexName/docType", esQuery)?
At the moment there is no real way to do this. ES-Hadoop makes heavy use of the preference parameter as it is, so I'm interested in hearing your use case to see if it's something we can support better.
Hey James,
In my tests I monitored network throughput on 15 data nodes. When the RDD starts streaming data, it appears to rely on a single coordinator node to funnel the results of the entire query back into the Spark RDD. This proves to be a bottleneck even for modestly sized Spark clusters on modern hardware.
I ended up writing my own custom RDD that discovers all of the shards for a given index and launches the query independently against each shard with the preference parameter set. I found this approach to be about 30-40% faster in streaming throughput for my job, which pulled about 400 million documents totaling about 10 GB.
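For anyone who wants to try the same idea, here is a minimal sketch of the per-shard request building. It is not my actual RDD code. The class and method names (ShardPreferenceUrls, shardSearchUrl, allShardUrls) are made up for illustration, the node addresses and shard count are assumed to have been discovered already (for example via the index's _search_shards endpoint), and spreading shards over nodes round-robin is a simplification rather than real shard-to-node routing. The only Elasticsearch-specific parts are the preference=_shards:N and scroll query parameters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds one scroll-search URL per shard so that each
// Spark partition can pull from its own shard instead of one coordinator.
class ShardPreferenceUrls {

    // Build a search URL restricted to a single shard via the preference parameter.
    // "node" is host:port, "resource" is "indexName/docType".
    static String shardSearchUrl(String node, String resource, int shard, String keepAlive) {
        return "http://" + node + "/" + resource + "/_search"
                + "?preference=_shards:" + shard
                + "&scroll=" + keepAlive;
    }

    // One URL per shard. Assumption: shards are spread over the known nodes
    // round-robin; a real implementation would target a node that hosts the shard.
    static List<String> allShardUrls(List<String> nodes, String resource,
                                     int shardCount, String keepAlive) {
        List<String> urls = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            String node = nodes.get(shard % nodes.size());
            urls.add(shardSearchUrl(node, resource, shard, keepAlive));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed values: two nodes and three shards.
        List<String> nodes = Arrays.asList("es1:9200", "es2:9200");
        for (String url : allShardUrls(nodes, "indexName/docType", 3, "5m")) {
            System.out.println(url);
        }
    }
}
```

Each URL would then back one partition of the custom RDD, with the query body posted to it and the scroll followed until that shard is drained.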
I'm using ES 2.4.