I'm trying to get a Spark job running that pulls several million documents
from an Elasticsearch cluster for some analytics that cannot be done via
aggregations. It was my understanding that es-hadoop maintained data
locality when the Spark cluster runs alongside the Elasticsearch cluster,
but I'm not finding that to be the case. My setup is one index with 20
shards across 20 ES nodes. One task is created per Elasticsearch shard,
but these tasks aren't distributed evenly across my Spark cluster (e.g. 3
Spark nodes end up getting all of the work, and the tasks a Spark node
receives don't correspond to the ES shards on that same node). Could there
be something wrong with my setup?
The job does end up running correctly, but it is very slow (roughly 5
minutes for about 3 million documents) and there is clearly a lot of
network IO. Any tips would be appreciated.
For the record, what Spark and es-hadoop versions are you using?
For each shard in your index, es-hadoop creates one Spark task, which gets informed of the whereabouts of the underlying shard. So in your case, you would end up with 20 tasks/workers, one per shard, streaming data back to the master.
Based on your description it sounds like you only have 3 tasks that iterate through all 20 shards; can you double-check the parallelism on the Spark side and see whether there's any limit imposed?
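A quick way to check for such a limit is to print the relevant Spark settings from the driver. This is just a sketch: the properties below are standard Spark configuration keys, but whether any of them is actually capping your job is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: inspect the Spark settings that can cap concurrent tasks.
// Which (if any) of these is the culprit in your setup is a guess.
val conf = new SparkConf().setAppName("parallelism-check")
val sc = new SparkContext(conf)

// Number of tasks Spark aims for by default on this cluster.
println(s"defaultParallelism: ${sc.defaultParallelism}")
// Total-cores cap in standalone mode, if set.
println(s"spark.cores.max: ${conf.get("spark.cores.max", "not set")}")
// Executor count on YARN, if set.
println(s"spark.executor.instances: ${conf.get("spark.executor.instances", "not set")}")
```

If `defaultParallelism` or the cores cap is well below 20, only a few executors will ever run at once, regardless of how many tasks es-hadoop creates.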
Try running a basic read RDD and see whether the right number of tasks is produced and whether there are enough Spark nodes to run them in parallel.
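As a sketch of that check, assuming the native Spark integration (the node address and the index name `my-index/doc` are placeholders for your own values):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

// Sketch: read from ES and verify one partition (task) per shard is created.
val conf = new SparkConf()
  .setAppName("es-read-check")
  .set("es.nodes", "es-node-1:9200") // placeholder address

val sc = new SparkContext(conf)
val rdd = sc.esRDD("my-index/doc")   // placeholder index/type

// With 20 shards, this should report 20 partitions.
println(s"partitions: ${rdd.partitions.length}")
println(s"docs: ${rdd.count()}")
```

If this reports 20 partitions but only a few run at a time, the bottleneck is on the Spark scheduling side rather than in es-hadoop.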
P.S. Are you using the M/R layer or the native Spark integration?
On 12/31/14 7:17 PM, Elliott Bradshaw wrote: