I am trying to query an Elasticsearch cluster for each record in an RDD, i.e. mapping a function over a 10-million-record RDD, where each call queries a remote Elasticsearch cluster through the transport client.
I've tried using both `map` and `mapPartitions` (the latter to ensure only one transport client is created per partition). But I still get a "No node available" exception. My Spark config is as follows:
- 1 machine; I have tried both 4g and 8g of driver memory
Note: with num-executors=1 it works fine, but then I guess it's no different from running it on a single machine without Spark... So ideally I would like to run this with num-executors > 1.
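For reference, the per-partition pattern I'm describing can be sketched without Spark or Elasticsearch. `StubClient` below is a hypothetical stand-in for the transport client; the important details are that the client is created inside the partition function (the transport client is not serializable, so it can't be shipped from the driver) and that the lazy iterator is forced *before* the client is closed:

```scala
// Hypothetical stub standing in for an Elasticsearch TransportClient.
class StubClient {
  private var closed = false
  def query(id: Int): String = {
    require(!closed, "client already closed")
    s"doc-$id" // a real client would issue a get/search here
  }
  def close(): Unit = closed = true
}

// Mirrors the body of rdd.mapPartitions { iter => ... }:
// one client per partition, results materialized before close()
// so the lazy iterator isn't consumed after the client is gone.
def processPartition(records: Iterator[Int]): Iterator[String] = {
  val client = new StubClient
  val results = records.map(client.query).toList // force evaluation
  client.close()
  results.iterator
}
```

In the real job this body runs on each executor, so any cluster name, host, and port settings would also have to be constructed inside `mapPartitions` rather than captured from the driver.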
Has anyone tried doing this before? Any suggestions for solving this issue will be greatly appreciated. Thanks!