Writing PySpark dataframe to Elastic Cloud (Cannot detect ES version)

That's actually a very good question with a complicated answer! I believe you can only reach the load balancer in Elastic Cloud, and it round-robins requests to all of the Elasticsearch nodes. That's a little unfortunate, because it means the whole cluster appears as a single node to es-hadoop. If es-hadoop (or Spark) gets too many failures from a node, it blacklists that node for the remainder of the job. Since there's only one "node" from its point of view, es-hadoop will bail out of your job after a small number of failures, even if in reality you have dozens of nodes behind the load balancer.
One way people have worked around this is to create several aliases for the load balancer (in their own DNS, or even locally in /etc/hosts) and then list the same load balancer multiple times in es.nodes, once per alias, so es-hadoop treats them as separate nodes (see the sketch below). I've written this up somewhere before, but I don't remember where. I'll link to it from here if I find it.
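
Here's a minimal sketch of that workaround in PySpark. The hostnames (es-lb-1.internal, etc.), index name, and credentials are hypothetical stand-ins; the es.* options themselves are standard es-hadoop settings:

```python
# Hypothetical /etc/hosts entries, all resolving to the load balancer's IP:
#   203.0.113.10  es-lb-1.internal
#   203.0.113.10  es-lb-2.internal
#   203.0.113.10  es-lb-3.internal

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-write").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

(df.write
    .format("org.elasticsearch.spark.sql")
    # List the same load balancer under several aliases so es-hadoop sees
    # multiple "nodes" and won't abort the job after blacklisting one.
    .option("es.nodes", "es-lb-1.internal,es-lb-2.internal,es-lb-3.internal")
    .option("es.port", "9243")  # Elastic Cloud's HTTPS port
    # The data nodes behind a cloud load balancer aren't directly reachable,
    # so route all traffic through the addresses given in es.nodes.
    .option("es.nodes.wan.only", "true")
    .option("es.net.ssl", "true")
    .option("es.net.http.auth.user", "elastic")      # assumption: basic auth
    .option("es.net.http.auth.pass", "<password>")
    .save("my-index"))
```

With es.nodes.wan.only enabled, es-hadoop also skips node discovery, which avoids it trying (and failing) to connect to the cluster's internal addresses.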
