Multiple ES clusters in SparkSQL

Hi all,

is there a way to handle multiple ES clusters in a Spark application?

For example, use sparkSQL to query data from two ES clusters.

For now, I've only seen "es.nodes" configured in the SparkConf.
Is it possible to specify "es.nodes" when reading a datasource?


Thank you.

This is related to my question "Using ES Spark to copy data from one instance to another", and the response seems to be "no way for now" :sweat_smile:

I would rephrase this. Within the same RDD/DataFrame, and thus within the same Spark task, you can't read and write data at the same time.
You can however use a temporary storage to stream the data between them.
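As a rough sketch of that temporary-storage approach (cluster addresses, index names, and the staging path below are all placeholders, not values from this thread): one stage dumps the source cluster to shared storage, and a second stage reads it back and indexes it into the destination cluster.

```scala
// Stage 1: dump from cluster A to temporary storage (e.g. Parquet on HDFS).
val source = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "cluster-a:9200")   // source cluster (placeholder)
  .load("source-index/source-type")
source.write.parquet("hdfs:///tmp/es-staging")

// Stage 2: read the staged data and write it into cluster B.
val staged = sqlContext.read.parquet("hdfs:///tmp/es-staging")
staged.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "cluster-b:9200")   // destination cluster (placeholder)
  .save("dest-index/dest-type")
```

Because the two stages are separate jobs, each one talks to a single cluster at a time, which sidesteps the read-and-write-in-the-same-task limitation.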

We would use Elasticsearch as datasource for a BI application.

In this use case, it would be a nice feature if we could specify the "es.nodes" IP via `read.option` (to handle multiple ES configurations and switch between them on the fly).

Have you tried using it? Any ES-Hadoop configuration can be passed per method call - the connector will merge them (last one wins) and run the job.
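To illustrate the per-call configuration (the addresses and index names below are placeholders): each `read` carries its own "es.nodes", which the connector merges over the SparkConf defaults, last one wins.

```scala
// Two reads in the same application, each pointing at a different cluster.
// The .option value overrides any es.nodes set in the SparkConf.
val dfA = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "cluster-a:9200")   // first cluster (placeholder)
  .load("index-a/type-a")

val dfB = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "cluster-b:9200")   // second cluster (placeholder)
  .load("index-b/type-b")
```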

Hi, and happy new year.

Indeed, that works! Thank you.

We have just one other problem with array types and nested schemas (as mentioned here: "Spark-sql does not seem to read from a nested schema") before Elasticsearch is fully usable as a datasource for our BI application.

Best regards,
CEO @datarocksIO

This has been fixed in the latest ES Hadoop release, 2.2-rc1 as described here.
Please try it out.

That works fine for arrays of primitive types.

But I get a java.lang.NullPointerException on a field typed as NESTED in the mapping.

E.g. the discovered mapping:
INFO ScalaEsRowRDD: Discovered mapping {index=[mappings=[dashboard=[doc_meta_id=STRING, id=STRING, language=STRING, sheets=NESTED, template=STRING, tenant_id=STRING, workbook_ids=STRING]]]} for [index/dashboard]

Can you post a simple snippet in Spark which creates the DataFrame and then reads it?
JSON works just fine (and typically makes things easier).
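For anyone following along, a minimal sketch of the kind of snippet being asked for, using JSON to build a DataFrame with a nested field (the field names and index below are hypothetical, loosely modeled on the mapping above):

```scala
// Build a DataFrame with a nested "sheets" field from JSON,
// then write it to ES and read it back.
val json = sc.parallelize(Seq(
  """{"id": "d1", "language": "en", "sheets": [{"name": "s1", "order": 1}]}"""
))
val df = sqlContext.read.json(json)
df.printSchema() // sheets should appear as an array of structs

df.write
  .format("org.elasticsearch.spark.sql")
  .save("index/dashboard")                 // hypothetical index/type

val roundTrip = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .load("index/dashboard")
roundTrip.show()
```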