Multiple ES clusters in SparkSQL

ludonara · November 25, 2015, 7:59am

Hi all,

Is there a way to handle multiple ES clusters in a Spark application?

For example, use sparkSQL to query data from two ES clusters.

So far, I've only seen "es.nodes" configured in SparkConf.
Is it possible to specify "es.nodes" when reading a data source?

Example: sqlContext.read.format("org.elasticsearch.spark.sql").option("es.nodes","X.X.X.X").load("logstash-2015.10.30/type")

Thank you.

ludonara · November 25, 2015, 9:58am

This is related to my question "Using ES Spark to copy data from one instance to another", where the answer seems to be "no way for now".

costin · November 25, 2015, 9:21pm

I would rephrase this. Within the same RDD/DataFrame and thus within the same Spark task you can't read and write data at the same time.
You can, however, use temporary storage to stream the data between them.
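
A rough PySpark sketch of the temporary-storage approach, assuming Parquet files as the staging area: read from the source cluster, write to a temporary path, then read that back and write it to the destination cluster. The function name, addresses, and paths are hypothetical, and the per-call "es.nodes" option is an assumption about how each cluster is selected.

```python
ES_SOURCE = "org.elasticsearch.spark.sql"


def copy_via_parquet(sql_context, src_nodes, src_resource,
                     dst_nodes, dst_resource, tmp_path):
    """Copy an index/type between clusters through a staged Parquet copy."""
    # Step 1: read from the source cluster.
    source_df = (sql_context.read.format(ES_SOURCE)
                 .option("es.nodes", src_nodes)
                 .load(src_resource))

    # Step 2: stage the data outside Elasticsearch (temporary storage).
    source_df.write.mode("overwrite").parquet(tmp_path)

    # Step 3: read the staged copy and write it to the destination cluster.
    staged_df = sql_context.read.parquet(tmp_path)
    (staged_df.write.format(ES_SOURCE)
     .option("es.nodes", dst_nodes)
     .mode("append")
     .save(dst_resource))
```

With a real SQLContext this would be called as, for example, copy_via_parquet(sqlContext, "10.0.0.1", "logstash-2015.10.30/type", "10.0.0.2", "logstash-copy/type", "/tmp/es-staging"), where every argument value is a placeholder.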

ludonara · November 26, 2015, 7:17pm

We want to use Elasticsearch as a data source for a BI application.

For this use case, it would be a nice feature if we could specify the es.nodes IP via "read.option" (to handle multiple ES configurations and switch between them on the fly).

costin · December 8, 2015, 2:08pm

Have you tried using it? Any ES-Hadoop configuration can be passed per method call - the connector will merge them (last one wins) and run the job.
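
A minimal PySpark sketch of what per-call configuration could look like: each read carries its own "es.nodes" option, so two DataFrames in the same application point at different clusters. The helper names, IP addresses, second index name, and table names are hypothetical.

```python
ES_SOURCE = "org.elasticsearch.spark.sql"


def read_es(sql_context, nodes, resource, extra_options=None):
    """Load an index/type as a DataFrame from the cluster reachable at `nodes`.

    Options set on this call are merged with the global configuration;
    per the reply above, the last value set for a key wins.
    """
    reader = sql_context.read.format(ES_SOURCE).option("es.nodes", nodes)
    for key, value in (extra_options or {}).items():
        reader = reader.option(key, value)
    return reader.load(resource)


def register_two_clusters(sql_context):
    """Expose data from two clusters as temp tables for SparkSQL queries."""
    # Hypothetical cluster addresses and index names.
    first = read_es(sql_context, "10.0.0.1", "logstash-2015.10.30/type")
    second = read_es(sql_context, "10.0.0.2", "logstash-2015.10.31/type")
    first.registerTempTable("logs_cluster_a")
    second.registerTempTable("logs_cluster_b")
    return first, second
```

After register_two_clusters(sqlContext), a single sqlContext.sql(...) query could reference both logs_cluster_a and logs_cluster_b.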

ludonara · January 5, 2016, 1:12pm

Hi, and happy new year.

Indeed, that works! Thank you.

We have just one other problem, with array types and nested schemas (as mentioned here: "Spark-sql does not seem to read from a nested schema"), before Elasticsearch is fully usable as a data source for our BI application.

Best regards,
Ludovic
CEO @datarocksIO

costin · January 9, 2016, 1:25pm

This has been fixed in the latest ES Hadoop release, 2.2-rc1 as described here.
Please try it out.

ludonara · January 11, 2016, 7:42am

That works fine for arrays of primitive types.

But I get a java.lang.NullPointerException on a field that is typed as NESTED in the mapping.

Example mapping:
INFO ScalaEsRowRDD: Discovered mapping {index=[mappings=[dashboard=[doc_meta_id=STRING, id=STRING, language=STRING, sheets=NESTED, template=STRING, tenant_id=STRING, workbook_ids=STRING]]]} for [index/dashboard]

costin · January 11, 2016, 9:01am

Can you post a simple snippet in Spark which creates the DataFrame and then reads it?
JSON works just fine (and typically makes things easier).
