Unable to index documents into Elastic Cloud hosted managed service on GCP using Spark

Hi,
I am trying to insert documents into an index after reading them from GCP storage. My Elasticsearch flavour on GCP is the Elastic Cloud service hosted by Elastic, and I am using version 7.5.0.

My Gradle dependencies look like the below:

dependencies {
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.3.4'
    compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.3.4'
    compile group: 'org.elasticsearch', name: 'elasticsearch-hadoop', version: '7.5.0'
    compile group: 'org.elasticsearch', name: 'elasticsearch', version: '7.5.0'
}

In code, I am passing the username and password using the spark.es.net.http.auth.user and spark.es.net.http.auth.pass properties.
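For context, the relevant part of my driver code looks roughly like this (the hostname, password, and index name are placeholders for the real Elastic Cloud values):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

public class gcpSparktoES {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("gcpSparktoES")
                // Elastic Cloud endpoint and credentials (placeholders)
                .config("spark.es.nodes", "<deployment>.us-central1.gcp.cloud.es.io")
                .config("spark.es.port", "9243")
                .config("spark.es.net.ssl", "true")
                .config("spark.es.net.http.auth.user", "elastic")
                .config("spark.es.net.http.auth.pass", "<password>")
                .config("spark.es.nodes.wan.only", "true")
                .getOrCreate();

        // Read the CSV from GCS and index each row as a document
        Dataset<Row> df = spark.read().option("header", "true").csv(args[0]);
        JavaEsSparkSQL.saveToEs(df, "myindex");
    }
}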

With the same code and dependencies, if I just change the hostname to localhost and the port to 9200 (which differ from the Elastic Cloud service values), my code successfully inserts the documents into the index on the local Elasticsearch cluster. But when targeting the Elastic Cloud hostname and port 9243, I am getting this weird error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/elasticsearch/spark/sql/api/java/JavaEsSparkSQL
at com.cortex.spark.gcpSparktoES.main(gcpSparktoES.java:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Can you please help me with a solution for this?

For your information, the same error occurs when I submit the job on a Dataproc cluster on GCP.

It is a little urgent; can someone please suggest a solution?

How are you running the job? Are both executions being run with spark-submit, or are they on different platforms? The errors here seem to indicate that elasticsearch-hadoop is not on your job's classpath.

Additionally, I don't believe you need the base elasticsearch dependency, unless you are using something from it somewhere else in your code?
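If the connector is indeed missing from the classpath, one option (a sketch, assuming Spark 2.x on Scala 2.11, and using the main class from your stack trace) is to let spark-submit resolve the connector via --packages instead of shipping the jar yourself:

spark-submit --class com.cortex.spark.gcpSparktoES --packages org.elasticsearch:elasticsearch-spark-20_2.11:7.5.0 --master yarn gs://mybucket/gcpSparktoES.jar gs://mybucket/data.csv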

Thanks @james.baiera for the reply.
Not both executions are run with spark-submit. Locally, I am able to execute the Spark job in Eclipse, and on GCP I am submitting the job to Dataproc through the spark-submit command, including elasticsearch-hadoop in the command as below:

spark-submit --class com.spark.gcpSparktoES --jars gs://mybucket/esjars/*.jar --driver-class-path gs://mybucket/esjars/*.jar --master yarn --executor-memory 2G --total-executor-cores 100 gs://mybucket/gcpSparktoES.jar gs://mybucket/data.csv

I have removed the base elasticsearch dependency. Now, after running the above spark-submit command, I am getting the below error:

Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

However, I have already set the 'es.nodes.wan.only' flag to true.
Please help.

Kindly reply.

That message indicates that ES-Hadoop cannot reach the cluster over the network. This may be due to how networking is configured on your cloud deployment. Enabling TRACE logging on the org.elasticsearch.hadoop.rest.commonshttp package should print out the HTTP requests the connector is sending and the responses it is getting back.
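For example, with the default log4j 1.x setup that Spark 2.x uses, adding this line to the log4j.properties picked up by the driver and executors should do it:

# Trace the HTTP traffic between the connector and Elasticsearch
log4j.logger.org.elasticsearch.hadoop.rest.commonshttp=TRACE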
