Unable to index documents into Elastic Cloud hosted managed service on GCP using Spark

Hi,
I am trying to insert documents into an index after reading them from GCP storage. My Elasticsearch flavour on GCP is the Elastic Cloud service hosted by Elastic, and I am using version 7.5.0.

My Gradle dependencies look like the below:

dependencies {
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.3.4'
    compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.3.4'
    compile group: 'org.elasticsearch', name: 'elasticsearch-hadoop', version: '7.5.0'
    compile group: 'org.elasticsearch', name: 'elasticsearch', version: '7.5.0'
}

In code, I am passing the username and password using the spark.es.net.http.auth.user and spark.es.net.http.auth.pass properties.
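For context, the relevant part of my driver code looks roughly like this (the hostname, password, and index name are placeholders for the real Elastic Cloud values):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

public class gcpSparktoES {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("gcpSparktoES")
                // Elastic Cloud endpoint and credentials (placeholders)
                .config("spark.es.nodes", "<deployment>.us-central1.gcp.cloud.es.io")
                .config("spark.es.port", "9243")
                .config("spark.es.net.ssl", "true")
                .config("spark.es.net.http.auth.user", "elastic")
                .config("spark.es.net.http.auth.pass", "<password>")
                .config("spark.es.nodes.wan.only", "true")
                .getOrCreate();

        // Read the CSV from GCS and index each row as a document
        Dataset<Row> df = spark.read().option("header", "true").csv(args[0]);
        JavaEsSparkSQL.saveToEs(df, "myindex");
    }
}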

With the same code and dependencies, if I just change the hostname to localhost and the port to 9200 (which differ from the Elastic Cloud service values), my code successfully inserts the documents into the index on the local Elasticsearch cluster. But when targeting the Elastic Cloud hostname and port 9243, I am getting this weird error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/elasticsearch/spark/sql/api/java/JavaEsSparkSQL
at com.cortex.spark.gcpSparktoES.main(gcpSparktoES.java:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Can you please help me with a solution for this?

For your information, the same error occurs when I submit the job on a Dataproc cluster on GCP.

It is a little urgent; can someone please suggest a solution?

How are you running the job? Are both executions being run with spark-submit, or are they on different platforms? The errors here seem to indicate that elasticsearch-hadoop is not on your job's classpath.

Additionally, I don't believe you need the base elasticsearch dependency, unless you are using something from it somewhere else in your code?
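If the connector is indeed missing from the classpath, one option (a sketch, assuming Spark 2.x on Scala 2.11, and using the main class from your stack trace) is to let spark-submit resolve the connector via --packages instead of shipping the jar yourself:

spark-submit --class com.cortex.spark.gcpSparktoES --packages org.elasticsearch:elasticsearch-spark-20_2.11:7.5.0 --master yarn gs://mybucket/gcpSparktoES.jar gs://mybucket/data.csv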

Thanks @james.baiera for the reply.
Not both executions are run with spark-submit. Locally, I am able to execute the Spark job in Eclipse, and on GCP I am submitting the job to Dataproc through the spark-submit command, including elasticsearch-hadoop in the command as below:

spark-submit --class com.spark.gcpSparktoES --jars gs://mybucket/esjars/*.jar --driver-class-path gs://mybucket/esjars/*.jar --master yarn --executor-memory 2G --total-executor-cores 100 gs://mybucket/gcpSparktoES.jar gs://mybucket/data.csv

I have removed the base elasticsearch dependency. Now, after running the above spark-submit command, I am getting the below error:

Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

However, I have already set the 'es.nodes.wan.only' flag to true.
Please help.

Kindly reply.

That message indicates that ES-Hadoop cannot reach the cluster over the network. This may be due to how networking is configured on your cloud deployment. Enabling TRACE logging on the org.elasticsearch.hadoop.rest.commonshttp package should print out the HTTP requests the connector is sending and the responses it is getting back.
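For example, with the default log4j 1.x setup that Spark 2.x uses, adding this line to the log4j.properties picked up by the driver and executors should do it:

# Trace the HTTP traffic between the connector and Elasticsearch
log4j.logger.org.elasticsearch.hadoop.rest.commonshttp=TRACE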
