Issue Using the Connector from PySpark in 7.17.3

ito · May 6, 2022, 4:52am

I have Elasticsearch 7.17.3 running on localhost:9200, so I downloaded Elasticsearch for Apache Hadoop 7.17.3. Then I copied elasticsearch-spark-20_2.11-7.17.3.jar to my jar folder.

I read "Using the connector from PySpark" and was able to read data from Elasticsearch by adapting the first snippet with newAPIHadoopRDD as shown below.

conf = { "es.resource" : "myindex" }
es_rdd = sc.newAPIHadoopRDD(inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat", keyClass="org.apache.hadoop.io.NullWritable", valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
es_rdd.first()

But when I follow the second snippet with org.elasticsearch.spark.sql and run the code shown below, I get the following NoClassDefFoundError:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.elasticsearch.spark.sql").load("myindex")
df.printSchema()

Py4JJavaError: An error occurred while calling o88.load.
: java.lang.NoClassDefFoundError: scala/Product$class
	at org.elasticsearch.spark.sql.ElasticsearchRelation.<init>(DefaultSource.scala:221)
	at org.elasticsearch.spark.sql.DefaultSource.createRelation(DefaultSource.scala:97)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:748)

What am I doing wrong? For your information, I am using PySpark 3.2.1.

Keith_Massey · May 6, 2022, 10:18pm

This looks similar to "Elasticsearch-hadoop Scala version update to 2.12" (Issue #1949 in elastic/elasticsearch-hadoop on GitHub). We don't make it clear enough in our documentation yet, but the elasticsearch-spark jar in the big elasticsearch-hadoop artifact only works with Spark 2.x and Scala 2.11. In that jar name (elasticsearch-spark-20_2.11-7.17.3.jar), the first number (20) is code for Spark version 2.x. The second number (2.11) is the Scala version. The third number (7.17.3) is the Elasticsearch version. Unfortunately, neither Scala nor Spark tends to be compatible across versions.
So for Spark 3.2.1 you'll want elasticsearch-spark-30_2.12. You'll have to use Scala 2.12 if you are on 7.17.3. Scala 2.13 support was only added in es-hadoop 8.1.0.
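
As a rough illustration of the naming scheme described above, here is a minimal sketch that splits an elasticsearch-spark jar file name into its three parts. The regular expression and the names `EsSparkJar` and `parse_es_spark_jar` are made up for this example and are not part of es-hadoop.

```python
import re
from typing import NamedTuple


class EsSparkJar(NamedTuple):
    spark_code: str     # "20" is code for Spark 2.x, "30" for Spark 3.x
    scala_version: str  # Scala binary version, e.g. "2.11"
    es_version: str     # Elasticsearch / es-hadoop version, e.g. "7.17.3"


# Hypothetical pattern inferred from the jar name discussed in this thread.
_JAR_PATTERN = re.compile(
    r"^elasticsearch-spark-(?P<spark>\d+)_(?P<scala>\d+\.\d+)-(?P<es>\d+(?:\.\d+)*)\.jar$"
)


def parse_es_spark_jar(filename: str) -> EsSparkJar:
    """Split an elasticsearch-spark jar file name into its version parts."""
    match = _JAR_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not an elasticsearch-spark jar name: {filename!r}")
    return EsSparkJar(match["spark"], match["scala"], match["es"])


jar = parse_es_spark_jar("elasticsearch-spark-20_2.11-7.17.3.jar")
print(jar.spark_code, jar.scala_version, jar.es_version)
```

For the jar from the original question this reports Spark code 20 and Scala 2.11, which is the mismatch with Spark 3.2.1 and Scala 2.12.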

Keith_Massey · May 6, 2022, 10:36pm

This is a little messy to try to capture on this site, but here's a view of which versions of Scala and Spark work with each es-spark artifact in the latest version of es-hadoop (8.2). The main differences with 7.17 are that Scala 2.13 does not work on 7.17 and (I believe) Spark 3.2.x is not supported on 7.17.

es-hadoop 8.2: es-spark artifact for each Spark version (rows) and Scala version (columns)
Spark Version	Scala 2.10	Scala 2.11	Scala 2.12	Scala 2.13	Scala 3.0
1.3-1.6	elasticsearch-spark-13_2.10	elasticsearch-spark-13_2.11	X	X	X
2.x	elasticsearch-spark-20_2.10	elasticsearch-spark-20_2.11	elasticsearch-spark-20_2.12	X	X
3.0-3.1	X	X	elasticsearch-spark-30_2.12	X	X
3.2	X	X	elasticsearch-spark-30_2.12	elasticsearch-spark-30_2.13	X

ito · May 7, 2022, 12:18am

Thanks @Keith_Massey.

elasticsearch-spark-30_2.12 works with PySpark 3.2.1 and Elasticsearch 7.17.3.

I am new to Spark, and it did not occur to me that I had to check the version of Scala as I only use Python and PySpark. I should have paid attention to the version compatibility matrix on the Elasticsearch for Apache Hadoop 7.17 installation page that shows the following:

Spark Version	Scala Version	ES-Hadoop Artifact ID
2.0+	2.11	`elasticsearch-spark-20_2.11`
2.0+	2.12	`elasticsearch-spark-20_2.12`
3.0+	2.12	`elasticsearch-spark-30_2.12`
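
The matrix above could be turned into a small lookup, sketched here under some assumptions. The function name `es_spark_coordinate` is hypothetical, the Scala version string in the example is only an illustrative value, and the `org.elasticsearch` group ID is the one these artifacts are published under on Maven Central as far as I know.

```python
# Artifact IDs copied from the 7.17 matrix above, keyed by
# (Spark major version, Scala binary version).
ARTIFACTS_7_17 = {
    (2, "2.11"): "elasticsearch-spark-20_2.11",
    (2, "2.12"): "elasticsearch-spark-20_2.12",
    (3, "2.12"): "elasticsearch-spark-30_2.12",
}


def es_spark_coordinate(spark_version: str, scala_version: str, es_version: str) -> str:
    """Return a Maven-style coordinate for the matching es-spark artifact."""
    spark_major = int(spark_version.split(".")[0])
    # Keep only major.minor, e.g. "2.12.15" -> "2.12".
    scala_binary = ".".join(scala_version.split(".")[:2])
    try:
        artifact = ARTIFACTS_7_17[(spark_major, scala_binary)]
    except KeyError:
        raise ValueError(
            f"no 7.17 artifact listed for Spark {spark_version} with Scala {scala_version}"
        ) from None
    return f"org.elasticsearch:{artifact}:{es_version}"


coordinate = es_spark_coordinate("3.2.1", "2.12.15", "7.17.3")
print(coordinate)  # org.elasticsearch:elasticsearch-spark-30_2.12:7.17.3
```

A coordinate in this form is what spark-submit's `--packages` option and the `spark.jars.packages` setting expect, so it may be an alternative to copying the jar by hand.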

It now makes sense that elasticsearch-spark-20_2.11-7.17.3.jar in the downloaded zip file does not work.
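
For anyone else who only uses Python and is unsure which Scala version their Spark build uses, one possible check is to look at the `scala-library` jar that ships in Spark's jars directory. The sketch below is only an illustration: the function name `scala_binary_version` is made up, and the file names in the example list are sample values rather than a real directory listing.

```python
import re
from typing import Iterable, Optional

# Matches names such as "scala-library-2.12.15.jar" and captures "2.12".
_SCALA_LIBRARY = re.compile(r"^scala-library-(\d+\.\d+)\.\d+.*\.jar$")


def scala_binary_version(jar_names: Iterable[str]) -> Optional[str]:
    """Return the Scala binary version found among the jar names, if any."""
    for name in jar_names:
        match = _SCALA_LIBRARY.match(name)
        if match:
            return match.group(1)
    return None


# Illustrative file names only, not an actual listing.
example_jars = [
    "py4j-0.10.9.3.jar",
    "scala-library-2.12.15.jar",
    "spark-sql_2.12-3.2.1.jar",
]
print(scala_binary_version(example_jars))
```

On a real installation the list could come from something like `os.listdir` on the `jars` directory under `SPARK_HOME`, assuming that variable is set. The exact location depends on how Spark or PySpark was installed.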

