Issue Using the Connector from PySpark in 7.17.3

I have Elasticsearch 7.17.3 running on localhost:9200, so I downloaded Elasticsearch for Apache Hadoop 7.17.3. Then I copied elasticsearch-spark-20_2.11-7.17.3.jar to my jars folder.

I read "Using the connector from PySpark" and was able to read data from Elasticsearch by adapting the first snippet with newAPIHadoopRDD as shown below.

conf = {"es.resource": "myindex"}
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf)
es_rdd.first()

But when I follow the second snippet with org.elasticsearch.spark.sql and run the code shown below, I get a NoClassDefFoundError:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.elasticsearch.spark.sql").load("myindex")
df.printSchema()
Py4JJavaError: An error occurred while calling o88.load.
: java.lang.NoClassDefFoundError: scala/Product$class
	at org.elasticsearch.spark.sql.ElasticsearchRelation.<init>(DefaultSource.scala:221)
	at org.elasticsearch.spark.sql.DefaultSource.createRelation(DefaultSource.scala:97)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:748)

What am I doing wrong? For your information, I am using PySpark 3.2.1.

This looks similar to Elasticsearch-hadoop Scala version update to 2.12 · Issue #1949 · elastic/elasticsearch-hadoop · GitHub. We don't make it clear enough in our documentation yet, but the elasticsearch-spark jar bundled in the big elasticsearch-hadoop artifact only works with Spark 2.x on Scala 2.11. In that jar name (elasticsearch-spark-20_2.11-7.17.3.jar), the first number (20) is code for Spark version 2.x, the second number (2.11) is the Scala version, and the third number (7.17.3) is the Elasticsearch version. Unfortunately, neither Scala nor Spark tends to be compatible across versions.
So for Spark 3.2.1 you'll want elasticsearch-spark-30_2.12. You'll have to use Scala 2.12 if you are on 7.17.3; Scala 2.13 support was only added in es-hadoop 8.1.0.
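
For example, something along these lines should work with PySpark 3.2.1 against your local cluster. This is just a sketch: the index name and localhost:9200 come from your post, and you can swap spark.jars.packages for --jars pointing at a local copy of elasticsearch-spark-30_2.12-7.17.3.jar if you'd rather not pull it from Maven Central.

from pyspark.sql import SparkSession

# Pull the Spark 3.x / Scala 2.12 connector instead of the bundled 2.11 jar.
spark = (
    SparkSession.builder
    .appName("es-spark-30-demo")
    .config("spark.jars.packages",
            "org.elasticsearch:elasticsearch-spark-30_2.12:7.17.3")
    .getOrCreate()
)

# Connection details go in as read options; "myindex" is the index from your post.
df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .load("myindex")
)
df.printSchema()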


This is a little messy to try to capture on this site, but here's a view of which versions of Scala and Spark work with each es-spark artifact in the latest version of es-hadoop (8.2). The main differences from 7.17 are that Scala 2.13 does not work on 7.17 and (I believe) Spark 3.2.x is not supported on 7.17.

es-hadoop 8.2   Scala 2.10                   Scala 2.11                   Scala 2.12                   Scala 2.13                   Scala 3.0
Spark 1.3-1.6   elasticsearch-spark-13_2.10  elasticsearch-spark-13_2.11  X                            X                            X
Spark 2.x       elasticsearch-spark-20_2.10  elasticsearch-spark-20_2.11  elasticsearch-spark-20_2.12  X                            X
Spark 3.0-3.1   X                            X                            elasticsearch-spark-30_2.12  X                            X
Spark 3.2       X                            X                            elasticsearch-spark-30_2.12  elasticsearch-spark-30_2.13  X

Thanks @Keith_Massey.

elasticsearch-spark-30_2.12 works with PySpark 3.2.1 and Elasticsearch 7.17.3.

I am new to Spark, and it did not occur to me that I had to check the Scala version since I only use Python and PySpark. I should have paid attention to the version compatibility matrix on the Elasticsearch for Apache Hadoop 7.17 installation page, which shows the following:

Spark Version   Scala Version   ES-Hadoop Artifact ID
2.0+            2.11            elasticsearch-spark-20_2.11
2.0+            2.12            elasticsearch-spark-20_2.12
3.0+            2.12            elasticsearch-spark-30_2.12

It now makes sense that elasticsearch-spark-20_2.11-7.17.3.jar in the downloaded zip file does not work.
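
In case it helps anyone else who only uses Python: here is a quick sketch of how I could have checked which Scala version my Spark installation was built with before picking a connector jar. It goes through the py4j gateway, so treat it as a convenience trick rather than a public API; running spark-submit --version from a shell prints the same information.

# Ask the JVM which Scala version Spark was built against, e.g. "version 2.12.15".
# Note: sc._jvm is a py4j internal, so this is a convenience trick, not a public API.
sc = spark.sparkContext
print(sc._jvm.scala.util.Properties.versionString())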
