Python, Elasticsearch, and Apache Spark: simple data reading

I'm trying to read data from Elasticsearch using Apache Spark, the Hadoop connector, and Python, and it's giving me a real headache.

The Python code is:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Run locally and point the connector at the Elasticsearch node
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("New APP")
conf.set("es.nodes", "192.168.0.3:9200")
sc = SparkContext(conf=conf)

# Read the index into a DataFrame through the elasticsearch-hadoop data source
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.elasticsearch.spark.sql") \
    .option("es.resource", "my.index-name-2022") \
    .load()

The launch command is:

.\spark-submit --master local --jars ..\mjars\elasticsearch-hadoop-8.2.2.jar  'E:\ts_rfd\rfd_pych\spark_test\main.py'

But I get:

Traceback (most recent call last):
  File "E:\ts_rfd\rfd_pych\spark_test\main.py", line 21, in <module>
    df = sqlContext.read.format("org.elasticsearch.spark.sql").option("es.resource", "my.index-name-2022").load()
  File "E:\rfd_user\spark-3.3.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 210, in load
  File "E:\rfd_user\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1304, in __call__
  File "E:\rfd_user\spark-3.3.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\utils.py", line 111, in deco
  File "E:\rfd_user\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.NoClassDefFoundError: scala/Product$class
        at org.elasticsearch.spark.sql.ElasticsearchRelation.<init>(DefaultSource.scala:228)
        at org.elasticsearch.spark.sql.DefaultSource.createRelation(DefaultSource.scala:97)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 19 more

My current setup is:

Apache Spark: 3.3.0-hadoop3
Elasticsearch: 8.2
Python: 3.9
Java: 1.8.0_333
Scala: 2.13.8
Elasticsearch Hadoop: 8.2.2
OS: Windows 10

This is driving me crazy; I have already tried swapping in many different versions of these dependencies.

Many thanks!

Hi @Silver137. I think the problem is a Scala version compatibility issue. Unfortunately, neither Spark nor Scala artifacts are generally compatible across versions. The connector classes that ship in the big Hadoop jar (elasticsearch-hadoop-8.2.2.jar) are built for Spark 2 / Scala 2.11. Since you are using Scala 2.13 and Spark 3.3, you want the elasticsearch-spark-30_2.13 artifact (available via Maven Central Repository Search). You can read a little more about this at Issue Using the Connector from PySpark in 7.17.3 - #3 by Keith_Massey.
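
For example, a launch command along these lines should pick up the right connector build (a sketch, assuming the Maven coordinates org.elasticsearch:elasticsearch-spark-30_2.13:8.2.2 and that the _2.13 suffix matches the Scala version your Spark distribution was actually built with):

.\spark-submit --master local --packages org.elasticsearch:elasticsearch-spark-30_2.13:8.2.2 'E:\ts_rfd\rfd_pych\spark_test\main.py'

The --packages flag resolves the jar and its dependencies from Maven Central at submit time; if you prefer to keep using --jars, download the matching elasticsearch-spark-30_2.13 jar and point --jars at that file instead of the elasticsearch-hadoop uber jar.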
