I'm trying to read data from Elasticsearch using Apache Spark, the elasticsearch-hadoop connector, and Python, and it is giving me a real headache.
The Python code is:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Spark configuration: local master plus the Elasticsearch node to read from
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("New APP")
conf.set("es.nodes", "192.168.0.3:9200")

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the index through the elasticsearch-hadoop connector
df = sqlContext.read.format("org.elasticsearch.spark.sql").option("es.resource", "my.index-name-2022").load()
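For reference, the same read expressed through the SparkSession API would look like this (just a sketch, reusing the node address and index name from the snippet above):

from pyspark.sql import SparkSession

# Same configuration, but built through SparkSession instead of SQLContext
spark = (
    SparkSession.builder
    .master("local")
    .appName("New APP")
    .config("es.nodes", "192.168.0.3:9200")
    .getOrCreate()
)

# Same read through the elasticsearch-hadoop connector
df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.resource", "my.index-name-2022")
    .load()
)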
The launch command is:
.\spark-submit --master local --jars ..\mjars\elasticsearch-hadoop-8.2.2.jar 'E:\ts_rfd\rfd_pych\spark_test\main.py'
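As far as I understand, the connector could also be pulled from Maven with --packages instead of pointing at a local jar; the coordinates below are my assumption and not something I have verified:

.\spark-submit --master local --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.2.2 'E:\ts_rfd\rfd_pych\spark_test\main.py'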
But I get:
Traceback (most recent call last):
File "E:\ts_rfd\rfd_pych\spark_test\main.py", line 21, in <module>
df = sqlContext.read.format("org.elasticsearch.spark.sql").option("es.resource", "my.index-name-2022").load()
File "E:\rfd_user\spark-3.3.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 210, in load
File "E:\rfd_user\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1304, in __call__
File "E:\rfd_user\spark-3.3.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\utils.py", line 111, in deco
File "E:\rfd_user\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at org.elasticsearch.spark.sql.ElasticsearchRelation.<init>(DefaultSource.scala:228)
at org.elasticsearch.spark.sql.DefaultSource.createRelation(DefaultSource.scala:97)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 19 more
My current setup is:
Apache Spark: 3.3.0-hadoop3
Elasticsearch: 8.2
Python: 3.9
Java: 1.8.0_333
Scala: 2.13.8
elasticsearch-hadoop: 8.2.2
OS: Windows 10
I have already tried swapping many versions of these components, with no luck.
Many Thanks