Our team uses elasticsearch-spark and we are currently in the process of upgrading our Spark to version 2.0.0. We get the following errors:
[error] bad symbolic reference. A signature in Column.class refers to type Logging
[error] in package org.apache.spark which is not available.
[error] It may be completely missing from the current classpath, or the version on
[error] the classpath might be incompatible with the version used when compiling Column.class.
[error] bad symbolic reference. A signature in SQLContext.class refers to type Logging
[error] in package org.apache.spark which is not available.
[error] It may be completely missing from the current classpath, or the version on
[error] the classpath might be incompatible with the version used when compiling SQLContext.class.
[error] bad symbolic reference. A signature in DataFrameReader.class refers to type Logging
[error] in package org.apache.spark which is not available.
[error] It may be completely missing from the current classpath, or the version on
[error] the classpath might be incompatible with the version used when compiling DataFrameReader.class
We believe this is due to this change in Spark 2.0.0:
The following features have been removed in Spark 2.0:
Bagel
Support for Hadoop 2.1 and earlier
The ability to configure closure serializer
HTTPBroadcast
TTL-based metadata cleaning
Semi-private class org.apache.spark.Logging. We suggest you use slf4j directly.
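For what it's worth, that last suggestion just means replacing the removed Logging trait with a plain slf4j logger in code you control; a minimal sketch (the class name is purely illustrative):

```scala
import org.slf4j.LoggerFactory

// Where a class previously mixed in the removed org.apache.spark.Logging trait,
// an slf4j logger can be obtained directly instead.
class MyJob {
  private val log = LoggerFactory.getLogger(classOf[MyJob])

  def run(): Unit = {
    log.info("Starting job")
    // ... job logic ...
  }
}
```

That doesn't help in our case, though, since the reference to org.apache.spark.Logging comes from the compiled elasticsearch-spark classes themselves.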
Is this a known issue? If so, are there any plans to fix it?
Since Spark 2.0 is not backwards compatible with 1.6, and it is the new default Spark version for ES-Hadoop, we decided to align that development with the 5.0 release. We felt that breaking binary compatibility should only happen alongside a major version increase.
While I strongly advise against using the beta release for a production deployment, I do suggest that you perform your testing against the beta to ensure a successful rollout when 5.0 eventually lands.
How is Spark not backwards compatible? The Spark 1.6 to 2.0 upgrade is fairly simple as you can see from the upgrading docs.
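For example, most of the code changes in that upgrade amount to swapping SQLContext for the new SparkSession entry point; a rough sketch (app name and path are just illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.0: SparkSession replaces SQLContext/HiveContext as the entry point.
val spark = SparkSession.builder()
  .appName("etl-job")                          // illustrative app name
  .getOrCreate()

// Reads look much the same as with the old sqlContext.read:
val df = spark.read.parquet("/data/events")    // illustrative path
df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()
```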
Right now I'm finishing up migrating our ETL jobs to Spark 2.0 and next I'll be updating the elasticsearch publishers that use es-hadoop. We have a 4TB elasticsearch cluster that we will not be upgrading to 5.0 this year.
What are my options for using Spark 2.0 with Elasticsearch 2.4.0?
The biggest source of binary incompatibility between Spark 1.3-1.6 and 2.0 is the removal of DataFrame as its own class and the addition of Dataset. DataFrame continues to exist for users, but only as a type alias for Dataset[Row]. This allows user-written code to keep working with just a simple recompilation.
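In other words, Spark 2.0 defines `type DataFrame = Dataset[Row]`, so source code written against the 1.x DataFrame API still compiles after a rebuild; a small sketch (the column name is purely illustrative):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// Code written against the 1.x DataFrame API, like this transformation,
// keeps compiling on 2.0 because DataFrame is now just an alias:
//   type DataFrame = Dataset[Row]
def onlyErrors(df: DataFrame): DataFrame =
  df.filter(df("level") === "ERROR")   // "level" column is illustrative

// On 2.0 the same value is a Dataset[Row] outright, no conversion needed:
def asDataset(df: DataFrame): Dataset[Row] = df
```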
However, since ES-Hadoop is built by extending native Spark interfaces and must be distributed to other users, its compiled classes cannot reconcile the differences between 1.3-1.6 and 2.0 at runtime. Thus, in 5.0, we ship two separate distributions, one for each Spark line.
It is also our policy to keep the default versions used in the main ES-Hadoop jar in lock step with the latest version of the technology that we support. Because of this, support for Spark 1.3-1.6 has been added as a separate artifact for backwards compatibility in 5.0.
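Concretely, picking the right flavor in 5.0 should just be a matter of which artifact you depend on; a hedged sbt sketch (the coordinates and version shown are illustrative, so check the exact artifact names against the ES-Hadoop 5.0 documentation):

```scala
// build.sbt -- coordinates and version here are illustrative; verify the exact
// artifact names against the ES-Hadoop 5.0 documentation.

// Spark 2.0 support (the new default):
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "5.0.0-beta1"

// Spark 1.3-1.6 support ships as a separate backwards-compatibility artifact:
// libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-13" % "5.0.0-beta1"
```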
When ES-Hadoop v5.0 is released, it will support both Spark 2.0 and Elasticsearch 2.4.0. I am sorry to say that earlier versions of ES-Hadoop will not gain Spark 2.0 support, because of how disruptive the version change is to the rest of the project. We have decided to reserve this breaking change for a major version release, in keeping with semantic versioning principles.