I am running Spark on Google Cloud and need to get data from an Elasticsearch database. I am using the following code to connect to Elasticsearch:
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public TransportClient openConnection(String ipAddress, int ipPort) throws UnknownHostException {
    // The transport client joins the cluster named here ("elasticsearch" is the default)
    Settings settings = Settings.settingsBuilder().put("cluster.name", "elasticsearch").build();
    TransportClient client = TransportClient.builder().settings(settings).build()
            .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(ipAddress), ipPort));
    return client;
}
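For context, the client is used roughly like this from the driver (a simplified sketch; the index name and host are placeholders, not my real values):

// Hypothetical usage sketch: host, port, and index name are placeholders
TransportClient client = openConnection("10.0.0.5", 9300);
SearchResponse response = client.prepareSearch("my-index")
        .setQuery(QueryBuilders.matchAllQuery())
        .setSize(10)
        .execute()
        .actionGet();
for (SearchHit hit : response.getHits().getHits()) {
    System.out.println(hit.getSourceAsString());
}
client.close();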
When I run it locally, i.e. with spark-submit --master local[*], everything runs fine. When I run it on a Spark cluster on Google Cloud I get the following exception:
java.lang.NoClassDefFoundError: Could not initialize class org.elasticsearch.threadpool.ThreadPool
    at org.elasticsearch.client.transport.TransportClient$Builder.build(TransportClient.java:133)
    at javaTools.ElasticSearchConnection.openConnection(ElasticSearchConnection.java:24)
The last method in the trace (openConnection) is the connection method shown above.
The code is uploaded to Google Cloud as a fat jar created with sbt assembly, so all the libraries it uses are bundled in, except for the standard Java ones.
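For reference, the assembly setup is along these lines (an illustrative sketch, not my exact build.sbt; the Elasticsearch version shown is an assumption):

// Illustrative build.sbt sketch: assumes Elasticsearch 2.x and Spark 1.6.0
// provided by the cluster rather than bundled into the fat jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0" % "provided",
  "org.elasticsearch" % "elasticsearch" % "2.2.0"
)

// Merge strategy for duplicate entries when building the fat jar;
// service files are concatenated because Lucene discovers codecs via SPI
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}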
I suspect a library dependency conflict, since the same jar runs fine on my local computer and connects to the Elasticsearch server, but fails on the Spark cluster on Google Cloud. Both the local and cloud versions of Spark are the same, 1.6.0.
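To narrow down whether the cluster is picking up a conflicting copy of some class, I can run a snippet like this on the cluster (a hypothetical diagnostic, not part of my actual code; I include a Guava class because Hadoop ships its own Guava, which I understand is a common source of exactly this kind of conflict):

// Hypothetical diagnostic: print which jar each suspect class is loaded from.
// Output that differs between local and cluster runs would point to a classpath conflict.
for (String name : new String[] {
        "org.elasticsearch.threadpool.ThreadPool",
        "com.google.common.util.concurrent.MoreExecutors" }) {
    try {
        Class<?> cls = Class.forName(name);
        System.out.println(name + " -> "
                + cls.getProtectionDomain().getCodeSource().getLocation());
    } catch (Throwable t) {
        System.out.println(name + " failed to load: " + t);
    }
}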