Hello, I am trying to read data from Elasticsearch. My code is as follows:
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

SparkConf sparkConf = new SparkConf()
        .setAppName("Spark ES Integration").setMaster("local");
// .set("spark.ui.port", "7077");
sparkConf.set("es.nodes", "xx.xx.xx.xx");
sparkConf.set("es.port", "9200");
sparkConf.set("es.resource", "blog/post");
sparkConf.set("es.query", "?q=user:dilbert");

JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc);
System.out.println("**********" + esRDD.count()); // Prints 1 - only one record is present
System.out.println("**********" + esRDD.first()); // Throws the exception below
The program outputs the count correctly but throws an exception when the first record is fetched from the RDD.
Why is the request of type POST when I just want to read data?
What is wrong here? The query specified in the configuration executes correctly.
The exception is:
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [POST] on [blog/post/_search?search_type=scan&scroll=5m&size=50&preference=_shards:0;_only_node:03oYNb7BTjG2vOzo9lSnzQ] failed; server[null] returned [400|Bad Request:]
.
.
3091 [Executor task launch worker-0] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.util.TaskCompletionListenerException: [POST] on [blog/post/_search?search_type=scan&scroll=5m&size=50&preference=_shards:0;_only_node:03oYNb7BTjG2vOzo9lSnzQ] failed; server[null] returned [400|Bad Request:]
.
.
3099 [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.util.TaskCompletionListenerException: [POST] on [blog/post/_search?search_type=scan&scroll=5m&size=50&preference=_shards:0;_only_node:03oYNb7BTjG2vOzo9lSnzQ] failed; server[null] returned [400|Bad Request:]
What version of Elasticsearch and ES-Spark are you using?
POST is used instead of GET to get around some encoding issues with large URIs (basically the search request is passed in the body instead of the URI).
Likely there's something else in the request that causes the 400.
You could enable logging on the REST package to see what requests are made and what causes the exception.
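For reference, with the log4j 1.x setup most Spark deployments use, something like the following in log4j.properties should surface the REST traffic. The package name is taken from the stack trace above (org.elasticsearch.hadoop.rest); adjust if your deployment uses a different logging framework:

```properties
# Log the HTTP requests/responses made by the ES-Hadoop REST layer
log4j.logger.org.elasticsearch.hadoop.rest=TRACE
```

At TRACE level the connector logs the full request bodies, which should show exactly what is being POSTed to _search.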
Hello costin, I managed to resolve my issue. The basic problem was that I was executing my program on a 32-bit Windows machine with a 64-bit winutils.exe. I migrated my code to a Linux machine and now it works fine. Thanks for your help!
This is weird. I do development and testing on Windows and things run fine, and CI builds on Linux - there should be no difference between the two operating systems.
Something else is at hand. Either way, I'm glad to see things are working out.
I am having similar issues with the ES-Spark connector.
When I connect to a remote single-machine ES cluster, I can read the schema from ES.
Example:
val spark14DF = sqlContext.read.format("org.elasticsearch.spark.sql").options(optionsDev).load("blog/post")
But when I call spark14DF.first()
I get the following error:
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: IndexMissingException[[blog] missing]
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:352)
.....
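For anyone debugging an IndexMissingException like this, a minimal sketch of the same read with the connection options written out explicitly (the host value is a placeholder, not the actual contents of optionsDev, which was not shown; sqlContext is assumed from the snippet above). The es.index.read.missing.as.empty setting is the connector's documented switch for treating a missing index as an empty result instead of an error, which helps distinguish "wrong cluster/index" from other failures:

```scala
// Placeholder connection settings -- substitute your cluster's address.
val optionsDev = Map(
  "es.nodes" -> "xx.xx.xx.xx",  // must point at the cluster that actually holds the "blog" index
  "es.port"  -> "9200",
  // Optional: return an empty DataFrame instead of failing if the index is missing
  "es.index.read.missing.as.empty" -> "true"
)

val spark14DF = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .options(optionsDev)
  .load("blog/post")
```

If the read comes back empty with this setting enabled, the options are pointing at a cluster (or node) that does not have the blog index, which would explain the IndexMissingException at fetch time.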