EsHadoopInvalidRequest

When we run:

{
  "from" : 0,
  "size" : 2147483647,
  "query" : {
    "bool" : {
      "should" : {
        "match" : {
          "_all" : {
            "query" : "News",
            "type" : "boolean",
            "analyzer" : "english"
          }
        }
      }
    }
  },
  "post_filter" : {
    "and" : {
      "filters" : [ {
        "terms" : {
          "streamId" : [ 2, 4, 5, 16, 19, 25, 26, 57 ],
          "execution" : "bool"
        }
      }, {
        "term" : {
          "_type" : "Document"
        }
      } ]
    }
  },
  "highlight" : {
    "pre_tags" : [ "<es_fts>" ],
    "post_tags" : [ "</es_fts>" ],
    "fragment_size" : 0,
    "number_of_fragments" : 0,
    "fields" : {
      "Document.Body" : { },
      "Document.OriginalUrl" : { },
      "Document.Title" : { },
      "Document.Url" : { },
  ...........
      "TwitterUser.Location" : { },
      "TwitterUser.ScreenName" : { },
      "TwitterUser.UserId" : { },
      "YouTubeVideo.Description" : { },
      "YouTubeVideo.Url" : { },
      "YouTubeVideo.Username" : { },
      "YouTubeVideo.VideoId" : { }
    }
  }
}

on EsRDD, we get:

[WARN ] [2015-08-04 17:42:12.377] o.e.h.r.RestRepository          : Read resource [fts*/Document] includes multiple indices or/and aliases; to avoid duplicate results (caused by shard overlapping), parallelism is reduced from 160 to 5
[INFO ] [2015-08-04 17:42:12.379] o.e.h.u.Version                 : Elasticsearch Hadoop v2.1.0.BUILD-SNAPSHOT [51847921a7]
[INFO ] [2015-08-04 17:42:12.379] o.e.s.r.ScalaEsRDD              : Reading from [fts*/Document]
[INFO ] [2015-08-04 17:42:12.450] o.e.s.r.ScalaEsRDD              : Discovered mapping {fts-swedish-20150728164001=[mappings=[Document=[Document.Body=STRING, Document.OriginalUrl=STRING, Document.Title=STRING, Document.Url=STRING, DocumentMetadata.FBAdmins=STRING, DocumentMetadata.FBAppID=STRING, DocumentMetadata.MetaAbstract=STRING, DocumentMetadata.MetaAppName=STRING, DocumentMetadata.MetaAuthor=STRING, ...YouTubeVideo.Url=STRING, YouTubeVideo.Username=STRING, YouTubeVideo.VideoId=STRING, _analyzer=STRING, language=STRING, streamId=LONG]]]} for [fts*/Document]
[WARN ] [2015-08-04 17:42:23.225] o.a.s.s.TaskSetManager          : Lost task 0.0 in stage 2785.0 (TID 3409, om-inv): org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [POST] on [_search/scroll?scroll=5m] failed; server[null] returned [404|Not Found:]
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:335)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:312)
        at org.elasticsearch.hadoop.rest.RestClient.scroll(RestClient.java:355)
        at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:401)
        at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76)
        at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:105)
        at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

What does this error mean and how can we address it?
Thanks.

It looks like the endpoint was not found (404). Can you:

  • remove "from" and "size" from your query? These are handled automatically by the connector (see the sketch after this list)
  • use the 2.1.0 GA release instead of the snapshot
  • if the error persists, turn logging for org.elasticsearch.hadoop.rest up to TRACE and post the resulting log as a gist?
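
For reference, a minimal Scala sketch of the first and third suggestions: the query from above trimmed of "from" and "size" (the post_filter and highlight sections are also omitted here for brevity), passed to esRDD, plus raising the connector's REST logging to TRACE. The app name and cluster address are placeholders; adjust them to your setup.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds esRDD to SparkContext

// Capture a TRACE log of the connector's REST traffic (Log4j 1.x, as bundled with Spark).
Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)

// Placeholder settings; point es.nodes at your cluster.
val sc = new SparkContext(
  new SparkConf().setAppName("es-read").set("es.nodes", "localhost:9200"))

// Same search as above, but without "from" and "size" --
// the connector pages through the results itself via the scroll API.
val query = """{
  "query": {
    "bool": {
      "should": {
        "match": { "_all": { "query": "News", "analyzer": "english" } }
      }
    }
  }
}"""

val rdd = sc.esRDD("fts*/Document", query)
println(rdd.count())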

Thanks,

Thank you for your answer.
I will try this advice.

Were you able to fix the issue?

I am running ES 1.7 with elasticsearch-hadoop 2.1.2 and am having the same problem with a large data set. The same job runs fine with a small data set.
The job starts fine and processes a few hundred records, but later fails with the error below:

15/11/08 23:58:15 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 2.0 (TID 6)
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [POST] on [_search/scroll?scroll=5m] failed; server[100.240.96.77:9200] returned [404|Not Found:]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:368)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:331)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:321)
at org.elasticsearch.hadoop.rest.RestClient.scroll(RestClient.java:386)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:406)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:86)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:203)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/11/08 23:58:16 WARN org.apache.spark.ThrowableSerializationWrapper: Task exception could not be deserialized
java.lang.ClassNotFoundException: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:163)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at ...
15/11/08 23:58:16 ERROR org.apache.spark.scheduler.TaskResultGetter: Could not deserialize TaskEndReason: ClassNotFound with classloader org.apache.spark.util.MutableURLClassLoader@587

We upgraded as suggested and did not see the issue any more.

@sharad_khandelwal There is likely an issue or bug that appears while processing your records; however, the root exception is hidden in the stack trace. Because Spark closes the ClassLoader when an error occurs, the initial exception gets obfuscated.
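
If you need to see the original failure, one option is to catch and log it on the executor, where the class is still loadable, before it crosses the serialization boundary back to the driver. A minimal sketch, assuming your records go through a map-style transformation (logFailures is a hypothetical helper, not part of the connector):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: apply a transformation per partition and print the real
// stack trace in the executor log before Spark tries to ship the exception.
def logFailures[T, U: ClassTag](rdd: RDD[T])(f: T => U): RDD[U] =
  rdd.mapPartitions { iter =>
    try {
      // Force evaluation inside the try so failures are caught here;
      // fine for debugging, though it materializes the whole partition.
      iter.map(f).toVector.iterator
    } catch {
      case e: Throwable =>
        e.printStackTrace()  // lands in the executor's stderr, un-obfuscated
        throw e
    }
  }

// usage: val out = logFailures(esRdd)(myTransform)

The executor logs (not the driver output) should then contain the EsHadoopInvalidRequest with its full cause chain.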