EsHadoopInvalidRequest: An HTTP line is larger than 4096 bytes

Hi,
We're trying to use the elasticsearch-spark connector (elasticsearch-hadoop) to load data from an Elasticsearch index into Spark for processing.

We load the data using:

sparkSession
      .read
      .format("es")
      .option("pushdown", "false")
      .option("es.nodes", nodesUrl)
      .option("es.port", "9200")
      .load(indexName)

When running, we get the following error:

org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: An HTTP line is larger than 4096 bytes.
{"query":{"match_all":{}}}
    at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:505)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:463)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:445)
    at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:365)
    at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
    at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
    at org.apache.spark.RangePartitioner$$anonfun$9.apply(Partitioner.scala:263)
    at org.apache.spark.RangePartitioner$$anonfun$9.apply(Partitioner.scala:261)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

We tried to debug it and realized that the request sent to Elasticsearch passes the "_source" parameter in the URL query string rather than in the request body.

The documents in the index we're reading have 350 fields, which is probably why the resulting URL is so long.

Is there any way of overcoming this error?
Are we doing something wrong here?
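
In the meantime, one client-side mitigation that might apply, assuming not all 350 fields are needed downstream, is to restrict the read to the columns we actually use via the connector's es.read.field.include option, so the generated _source projection stays short. This is only a sketch; the field list below is a hypothetical placeholder, not our real schema. (On the server side, raising http.max_initial_line_length in elasticsearch.yml, which defaults to 4kb, is another possible workaround.)

sparkSession
      .read
      .format("es")
      .option("pushdown", "false")
      .option("es.nodes", nodesUrl)
      .option("es.port", "9200")
      // Read only the fields we need; this comma-separated list is a placeholder.
      .option("es.read.field.include", "user_id,timestamp,message")
      .load(indexName)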

Thanks

This is a known issue. There should be a fix landing soon: https://github.com/elastic/elasticsearch-hadoop/pull/1154

Thanks James!

Hi @james.baiera

I have a similar issue. How can I fix it? Thank you.

@jamesjin the fix should be applied now. Please check out the newly released 6.3.2 to pull the change in.
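
For anyone else upgrading, the dependency bump is a one-liner; a minimal sketch assuming an sbt build and the Spark 2.x artifact (adjust the artifact name to your Spark and Scala versions):

// build.sbt -- pull in the connector release that contains the fix
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "6.3.2"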
