Spark ES Read Error

We have several indices stored in our ES (7.1.1) cluster. Edit: Spark is 2.3.1, Scala 2.11.8, Hadoop is HDP 3.0.1.

When reading an index as a DataFrame with:
val reader = spark.read.format("org.elasticsearch.spark.sql").option("es.nodes", nodeList)
val x = reader.load("small_index")
x.show()

For the smaller index (about a million entries) it works fine.

For a larger index (3 billion entries) we get the error below after any action on the DataFrame. The job has over 300 GB allocated, so it should not be a Spark resource issue (?).

Any suggestions welcome

> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: invalid response
>         at org.elasticsearch.hadoop.util.Assert.isTrue(Assert.java:60)
>         at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:271)
>         at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:262)
>         at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:313)
>         at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:93)
>         at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:109)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
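For completeness, the large-index read is just the same reader pointed at the other index. A rough sketch (`large_index` stands in for the real index name):

```scala
// Same reader as above, just pointed at the big index;
// "large_index" is a placeholder for the real index name.
val big = reader.load("large_index")

// Any action (show, count, ...) triggers the scroll and fails with the exception above.
big.count()
```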

I would suggest:

Thanks for the response.

Just changing those settings (I tried the size at 1000, 20000, and 10000) had no effect.
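To be concrete, this is roughly how it was applied on our side, assuming by "size" we both mean the elasticsearch-hadoop scroll setting `es.scroll.size` (values as listed above, `large_index` again a placeholder):

```scala
// Rough sketch of the reads we tried -- only the scroll size changed between runs.
// es.scroll.size controls how many documents each scroll request returns.
val tuned = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", nodeList)
  .option("es.scroll.size", "1000")   // also tried "20000" and "10000"
  .load("large_index")

tuned.show()   // still fails with the same EsHadoopIllegalArgumentException
```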

Is there any logging I can change with user permissions? Getting the admins to change Hadoop config files requires some time and justification.
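For example, is something like this enough to get useful detail, or does it have to go through the cluster's log4j files? This is only a sketch using the log4j 1.x API that Spark 2.3 ships with, and I am assuming `org.elasticsearch.hadoop.rest` is the right logger name for the connector's REST layer:

```scala
import org.apache.log4j.{Level, Logger}

// Turn up elasticsearch-hadoop's REST logging for this session only.
// Runs with user permissions on the driver; executors would need the same
// change through their own log4j configuration (e.g. a log4j.properties
// shipped with --files), which is the part I am not sure I can do myself.
Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)
```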
