Spark ES Read Error

We have several indices stored in our ES (7.1.1) cluster. Edit: Spark is 2.3.1, Scala 2.11.8, Hadoop is HDP 3.0.1.

When reading an index as a DataFrame with:
val reader = spark.read.format("org.elasticsearch.spark.sql").option("es.nodes", nodeList)
val x = reader.load("small_index")
x.show()

For the smaller index (about a million entries) it works fine.

For a larger index (3 billion entries) we get the error below after any action on the DataFrame. The job has over 300 GB allocated, so it should not be a Spark resource issue (?).

Any suggestions welcome

> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: invalid response
>         at org.elasticsearch.hadoop.util.Assert.isTrue(Assert.java:60)
>         at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:271)
>         at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:262)
>         at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:313)
>         at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:93)
>         at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:109)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
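For completeness, the large-index read is just the same reader pointed at the other index. A rough sketch (`large_index` stands in for the real index name):

```scala
// Same reader as above, just pointed at the big index;
// "large_index" is a placeholder for the real index name.
val big = reader.load("large_index")

// Any action (show, count, ...) triggers the scroll and fails with the exception above.
big.count()
```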

I would suggest:

Thanks for the response.

Just changing those settings (I tried the size at 1000, 20000, and 10000) had no effect.
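To be concrete, this is roughly how it was applied on our side, assuming by "size" we both mean the elasticsearch-hadoop scroll setting `es.scroll.size` (values as listed above, `large_index` again a placeholder):

```scala
// Rough sketch of the reads we tried -- only the scroll size changed between runs.
// es.scroll.size controls how many documents each scroll request returns.
val tuned = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", nodeList)
  .option("es.scroll.size", "1000")   // also tried "20000" and "10000"
  .load("large_index")

tuned.show()   // still fails with the same EsHadoopIllegalArgumentException
```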

Is there any logging I can change with user permissions? Getting the admins to change Hadoop config files requires some time and justification.
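For example, is something like this enough to get useful detail, or does it have to go through the cluster's log4j files? This is only a sketch using the log4j 1.x API that Spark 2.3 ships with, and I am assuming `org.elasticsearch.hadoop.rest` is the right logger name for the connector's REST layer:

```scala
import org.apache.log4j.{Level, Logger}

// Turn up elasticsearch-hadoop's REST logging for this session only.
// Runs with user permissions on the driver; executors would need the same
// change through their own log4j configuration (e.g. a log4j.properties
// shipped with --files), which is the part I am not sure I can do myself.
Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)
```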
