We have several indexes stored in our ES (7.1.1) cluster. Edit: Spark is 2.3.1, Scala 2.11.8, Hadoop HDP 3.0.1.
When reading one as a DataFrame with

```scala
val reader = spark.read.format("org.elasticsearch.spark.sql").option("es.nodes", nodeList)
val x = reader.load("small_index")
x.show()
```
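For reference, a minimal self-contained version of what we run; the node addresses and app name here are placeholders for our environment, not the actual values:

```scala
import org.apache.spark.sql.SparkSession

// Minimal reproduction, assuming the elasticsearch-spark connector jar is on the classpath.
val spark = SparkSession.builder()
  .appName("es-read-repro") // placeholder app name
  .getOrCreate()

// Comma-separated ES node addresses; placeholders for our real cluster.
val nodeList = "es-node1:9200,es-node2:9200"

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", nodeList)
  .load("small_index")

df.show() // succeeds on the small index; the same call on the large index fails
```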
For the smaller index (about a million documents) this works fine. For the larger index (about 3 billion documents) we get the error below after any action on the DataFrame. The job has over 300 GB allocated, so it should not be a Spark resource issue (?).
Any suggestions welcome.
> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: invalid response
> at org.elasticsearch.hadoop.util.Assert.isTrue(Assert.java:60)
> at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:271)
> at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:262)
> at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:313)
> at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:93)
> at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
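In case it helps frame an answer: the failure is inside the connector's scroll reader, so the scroll-related es-hadoop options are presumably the relevant knobs. A sketch of how we could set them is below; the values and the index name are guesses for illustration, not settings we have verified against this cluster:

```scala
// Hypothetical tuning sketch -- these es-hadoop options exist, but the values
// below are illustrative guesses, not our tested configuration.
val tunedReader = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", nodeList)
  .option("es.scroll.size", "500")                     // documents fetched per scroll request
  .option("es.scroll.keepalive", "10m")                // how long ES keeps the scroll context alive
  .option("es.http.timeout", "5m")                     // HTTP timeout for requests to ES
  .option("es.input.max.docs.per.partition", "100000") // cap on documents per Spark partition

val big = tunedReader.load("large_index") // placeholder index name
```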