Unable to overwrite dataframe when documents have parent


(Brendan Kerrison) #1

I am trying to write a DataFrame to Elasticsearch from Spark using ES-Hadoop 5.5.1. The documents are written as children of pre-existing parent documents. Everything works correctly the first time the DataFrame is saved; however, attempting an overwrite fails with:

  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o153.save.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9205] returned    Bad Request(400) - routing is required for [test_content]/[scorearticle]/[0-news-trending-movavg-scorearticle-ZW0382A001S00]; Bailing out..
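
For context, the write that works the first time looks roughly like this; all of the names (the parent-id column, the index/type, host and port) are placeholders standing in for our actual job:

  # Minimal sketch of the parent/child write. `df` is the DataFrame and
  # `parent_id` is a hypothetical column holding the parent document id.
  df.write \
      .format("org.elasticsearch.spark.sql") \
      .option("es.nodes", "127.0.0.1") \
      .option("es.port", "9205") \
      .option("es.mapping.parent", "parent_id") \
      .mode("overwrite") \
      .save("test_content/scorearticle")

Only the "overwrite" mode hits the error; the initial save goes through fine.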

Looking at the ES-Hadoop code, it does not appear that the routing or parent of the documents being deleted is included when the bulk delete requests are created.
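
For illustration (this is not ES-Hadoop's internal code), the metadata of a bulk delete action for a child document has to carry the parent's routing, which is what seems to be missing:

  # What a single bulk-API delete action needs to look like for a child
  # document in ES 5.x. Without "_routing" (or "_parent") in the metadata,
  # Elasticsearch rejects it with "routing is required". The routing value
  # here is a hypothetical parent id.
  delete_action = {
      "delete": {
          "_index": "test_content",
          "_type": "scorearticle",
          "_id": "0-news-trending-movavg-scorearticle-ZW0382A001S00",
          "_routing": "some-parent-id",
      }
  }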

Is there any way to get around this, or do we need to look at deleting the records ourselves prior to writing?
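
If it helps, this is the kind of manual cleanup we would be looking at (a sketch using the elasticsearch-py client; index, type, and host names are placeholders). Delete-by-query finds the children via search and deletes each hit using its stored routing, so it sidesteps the problem above; the DataFrame would then be written with mode "append" instead of "overwrite":

  # Possible workaround sketch: clear the existing child documents
  # ourselves, then re-run the write above with .mode("append").
  from elasticsearch import Elasticsearch

  es = Elasticsearch(["127.0.0.1:9205"])

  # Each deleted hit carries its own _routing, so no explicit routing
  # parameter is needed on the request itself.
  es.delete_by_query(index="test_content", doc_type="scorearticle",
                     body={"query": {"match_all": {}}})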


(James Baiera) #2

Yeah, that would indeed be the case here. Thanks for bringing this to our attention. Could you open an issue on the GitHub project with this information?


(Brendan Kerrison) #3

Will do.


(Brendan Kerrison) #4

(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.