I am trying to write a dataframe to Elasticsearch from Spark using ES-Hadoop 5.5.1. The documents are being written as children of pre-existing parent documents. Everything works correctly the first time the DataFrame is saved, but it fails when attempting an overwrite with:
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o153.save.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9205] returned Bad Request(400) - routing is required for [test_content]/[scorearticle]/[0-news-trending-movavg-scorearticle-ZW0382A001S00]; Bailing out..
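For reference, the write itself looks roughly like this (a sketch, not my exact job; the id and parent column names are placeholders, and the node/port/index values are the ones from the error above):

```python
# Rough shape of the write that fails on the second (overwrite) run.
# "doc_id" and "parent_id" are placeholder column names.
df.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "127.0.0.1") \
    .option("es.port", "9205") \
    .option("es.mapping.id", "doc_id") \
    .option("es.mapping.parent", "parent_id") \
    .mode("overwrite") \
    .save("test_content/scorearticle")
```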
Looking at the ES-Hadoop code, it does not appear that the routing or parent of the documents being deleted is used when the bulk delete requests are created.
Is there any way to get around this? Or do we need to look at deleting the records ourselves prior to writing?
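If we do have to handle it ourselves, I am imagining something along these lines (an untested sketch, assuming a delete-by-query can clear the old children without per-document routing, then writing in append mode so ES-Hadoop never issues bulk deletes; the query and field names are placeholders):

```python
# Untested workaround sketch: clear existing child documents with a
# delete-by-query, then append instead of overwrite.
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9205"])

# Placeholder query - in practice this would target only the documents
# this job is about to rewrite.
es.delete_by_query(
    index="test_content",
    doc_type="scorearticle",
    body={"query": {"match_all": {}}},
)

df.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "127.0.0.1") \
    .option("es.port", "9205") \
    .option("es.mapping.id", "doc_id") \
    .option("es.mapping.parent", "parent_id") \
    .mode("append") \
    .save("test_content/scorearticle")
```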