I am trying to write a dataframe to Elasticsearch from Spark using ES-Hadoop 5.5.1. The documents are being written as children of pre-existing parent documents. Everything works correctly the first time the DataFrame is saved, but it fails when attempting an overwrite with:
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o153.save.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9205] returned Bad Request(400) - routing is required for [test_content]/[scorearticle]/[0-news-trending-movavg-scorearticle-ZW0382A001S00]; Bailing out..
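For reference, the write itself looks roughly like this (a sketch, not my exact job; the id and parent column names are placeholders, and the node/port/index values are the ones from the error above):

```python
# Rough shape of the write that fails on the second (overwrite) run.
# "doc_id" and "parent_id" are placeholder column names.
df.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "127.0.0.1") \
    .option("es.port", "9205") \
    .option("es.mapping.id", "doc_id") \
    .option("es.mapping.parent", "parent_id") \
    .mode("overwrite") \
    .save("test_content/scorearticle")
```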
Looking at the ES-Hadoop code, it does not appear that the routing or parent of the documents being deleted is used when the bulk delete requests are created.
Is there any way to get around this? Or do we need to look at deleting the records ourselves prior to writing?
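If we do have to handle it ourselves, I am imagining something along these lines (an untested sketch, assuming a delete-by-query can clear the old children without per-document routing, then writing in append mode so ES-Hadoop never issues bulk deletes; the query and field names are placeholders):

```python
# Untested workaround sketch: clear existing child documents with a
# delete-by-query, then append instead of overwrite.
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9205"])

# Placeholder query - in practice this would target only the documents
# this job is about to rewrite.
es.delete_by_query(
    index="test_content",
    doc_type="scorearticle",
    body={"query": {"match_all": {}}},
)

df.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "127.0.0.1") \
    .option("es.port", "9205") \
    .option("es.mapping.id", "doc_id") \
    .option("es.mapping.parent", "parent_id") \
    .mode("append") \
    .save("test_content/scorearticle")
```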