Duplicate insertions for a single document

I have a Cascading job that reads a file of JSON docs from S3 (1.5GB) and inserts them into Elasticsearch. The file contains 5888755 records, but when I do a count on the Elasticsearch index I get 6213755 (the number varies by 100-200k records each time I run the job). The job completes correctly, there are no errors in the Hadoop or Elasticsearch logs, and the ES-Hadoop counters all have the right values, so I am not clear where the duplicates are coming from; it looks like some bulk inserts are being run more than once. I have seen the same document appear up to 9 times. The strange thing is that when I split the 1.5GB file into 5 files and use one core EMR node, the count is correct every time, but if I use 2 core nodes the count is off.

ES Hadoop Counters

        "org.elasticsearch.hadoop.mr.Counter": {
           "SCROLL_TOTAL": 0,
           "BYTES_ACCEPTED": 1747584120,
           "BYTES_RETRIED": 0,
           "DOCS_RETRIED": 0,
           "BYTES_RECEIVED": 4752000,
           "DOCS_RECEIVED": 0,
           "BULK_TOTAL_TIME_MS": 2093520,
           "BULK_RETRIES": 0,
           "DOCS_ACCEPTED": 5888755,
           "NET_RETRIES": 0,
           "BULK_TOTAL": 1188,
           "NET_TOTAL_TIME_MS": 2096311,
           "SCROLL_TOTAL_TIME_MS": 0,
           "BULK_RETRIES_TOTAL_TIME_MS": 0,
           "NODE_RETRIES": 0,
           "DOCS_SENT": 5888755,
           "BYTES_SENT": 1747584120

        },

Versions
ES Version 2.2.0
ES Hadoop Version: 2.2.0
Hadoop 2.7.0

I'm assuming you are running this on Amazon's EMR (or something similar)? This normally occurs when the MR environment has speculative execution turned on. The best route around this is either to turn off speculative execution or, if that is not possible, to assign a stable ID to each record you insert (a hash digest of the record, a composition of fields, etc.).
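As a sketch of the stable-ID idea (Python here just to illustrate; the field names are made up), you could derive the document ID from a hash of the record's contents, so a re-run of the same record from a speculative task overwrites the existing document instead of creating a duplicate:

```python
import hashlib
import json

def stable_id(record: dict) -> str:
    """Derive a deterministic document ID from the record contents.

    Serializing with sorted keys makes the hash independent of field
    ordering, so the same logical record always maps to the same ID.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The same record always produces the same ID, regardless of key order,
# so a duplicate bulk insert becomes an idempotent overwrite.
doc = {"user": "alice", "amount": 42}
assert stable_id(doc) == stable_id({"amount": 42, "user": "alice"})
```

In ES-Hadoop you would then store this value in a field on each record and point `es.mapping.id` at that field, so repeated inserts of the same record become updates rather than new documents.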

For more information on speculative execution woes take a look at this section of the docs.


Hi, thanks for getting back to me. I am using EMR and I did have speculative execution turned on; I turned it off and it now works fine. Also, isn't it strange that the DOCS_ACCEPTED counter is less than the index count? Surely this is a bug, as Hadoop must have sent more than that number?

Just another follow-up for anyone who has the same problem: the MR properties specified in the link above have since been deprecated, and you should use these in newer versions of MR:
mapred.map.tasks.speculative.execution -> mapreduce.map.speculative
mapred.reduce.tasks.speculative.execution -> mapreduce.reduce.speculative
Have a look here
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
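For example, in mapred-site.xml (or the equivalent EMR configuration classification) the newer property names would look like this; a fragment, not a complete config:

```xml
<!-- Disable speculative execution for both map and reduce tasks -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
```

The same properties can also be passed per-job on the command line via `-D mapreduce.map.speculative=false -D mapreduce.reduce.speculative=false`.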