Duplicate insertions for a single document

Pat_Humphreys · October 3, 2016, 3:53pm

I have a Cascading job that is reading a file of Json docs from S3 (1.5GB) and inserting it into Elasticsearch. I have 5888755 records in the file, but when I do a count in the elasticsearch index it is 6213755 ( this number varies by 100/200k records each time I run it ). The job completes correctly there are no errors on the Hadoop or Elasticsearch logs and the ES hadoop counters all have the right value. So I am not clear where the duplicates are coming from, but it looks like a bulk insert is being run more than once. I have seen the same documents appear up to 9 times. The strange thing is when I split the 1.5GB file into 5 files on and use one core EMR node the count is correct every time, but if I use 2 core nodes the count is off.

ES Hadoop Counters

        "org.elasticsearch.hadoop.mr.Counter": {
           "SCROLL_TOTAL": 0,
           "BYTES_ACCEPTED": 1747584120,
           "BYTES_RETRIED": 0,
           "DOCS_RETRIED": 0,
           "BYTES_RECEIVED": 4752000,
           "DOCS_RECEIVED": 0,
           "BULK_TOTAL_TIME_MS": 2093520,
           "BULK_RETRIES": 0,
           "DOCS_ACCEPTED": 5888755,
           "NET_RETRIES": 0,
           "BULK_TOTAL": 1188,
           "NET_TOTAL_TIME_MS": 2096311,
           "SCROLL_TOTAL_TIME_MS": 0,
           "BULK_RETRIES_TOTAL_TIME_MS": 0,
           "NODE_RETRIES": 0,
           "DOCS_SENT": 5888755,
           "BYTES_SENT": 1747584120

        },

Versions
ES Version 2.2.0
ES Hadoop Version: 2.2.0
Hadoop 2.7.0

james.baiera · October 5, 2016, 3:44pm

I'm assuming that you are using Amazon's EMR for the execution (or something similar)? This normally occurs when the MR environment has speculative execution turned on. The best route to get around this is to either turn off speculative execution, or if that is not possible, to assign a stable ID to each record that you are inserting (either a hash digest of the record, a composition of fields, etc.).

For more information on speculative execution woes take a look at this section of the docs.

Pat_Humphreys · October 5, 2016, 4:52pm

Hi thanks for getting back to me. I am using EMR and I did have speculative execution turned on, I turned it off and it works fine. Also is it not strange that the DOCS_ACCEPTED counter is less than the index count? Surely this is a bug as Hadoop has sent more then that number?

Pat_Humphreys · October 6, 2016, 8:53am

Just another follow up for anyone who has the same problem those MR properties specified in the link above have now been deprecated and you should use these in newer versions of MR:
mapred.map.tasks.speculative.execution -> mapreduce.map.speculative
mapred.reduce.tasks.speculative.execution -> mapreduce.reduce.speculative
Have a look here
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

Topic		Replies	Views
Duplicate documents get inserted when moving data from hive using elasticsearch-hadoop plugin to elasticsearch Elasticsearch es-hadoop	4	1696	January 11, 2018
Error when moving data from hive using elasticsearch-hadoop plugin to elasticsearch Elasticsearch es-hadoop	3	947	June 27, 2018
Wrong documents' count after inserting Elasticsearch	5	209	May 25, 2023
Bulk master failure behaviour Elasticsearch	4	547	July 5, 2017
Wrong number of docs in elasticsearch Elasticsearch	2	369	April 26, 2018

Duplicate insertions for a single document

Related topics