I have a Cascading job that reads a 1.5 GB file of JSON docs from S3 and inserts them into Elasticsearch. The file contains 5,888,755 records, but a count on the Elasticsearch index returns 6,213,755, roughly 325,000 extra, and the surplus varies by 100-200k records on each run. The job completes successfully, there are no errors in the Hadoop or Elasticsearch logs, and the ES-Hadoop counters all show the right values, so I am not clear where the duplicates are coming from; it looks as if some bulk inserts are being run more than once. I have seen the same document appear up to 9 times. The strange thing is that when I split the 1.5 GB file into 5 files and use one core EMR node, the count is correct every time, but if I use 2 core nodes the count is off.
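My current theory is Hadoop speculative execution or task retries: a second attempt of a map task that had already flushed some bulk requests would re-send the same documents from another node, which would also explain why a single core node is always correct while two core nodes are not. A minimal sketch of what I plan to try, disabling speculative execution in the Cascading job's properties (MyJob is a placeholder for my actual job class):

    import java.util.Properties;
    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.property.AppProps;

    Properties props = new Properties();
    // Stop Hadoop from launching duplicate speculative attempts of slow
    // map/reduce tasks; a duplicate attempt re-sends the same bulk requests.
    props.setProperty("mapreduce.map.speculative", "false");
    props.setProperty("mapreduce.reduce.speculative", "false");
    // MyJob is a placeholder for the real job class.
    AppProps.setApplicationJarClass(props, MyJob.class);
    FlowConnector connector = new HadoopFlowConnector(props);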
ES Hadoop Counters
"org.elasticsearch.hadoop.mr.Counter": {
"SCROLL_TOTAL": 0,
"BYTES_ACCEPTED": 1747584120,
"BYTES_RETRIED": 0,
"DOCS_RETRIED": 0,
"BYTES_RECEIVED": 4752000,
"DOCS_RECEIVED": 0,
"BULK_TOTAL_TIME_MS": 2093520,
"BULK_RETRIES": 0,
"DOCS_ACCEPTED": 5888755,
"NET_RETRIES": 0,
"BULK_TOTAL": 1188,
"NET_TOTAL_TIME_MS": 2096311,
"SCROLL_TOTAL_TIME_MS": 0,
"BULK_RETRIES_TOTAL_TIME_MS": 0,
"NODE_RETRIES": 0,
"DOCS_SENT": 5888755,
"BYTES_SENT": 1747584120
},
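Worth noting: DOCS_SENT and DOCS_ACCEPTED both match the file's record count exactly, and BULK_RETRIES and DOCS_RETRIED are 0, so from es-hadoop's point of view each document was sent exactly once. That again points at whole task attempts being re-executed rather than bulk retries. As a workaround I am considering making the writes idempotent by mapping a field from each doc to the Elasticsearch _id; a rough sketch, assuming each JSON doc carries a unique "id" field (the index/type name below is a placeholder):

    import java.util.Properties;
    import cascading.tap.Tap;
    import org.elasticsearch.hadoop.cascading.EsTap;

    Properties props = new Properties();
    // The input is already raw JSON, so pass it straight to the bulk API.
    props.setProperty("es.input.json", "true");
    // Use a unique field from each doc ("id" here is my assumption) as the
    // ES _id, so a re-sent doc becomes an overwrite, not a new document.
    props.setProperty("es.mapping.id", "id");
    // "my-index/doc" is a placeholder for the real index/type.
    Tap out = new EsTap("my-index/doc");

With an explicit _id, a document that gets sent twice just bumps the version instead of creating a duplicate, so the index count should stay at 5,888,755 even if a task attempt runs more than once.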
Versions
ES: 2.2.0
ES-Hadoop: 2.2.0
Hadoop: 2.7.0