my setup is like this:
- Spark v1.6.0 (CDH 5.10)
- Elastic v2.3
- es-hadoop: v 2.3.2
My spark program is something like this:
val result = sqlContext.sql("select * from tablename") result.saveToEs("index/type", cfg) val es_count=read document count from ES val hdfs_count=result.count() if (es_count!=hdfs_count) then throw Exception
Here is the weird behavior (it does not appear sistematically):
- the job runs with no problems (i.e., tablename is indexed correctly and no exception is thrown);
- after few seconds a few (duplicated) records are added to the index even if the job is supposed to be finished. It follows that es_count>hdfs_count but the duplicated records are added after the
if (es_count!=hdfs_count)check so no exception is fired.
I thought it was something related to task re-execution but I don't see any failed task. I also tried to set
spark.task.maxFailures=1 in Spark but nothing changes.