Weird behavior when indexing from Spark

Hi all,
my setup is like this:

  • Spark v1.6.0 (CDH 5.10)
  • Elasticsearch v2.3
  • es-hadoop v2.3.2

My Spark program is something like this:

  val result = sqlContext.sql("select * from tablename")
  result.saveToEs("index/type", cfg)
  val es_count = ...                    // read the document count back from ES
  val hdfs_count = result.count()
  if (es_count != hdfs_count) throw new Exception("count mismatch")
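
For context, here is a more complete, self-contained sketch of that job, assuming a HiveContext and the elasticsearch-spark DataFrame API from es-hadoop 2.3.2; the es.nodes value, the es.mapping.id field, and the exception type below are placeholders, not the exact values from the real job:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext
  import org.elasticsearch.spark.sql._            // adds saveToEs to DataFrame

  object IndexJob {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("index-job"))
      val sqlContext = new HiveContext(sc)

      // Write settings; the values below are placeholders
      val cfg = Map(
        "es.nodes"      -> "es-host:9200",
        "es.mapping.id" -> "id"                   // assumed id field; without it ES auto-generates ids
      )

      val result = sqlContext.sql("select * from tablename")
      result.saveToEs("index/type", cfg)

      // Read the indexed count back and compare with the source count
      val es_count = sqlContext.read
        .format("org.elasticsearch.spark.sql")
        .options(cfg)
        .load("index/type")
        .count()
      val hdfs_count = result.count()

      if (es_count != hdfs_count)
        throw new RuntimeException(s"count mismatch: es=$es_count, hdfs=$hdfs_count")
    }
  }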

Here is the weird behavior (it does not happen systematically):

  • the job runs with no problems (i.e., tablename is indexed correctly and no exception is thrown);
  • a few seconds later, some duplicated records are added to the index even though the job has already finished. As a result es_count > hdfs_count, but since the duplicates arrive after the if (es_count != hdfs_count) check, no exception is thrown.

I thought it was something related to task re-execution, but I don't see any failed tasks. I also tried setting spark.task.maxFailures=1 in Spark, but nothing changed.
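
For reference, a sketch of how that setting can be applied, either directly on the SparkConf as below or by passing --conf spark.task.maxFailures=1 to spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("index-job")
    .set("spark.task.maxFailures", "1")   // no task retries: a single task failure fails the job
  val sc = new SparkContext(conf)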

Any suggestions?
