Weird behavior when indexing from Spark

Hi all,
my setup is like this:

  • Spark v1.6.0 (CDH 5.10)
  • Elasticsearch v2.3
  • es-hadoop v2.3.2

My Spark program is something like this:

  val result = sqlContext.sql("select * from tablename")
  result.saveToEs("index/type", cfg)
  val es_count = ...                    // read the document count back from ES
  val hdfs_count = result.count()
  if (es_count != hdfs_count) throw new Exception("count mismatch")
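
For context, here is a more complete, self-contained sketch of that job, assuming a HiveContext and the elasticsearch-spark DataFrame API from es-hadoop 2.3.2; the es.nodes value, the es.mapping.id field, and the exception type below are placeholders, not the exact values from the real job:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext
  import org.elasticsearch.spark.sql._            // adds saveToEs to DataFrame

  object IndexJob {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("index-job"))
      val sqlContext = new HiveContext(sc)

      // Write settings; the values below are placeholders
      val cfg = Map(
        "es.nodes"      -> "es-host:9200",
        "es.mapping.id" -> "id"                   // assumed id field; without it ES auto-generates ids
      )

      val result = sqlContext.sql("select * from tablename")
      result.saveToEs("index/type", cfg)

      // Read the indexed count back and compare with the source count
      val es_count = sqlContext.read
        .format("org.elasticsearch.spark.sql")
        .options(cfg)
        .load("index/type")
        .count()
      val hdfs_count = result.count()

      if (es_count != hdfs_count)
        throw new RuntimeException(s"count mismatch: es=$es_count, hdfs=$hdfs_count")
    }
  }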

Here is the weird behavior (it does not happen systematically):

  • the job runs with no problems (i.e., tablename is indexed correctly and no exception is thrown);
  • a few seconds later, some duplicated records are added to the index even though the job has already finished. As a result es_count > hdfs_count, but since the duplicates arrive after the if (es_count != hdfs_count) check, no exception is thrown.

I thought it was something related to task re-execution, but I don't see any failed tasks. I also tried setting spark.task.maxFailures=1 in Spark, but nothing changed.
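
For reference, a sketch of how that setting can be applied, either directly on the SparkConf as below or by passing --conf spark.task.maxFailures=1 to spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("index-job")
    .set("spark.task.maxFailures", "1")   // no task retries: a single task failure fails the job
  val sc = new SparkContext(conf)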

Any suggestions?
