When we identify events corresponding to a particular device, we successfully retrieve all of its associated events using Spark Elasticsearch SQL.
However, when we try to retrieve the complete index data (one index is created each day), we get duplicate and missing events using Spark Elasticsearch SQL. Code below. Version: 5.1.1. The job runs on 5 executor nodes with 2 cores each. Any suggestions?
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("Sample")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
import org.apache.spark.sql.SaveMode  // needed for SaveMode.Overwrite below

val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxxxx:9200")
  .option("es.read.metadata", "true")
  .option("es.nodes.wan.only", "true")
  .option("es.read.field.as.array.include", "tags")
  .option("es.scroll.size", "1000")
  .load("dvclogs-2019.05.05")

df.write.mode(SaveMode.Overwrite).parquet("/tmp/es/")
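For what it's worth, here is a sketch of how the duplication can be quantified, assuming the `df` built above. Because `es.read.metadata` is set to `"true"`, each row carries a `_metadata` map column that includes the document `_id`, so comparing the total row count against the number of distinct `_id` values shows how many duplicate documents the read produced:

```scala
// Diagnostic sketch: count duplicates via the document _id exposed in the
// _metadata map column (present because es.read.metadata = "true").
val totalRows   = df.count()
val distinctIds = df.selectExpr("_metadata['_id'] AS id").distinct().count()
println(s"total=$totalRows, distinct ids=$distinctIds, duplicates=${totalRows - distinctIds}")
```

If `distinctIds` is lower than `totalRows`, the same documents are being read more than once (for example, across scroll slices or retried tasks), rather than the index itself containing duplicates.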