Duplicates and missing documents during bulk retrieval

When we identify events corresponding to a particular device, we retrieve all the events associated with it using Spark Elasticsearch SQL, and that works correctly.

However, when we try to retrieve the complete index data (one index is created each day), we see duplicate and missing events using Spark Elasticsearch SQL. Code below. Version: 5.1.1. The job runs on 5 executor nodes with 2 cores each. Any suggestions?

val spark = org.apache.spark.sql.SparkSession.builder().appName("Sample").enableHiveSupport().getOrCreate()
import spark.implicits._
import org.apache.spark.sql.SaveMode

val df = spark.read.format("org.elasticsearch.spark.sql").
  option("es.nodes", "xxxxxxx:9200").
  option("es.read.metadata", "true").
  option("es.nodes.wan.only", "true").
  option("es.read.field.as.array.include", "tags").
  option("es.scroll.size", 1000).
  load("dvclogs-2019.05.05")

df.write.mode(SaveMode.Overwrite).parquet("/tmp/es/")
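Since es.read.metadata is already enabled, one way to narrow this down is to compare the total row count against the number of distinct document IDs from the metadata column. A sketch, assuming the connector's default metadata column name _metadata (adjust if overridden via es.read.metadata.field); the docId column name is just an illustration:

val withId = df.withColumn("docId", $"_metadata".getItem("_id"))

// If total > distinct, the same documents are being read more than once
// (e.g. re-scrolled); if distinct is below the index's doc count, documents
// are being missed.
val total    = withId.count()
val distinct = withId.select("docId").distinct().count()
println(s"total=$total distinct=$distinct")

// Dropping duplicates before writing masks the symptom but is not a fix:
withId.dropDuplicates("docId").drop("docId").
  write.mode(SaveMode.Overwrite).parquet("/tmp/es/")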

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.