When we identify events corresponding to a particular device, we successfully retrieve all of its associated events using Spark Elasticsearch SQL.
However, when we try to retrieve the complete index data (one index is created each day), we get duplicate and missing events using Spark Elasticsearch SQL. Code below. Version: 5.1.1. The job runs on 5 executor nodes with 2 cores each. Any suggestions?
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("Sample")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
import org.apache.spark.sql.SaveMode  // needed for SaveMode.Overwrite below

val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxxxx:9200")
  .option("es.read.metadata", "true")
  .option("es.nodes.wan.only", "true")
  .option("es.read.field.as.array.include", "tags")
  .option("es.scroll.size", "1000")
  .load("dvclogs-2019.05.05")

df.write.mode(SaveMode.Overwrite).parquet("/tmp/es/")
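For what it's worth, here is a sketch of how the duplication can be quantified, assuming the `df` built above. Because `es.read.metadata` is set to `"true"`, each row carries a `_metadata` map column that includes the document `_id`, so comparing the total row count against the number of distinct `_id` values shows how many duplicate documents the read produced:

```scala
// Diagnostic sketch: count duplicates via the document _id exposed in the
// _metadata map column (present because es.read.metadata = "true").
val totalRows   = df.count()
val distinctIds = df.selectExpr("_metadata['_id'] AS id").distinct().count()
println(s"total=$totalRows, distinct ids=$distinctIds, duplicates=${totalRows - distinctIds}")
```

If `distinctIds` is lower than `totalRows`, the same documents are being read more than once (for example, across scroll slices or retried tasks), rather than the index itself containing duplicates.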