Data mismatches when sending data to an Elasticsearch index using PySpark

yolo1 · April 20, 2025, 2:06pm

Hi, this is my sample Hive code:

create table db.sample
(id string,
count bigint,
time timestamp)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties(
'es.nodes.wan.only'='true',
'es.nodes'=esnode,
'es.resource'=index,
'es.mapping.names'='time:@timestamp');
Insert into table db.sample select * from data1;

create table db.sample_2
(id string,
status string,
count bigint,
time timestamp)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties(
'es.nodes.wan.only'='true',
'es.nodes'=esnode,
'es.resource'=index,
'es.mapping.names'='time:@timestamp');
Insert into table db.sample_2 select * from data2;

And this is my sample Spark code:

df_1 = data1.select("id","count","time")
df_2 = data2.select("id","status","count","time")
df_1.write.format("org.elasticsearch.spark.sql")\
.option('es.nodes.wan.only','true')\
.option('es.nodes',es_node)\
.option('es.resource',index)\
.option('es.mapping.names','time:@timestamp')\
.mode('append')\
.save(index)

df_2.write.format("org.elasticsearch.spark.sql")\
.option('es.nodes.wan.only','true')\
.option('es.nodes',es_node)\
.option('es.resource',index)\
.option('es.mapping.names','time:@timestamp')\
.mode('append')\
.save(index)

I am using Spark 2.4.4 right now.
The issue is that each time I rerun my Spark code, some data in the index is either duplicated or missing.

There is no problem with Hive. I am using the elasticsearch-hadoop v8 jar for this.
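One possible cause worth ruling out (an assumption, not something confirmed here): the Spark writes above set no document id, so Elasticsearch generates a new `_id` for every document, and a rerun or a retried Spark task could index the same rows again. Below is a minimal sketch of a write that reuses the same `_id` for the same row. `es.mapping.id` and `es.write.operation` are elasticsearch-hadoop settings; the function names, the `doc_id` column, and the choice of `key_cols` are hypothetical and only for illustration.

```python
def es_write_options(es_node, index, id_field):
    """Build es-hadoop options that index each row under a fixed document id."""
    return {
        "es.nodes": es_node,
        "es.nodes.wan.only": "true",
        "es.resource": index,
        "es.mapping.names": "time:@timestamp",
        # Use the value of this column as the Elasticsearch _id instead of
        # letting Elasticsearch generate a new one on every write.
        "es.mapping.id": id_field,
        # "index" replaces an existing document that has the same _id.
        "es.write.operation": "index",
    }


def write_idempotent(df, es_node, index, key_cols, id_field="doc_id"):
    """Append df to Elasticsearch with an _id derived from key_cols."""
    # Imported here so the options helper can be used without Spark installed.
    from pyspark.sql import functions as F

    # Assumption: key_cols together identify a row uniquely.
    # Note: concat_ws skips NULL values, so NULL keys weaken uniqueness.
    # The id column is also indexed as a normal field of the document.
    key = F.concat_ws("|", *[F.col(c).cast("string") for c in key_cols])
    with_id = df.withColumn(id_field, F.sha2(key, 256))

    (with_id.write.format("org.elasticsearch.spark.sql")
        .options(**es_write_options(es_node, index, id_field))
        .mode("append")
        .save())
```

With this, a call such as `write_idempotent(df_1, es_node, index, ["id", "time"])` would overwrite rather than duplicate a row on a second run, provided `id` and `time` together are unique per row, which is an assumption about the data.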

Since I had a deadline, I am currently doing the processing in Spark, saving to a temp table, and then using Hive to transfer the data. I don't know why the Spark script didn't work. I have about 8 DataFrames that I am inserting, but the data volume is small: you can assume 450 to 1500 rows each, and I think 3000 rows at most.
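Because the row counts are this small, it is feasible to collect the ids on the driver after each run and check exactly which rows went missing or were duplicated. A rough sketch follows, assuming the `id` column identifies a row and that the index holds only the rows of the DataFrame being checked. `compare_ids` and `ids_from_df` are hypothetical helper names, not part of any library.

```python
from collections import Counter


def compare_ids(source_ids, es_ids):
    """Return (missing, duplicated) ids of es_ids relative to source_ids."""
    source_counts = Counter(source_ids)
    es_counts = Counter(es_ids)
    # Ids indexed fewer times than they occur in the source.
    missing = sorted(k for k in source_counts if es_counts[k] < source_counts[k])
    # Ids indexed more times than they occur in the source
    # (this also catches ids that are not in the source at all).
    duplicated = sorted(k for k in es_counts if es_counts[k] > source_counts[k])
    return missing, duplicated


def ids_from_df(df, col="id"):
    """Collect one column to the driver; fine for a few thousand rows."""
    return [row[col] for row in df.select(col).collect()]


# Usage with Spark (es_node and index as in the write code):
# es_df = (spark.read.format("org.elasticsearch.spark.sql")
#          .option("es.nodes", es_node)
#          .option("es.nodes.wan.only", "true")
#          .load(index))
# missing, duplicated = compare_ids(ids_from_df(df_1), ids_from_df(es_df))
```

Running this after each job would show whether the same ids are affected every time or whether the set changes between runs.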
