Any idea why data sent through df.write in PySpark doesn't match correctly? In the backend the data is correct.
Can you provide your index mappings and a PySpark script to reproduce this?
Can you be more specific about the differences you see?
Hi, this is a sample of the Hive code:
create table db.sample (
  id string,
  count bigint,
  time timestamp
)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties (
  'es.nodes.wan.only' = 'true',
  'es.nodes' = esnode,    -- placeholder for the Elasticsearch node address
  'es.resource' = index,  -- placeholder for the target index name
  'es.mapping.names' = 'time:@timestamp'
);

insert into table db.sample select * from data1;
create table db.sample_2 (
  id string,
  status string,
  count bigint,
  time timestamp
)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties (
  'es.nodes.wan.only' = 'true',
  'es.nodes' = esnode,
  'es.resource' = index,
  'es.mapping.names' = 'time:@timestamp'
);

insert into table db.sample_2 select * from data2;
And this is my sample Spark code:
df_1 = data1.select("id", "count", "time")
df_2 = data2.select("id", "status", "count", "time")

df_1.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes.wan.only", "true") \
    .option("es.nodes", es_node) \
    .option("es.resource", index) \
    .option("es.mapping.names", "time:@timestamp") \
    .mode("append") \
    .save(index)

df_2.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes.wan.only", "true") \
    .option("es.nodes", es_node) \
    .option("es.resource", index) \
    .option("es.mapping.names", "time:@timestamp") \
    .mode("append") \
    .save(index)
I am using Spark 2.4.4 right now. The issue I see is that each successive time I run my Spark code, some data either gets duplicated or goes missing. There is no problem with Hive. I am using the elasticsearch-hadoop v8 jar for this.

Since I had a deadline, I am currently doing the processing in Spark, saving to a temp table, and then using Hive to transfer the data. I don't know why the Spark script didn't work. Also, I have about 8 DataFrames that I am inserting, but the data volume is small: assume 450 to 1,500 rows, and at most around 3,000 rows.
Elasticsearch will insert data as it comes, with the _id autogenerated. Are you pushing the same records again? If so, you have duplicates in Elasticsearch. You need to provide a lot more info to understand what is going on.
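If the duplicates come from re-running the job, one option is to supply your own document _id so the writes are idempotent. A minimal sketch, assuming the id column uniquely identifies a row (es.mapping.id and es.write.operation are elasticsearch-hadoop settings):

# Minimal sketch: route each row to a stable _id so re-runs
# overwrite documents instead of duplicating them.
(df_1.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes.wan.only", "true")
    .option("es.nodes", es_node)
    .option("es.resource", index)
    .option("es.mapping.id", "id")            # column used as the document _id (assumed unique)
    .option("es.write.operation", "upsert")   # upsert instead of a plain index operation
    .option("es.mapping.names", "time:@timestamp")
    .mode("append")
    .save(index))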
No, but the schema for each DataFrame was different. I found a workaround: I now insert the data into a temporary table from Spark, then create an external table in Hive with Elasticsearch as storage and write the data into it using a select statement.
I'm still not sure why I was getting mismatches when sending the data directly from Spark.
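For reference, a minimal sketch of that workaround, using hypothetical table names db.tmp_sample (the staging table) and db.sample (the Elasticsearch-backed external table from the DDL above):

# Spark side: stage the processed DataFrame in an ordinary Hive table.
df_1.write.mode("overwrite").saveAsTable("db.tmp_sample")

# Hive side (run in Hive, not Spark), where db.sample is the external
# table created with EsStorageHandler:
#   insert into table db.sample select * from db.tmp_sample;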