Data in Elasticsearch doesn't match when running the same job

yolo1 · April 13, 2025, 2:24pm

I have a Hive script that creates multiple external tables pointing to one index and stores the data there. Later on we visualise that data in ES.

While converting the code from Hive to PySpark, I noticed that the data does not match between runs of the same PySpark job. Not all values differ, only some.

What could be the reason for it? My guess is that in Hive we have multiple tables, so the data is not mixed, whereas in PySpark I am sending all DataFrames to the same index. But if I run the same PySpark job twice, I get a data mismatch.
At first I thought my code was bad, but the Spark and Hive code return the same values if I run them manually in the terminal.
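
One possible cause, which is only a guess and is not confirmed anywhere in this thread: if documents are written without an explicit ID, Elasticsearch generates a new one for each document, so several DataFrames writing to the same index (or a retried Spark task) can leave duplicate or overlapping documents. Below is a minimal sketch of deriving a stable ID from key columns in plain Python. The function name, the column names, and the `source` labels are hypothetical. With the elasticsearch-hadoop connector, a column holding such a value can be named in the `es.mapping.id` setting so that re-running the job overwrites documents instead of adding new ones.

```python
import hashlib


def stable_doc_id(source, row, key_columns):
    """Build a deterministic ID from a source label and the row's key columns.

    `source` distinguishes DataFrames that share one index (hypothetical label).
    """
    parts = [source] + [str(row[c]) for c in key_columns]
    # Join with a separator that is unlikely to appear in the values themselves.
    raw = "\x1f".join(parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Hypothetical row; the same input always yields the same ID.
row = {"customer_id": 42, "day": "2025-04-13", "total": 17.5}
id_a = stable_doc_id("orders_daily", row, ["customer_id", "day"])
id_b = stable_doc_id("orders_daily", row, ["customer_id", "day"])
# A different source label yields a different ID, so the two sets do not collide.
id_c = stable_doc_id("returns_daily", row, ["customer_id", "day"])
```

Including the source label in the ID keeps rows from different DataFrames apart, which roughly mirrors the separation the multiple Hive tables provided.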

jessgarson · April 21, 2025, 7:12pm

Thanks for reaching out. A few follow-up questions:

What version of Elastic are you using?
Do you have a code example you can share with us?
I'd also like to learn more about exactly what isn't matching.

Best,

Jessica
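
To pin down exactly what isn't matching, one option is to export the documents from two runs and compare them by key. The sketch below assumes each run is available as a list of dicts sharing a unique key field. The function name, field names, and sample records are hypothetical and not taken from the original job.

```python
def diff_runs(run_a, run_b, key):
    """Compare two lists of dict records that share a unique `key` field.

    Returns (keys only in A, keys only in B, {key: {field: (a_value, b_value)}}).
    """
    a = {r[key]: r for r in run_a}
    b = {r[key]: r for r in run_b}
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    changed = {}
    for k in sorted(a.keys() & b.keys()):
        # Collect every field whose value differs between the two runs.
        fields = {
            f: (a[k].get(f), b[k].get(f))
            for f in a[k].keys() | b[k].keys()
            if a[k].get(f) != b[k].get(f)
        }
        if fields:
            changed[k] = fields
    return only_a, only_b, changed


# Hypothetical sample data from two runs of the same job.
first = [{"id": 1, "total": 10}, {"id": 2, "total": 20}, {"id": 3, "total": 30}]
second = [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 4, "total": 40}]
only_first, only_second, changed = diff_runs(first, second, "id")
```

Keys that appear in only one run suggest missing or extra documents, while entries in `changed` show which fields differ for documents present in both.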

elasticforme · April 21, 2025, 7:27pm

This is a duplicate post of:

https://discuss.elastic.co/t/data-mismatches-happening-while-sending-data-to-elastic-search-index-using-pyspark/377188/4

jessgarson · April 21, 2025, 7:56pm

Thanks, @elasticforme. Let's use that post for any further conversation on that subject.
