I have a Hive script that creates multiple external tables pointing to a single Elasticsearch (ES) index and stores the data there. Later on we visualise that data in ES.
While converting the code from Hive to PySpark, I noticed that the data does not match when I run the equivalent PySpark job. Not all values differ, only some of them.
What could be the reason for this? My guess is that in Hive the data is not mixed because it goes through multiple tables, whereas in PySpark I am sending all DataFrames to the same index. But if I run the same PySpark job twice, I also get a data mismatch.
At first I thought my code was wrong, but the Spark and Hive code return the same values when I run them manually in a terminal.
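
To pin down which values actually differ, the two outputs could be compared row by row. Below is a minimal sketch in plain Python, assuming every row carries a unique key column. The `diff_rows` helper, the `id` and `total` columns, and the sample rows are all hypothetical and not taken from the real job.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def diff_rows(left: Iterable[Row], right: Iterable[Row], key: str = "id") -> Dict[str, list]:
    """Compare two sets of rows by a key column and report the differences."""
    # Index each side by the (assumed unique) key column.
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Keys present on one side only.
    only_left = sorted(left_by_key.keys() - right_by_key.keys())
    only_right = sorted(right_by_key.keys() - left_by_key.keys())

    # Keys present on both sides whose column values differ.
    changed = []
    for k in sorted(left_by_key.keys() & right_by_key.keys()):
        l_row, r_row = left_by_key[k], right_by_key[k]
        cols = sorted(c for c in l_row.keys() | r_row.keys() if l_row.get(c) != r_row.get(c))
        if cols:
            changed.append((k, cols))

    return {"only_left": only_left, "only_right": only_right, "changed": changed}


# Hypothetical sample data standing in for the Hive and PySpark results.
hive_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 20}, {"id": 3, "total": 30}]
spark_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 4, "total": 40}]

result = diff_rows(hive_rows, spark_rows)
print(result)
```

For small results, the input rows could come from `[r.asDict() for r in df.collect()]` on each DataFrame. The same comparison applied to two runs of the PySpark job would show whether the mismatch comes from missing documents, extra documents, or changed values.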