I am writing data from a Spark DataFrame into Elasticsearch. I print the schema and the total record count, and everything looks fine until the dump starts. The job completes successfully and no issue/error is raised in the Spark job, but the index does not end up with the amount of data it should have.
I have 1,800k records to dump, and sometimes only 500k of them land in the index, sometimes 800k, etc.
Here is the main section of the code.
from pyspark.sql import SparkSession

# warehouse_location, ES_Nodes, ES_PORT and ES_RESOURCE are set elsewhere in the script.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config('spark.yarn.executor.memoryOverhead', '4096') \
    .enableHiveSupport() \
    .getOrCreate()

final_df = spark.read.load("/trans/MergedFinal_stage_p1", multiline="false", format="json")

print(final_df.count())  # the count is perfectly fine
final_df.printSchema()   # the schema is also fine

## Issue appears when the data gets written to ES ##
final_df.write.mode("ignore").format(
    'org.elasticsearch.spark.sql'
).option(
    'es.nodes', ES_Nodes
).option(
    'es.port', ES_PORT
).option(
    'es.resource', ES_RESOURCE
).save()
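For reference, the shortfall can be confirmed by reading the index back through the same connector after the job finishes and counting the rows; a minimal sketch along these lines (reusing the ES_Nodes, ES_PORT and ES_RESOURCE values from above, not part of the job itself):

es_df = spark.read.format(
    'org.elasticsearch.spark.sql'
).option(
    'es.nodes', ES_Nodes
).option(
    'es.port', ES_PORT
).option(
    'es.resource', ES_RESOURCE
).load()

# Should match final_df.count() (~1,800k), but the index only holds
# ~500k-800k depending on the run.
print(es_df.count())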
My cluster resources are also fine.
Here is the command used to run the Spark job:
time spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --jars elasticsearch-spark-30_2.12-7.14.1.jar \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 6g \
    --executor-memory 3g \
    --num-executors 16 \
    --executor-cores 2 \
    main_es.py