Any idea why data sent through df.write in PySpark doesn't match correctly? In the backend the data is correct.
Can you provide your index mappings and a PySpark script to reproduce this?
Can you be more specific about the differences you see?
Hi, this is a sample of the Hive code:
create table db.sample (
  id string,
  count bigint,
  time timestamp
)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties (
  'es.nodes.wan.only' = 'true',
  'es.nodes' = esnode,    -- placeholder for the Elasticsearch node address
  'es.resource' = index,  -- placeholder for the target index name
  'es.mapping.names' = 'time:@timestamp'
);

insert into table db.sample select * from data1;
create table db.sample_2 (
  id string,
  status string,
  count bigint,
  time timestamp
)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties (
  'es.nodes.wan.only' = 'true',
  'es.nodes' = esnode,
  'es.resource' = index,
  'es.mapping.names' = 'time:@timestamp'
);

insert into table db.sample_2 select * from data2;
And this is my sample Spark code:
df_1 = data1.select("id", "count", "time")
df_2 = data2.select("id", "status", "count", "time")

df_1.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes.wan.only", "true") \
    .option("es.nodes", es_node) \
    .option("es.resource", index) \
    .option("es.mapping.names", "time:@timestamp") \
    .mode("append") \
    .save(index)

df_2.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes.wan.only", "true") \
    .option("es.nodes", es_node) \
    .option("es.resource", index) \
    .option("es.mapping.names", "time:@timestamp") \
    .mode("append") \
    .save(index)
I am using Spark 2.4.4 right now. The issue I see is that each successive time I run my Spark code, some data either gets duplicated or goes missing. There is no problem with Hive. I am using the elasticsearch-hadoop v8 jar for this.

Since I had a deadline, I am currently doing the processing in Spark, saving to a temp table, and then using Hive to transfer the data. I don't know why the Spark script didn't work. Also, I have about 8 DataFrames that I am inserting, but the data volume is small: assume 450 to 1,500 rows, and at most around 3,000 rows.
Elasticsearch will insert data as it comes, with the _id autogenerated. Are you pushing the same records again? If so, you have duplicates in Elasticsearch. You need to provide a lot more info to understand what is going on.
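If the duplicates come from re-running the job, one option is to supply your own document _id so the writes are idempotent. A minimal sketch, assuming the id column uniquely identifies a row (es.mapping.id and es.write.operation are elasticsearch-hadoop settings):

# Minimal sketch: route each row to a stable _id so re-runs
# overwrite documents instead of duplicating them.
(df_1.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes.wan.only", "true")
    .option("es.nodes", es_node)
    .option("es.resource", index)
    .option("es.mapping.id", "id")            # column used as the document _id (assumed unique)
    .option("es.write.operation", "upsert")   # upsert instead of a plain index operation
    .option("es.mapping.names", "time:@timestamp")
    .mode("append")
    .save(index))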
No, but the schema for each DataFrame was different. I found a workaround: I now insert the data into a temporary table from Spark, then create an external table in Hive with Elasticsearch as storage and write the data into it using a select statement.
I'm still not sure why I was getting mismatches when sending the data directly from Spark.
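For reference, a minimal sketch of that workaround, using hypothetical table names db.tmp_sample (the staging table) and db.sample (the Elasticsearch-backed external table from the DDL above):

# Spark side: stage the processed DataFrame in an ordinary Hive table.
df_1.write.mode("overwrite").saveAsTable("db.tmp_sample")

# Hive side (run in Hive, not Spark), where db.sample is the external
# table created with EsStorageHandler:
#   insert into table db.sample select * from db.tmp_sample;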