Hi, this is my sample Hive code:
create table db.sample
(id string,
count bigint,
time timestamp)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties(
'es.nodes.wan.only'='true',
'es.nodes'=esnode,
'es.resource'=index,
'es.mapping.names'='time:@timestamp');
Insert into table db.sample select * from data1;
create table db.sample_2
(id string,
status string,
count bigint,
time timestamp)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties(
'es.nodes.wan.only'='true',
'es.nodes'=esnode,
'es.resource'=index,
'es.mapping.names'='time:@timestamp');
Insert into table db.sample_2 select * from data2;
And this is my sample Spark code:
df_1 = data1.select("id","count","time")
df_2 = data2.select("id","status","count","time")
df_1.write.format("org.elasticsearch.spark.sql")\
.option('es.nodes.wan.only','true')\
.option('es.nodes',es_node)\
.option('es.resource',index)\
.option('es.mapping.names','time:@timestamp')\
.mode('append')\
.save(index)
df_2.write.format("org.elasticsearch.spark.sql")\
.option('es.nodes.wan.only','true')\
.option('es.nodes',es_node)\
.option('es.resource',index)\
.option('es.mapping.names','time:@timestamp')\
.mode('append')\
.save(index)
I am currently using Spark 2.4.4.
The issue I see is that each successive time I run my Spark code, some data is either duplicated or missing.
There is no problem with Hive. I am using the Elasticsearch Hadoop v8 jar for this.
Since I had a deadline, I am currently doing the processing in Spark, saving to a temp table, and then using Hive to transfer the data. I don't know why the Spark script didn't work. Also, I have about 8 dataframes that I am inserting, but the data volume is small: assume 450 to 1500 rows, and I think 3000 rows at most.
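
For reference, a minimal sketch of one way the Spark writes could be made repeatable. It assumes the duplicates come from re-running an append without a stable document id, which is not confirmed here. The helper names `make_doc_id` and `es_write_options`, the `doc_id` column, and the choice of `id` plus `time` as the key are all hypothetical; `es.mapping.id` and `es.write.operation` are elasticsearch-hadoop settings.

```python
import hashlib


def make_doc_id(*key_parts):
    """Build a deterministic id from the key values of one row.

    Hypothetical helper: the same key values always give the same id,
    so a re-run would overwrite a document instead of adding a copy.
    """
    joined = "\x1f".join("" if p is None else str(p) for p in key_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def es_write_options(es_node, index, id_column="doc_id"):
    """Return the writer options from the question plus an explicit id mapping.

    Assumption: id_column exists in the dataframe and is unique per row.
    """
    return {
        "es.nodes.wan.only": "true",
        "es.nodes": es_node,
        "es.resource": index,
        "es.mapping.names": "time:@timestamp",
        "es.mapping.id": id_column,      # use this column as the document _id
        "es.write.operation": "index",   # replace a document that has the same _id
    }


# Possible use from PySpark (not executed here; key columns are an assumption):
# from pyspark.sql import functions as F
# key = F.concat_ws("\x1f", F.col("id"), F.col("time").cast("string"))
# df_1 = df_1.withColumn("doc_id", F.sha2(key, 256))
# df_1.write.format("org.elasticsearch.spark.sql") \
#     .options(**es_write_options(es_node, index)) \
#     .mode("append") \
#     .save(index)
```

With a stable id, running the job twice should leave one document per key rather than two. It would not by itself explain the missing rows.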