Truncate or update Elastic Hive tables


#1

Hi, all. I am using Elasticsearch Hive integration, so that I can query from Hadoop tables, sending alerts when data is bad (with ElastAlert), as well as display on Kibana.

This is how I created the Elastic table:

CREATE EXTERNAL TABLE my_elastic_table (
timestamp BIGINT,
count BIGINT
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'my_index/my_type);

And I inserted into Elastic Hive table with:

INSERT OVERWRITE TABLE my_elastic_table
SELECT {something} FROM my_hadoop_table;

However, it didn't really > OVERWRITE elastic_table, it actually appended to elastic_table. So I tried to TRUNCATE elastic_table, and it gave me the following error:

FAILED: SemanticException [Error 10146]: Cannot truncate non-managed table elastic_table.

So I am asking if anyone know how to truncate, update or overwrite Elastic Hive tables. Or is there any better way to deal with this kind of problem. Thank you!


(Costin Leau) #2

If you want to delete data, it's best to do it directly in Elasticsearch.
TRUNCATE and other delete operations are not supported in Hive since manly they assume files on HDFS (or the so-called managed tables`). Which is not the case with ES.

Deleting content is ES is quite easy - it can be a simple curl call for a given index.


#3

Hi Costin, thanks for your reply.

Instead of delete, is UPDATE supported? From Apache Hive integration Documentation. I see insert overwrite hive tables. However, as I tested a lot, it didn't really OVERWRITE the tables. Can I ask the reason for using insert overwrite in the documentation? Thanks a lot!


(Costin Leau) #4

INSERT OVERWRITE is a bit of a misnomer. Due to the way Hive works OVERWRITE simply tells it to not pay attention to the table metadata (if its exists). Internally however the data is not updated; further more as a Hive adaptor, ES-Hadoop is unaware of whether INSERT or INSERT OVERWRITE was being used; Hive infrastructure doesn't provide any information on this front.

As a comparison Spark SQL for example does support different update modes (SaveModes as they are called).
Due to API limitations, one is best in accessing ES directly when doing data management and use Hive for reads and writes.

Hope this helps,


(Costin Leau) #5

By the way, if you want to do updates see es.write.operation parameter which instructs ES-Hadoop to update data in ES instead of just indexing it as mentioned here.


(system) #6