ES-Hadoop 2.0.2 jars and INSERT OVERWRITE


#1

Are there known issues with the 2.0.2 jar and INSERT OVERWRITE? I have defined my external table and can insert into my ES index and everything works like a charm.

However, I notice that even when I specify OVERWRITE that the index is always appended. I can of course delete the index before I start, but I would prefer to be able to do from within the Hive context.
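For reference, a minimal sketch of the setup being described; the table, index, and column names here are placeholders, not the actual schema:

```sql
-- External Hive table backed by an Elasticsearch index (ES-Hadoop storage handler)
CREATE EXTERNAL TABLE logs_es (id STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'logs/entry');

-- Despite the OVERWRITE keyword, the rows end up appended to the existing index
INSERT OVERWRITE TABLE logs_es
SELECT id, message FROM logs_staging;
```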

Thanks in advance, Andrew

P.S. I'm using ES 1.5.0, Hive 0.10 (and CDH 4.7).


(Costin Leau) #2

Hive doesn't expose the overwrite flag to external tables, so a storage provider has no way of knowing whether an insert is an overwrite or a normal one.
Furthermore, the SQL semantics differ somewhat: in some cases INSERT OVERWRITE removes the entire data set, but more often than not it only overwrites the entries specified.
Thus object identity needs to be defined, and that is handled by the connector directly, regardless of whether OVERWRITE is used or not. In other words, by choosing the write operation (update vs index vs create) and specifying the document id, the behaviour can be tweaked per entry/doc level, which is typically what one wants.
If not, one can simply drop the index before insertion.
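As an illustration of the doc-level tweaking described above, ES-Hadoop lets the table declare which column supplies the document id and which write operation to use, so re-inserting the same rows updates the matching documents instead of appending duplicates. Table and column names below are placeholders:

```sql
CREATE EXTERNAL TABLE logs_es (id STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'        = 'logs/entry',
  'es.mapping.id'      = 'id',      -- use the id column as the document _id
  'es.write.operation' = 'upsert'   -- update existing docs, create missing ones
);
```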

Hope this helps,


#3

Hallo Costin,

Thanks for your reply. The standard Hive behaviour with internal tables (certainly with partitions) is that Hive empties the target location/partition and writes the new data afresh. With EXTERNAL TABLES that is a grey area, as the data does not strictly "belong" to Hive, so I can understand why a delete does not happen. I would think the most elegant implementation would be to control this behaviour via a property, as is done with e.g. 'es.index.auto.create' (although DELETEs are there in Hive 0.14).

But at least I know that is expected behaviour! Thanks for your help.

Andrew


(Costin Leau) #4

The problem with properties is that they are defined per table, while OVERWRITE is defined per query: there's no way for a TABLE to know whether a given INSERT is actually an OVERWRITE or not, and deleting the index on each INSERT is not a solution.
The only way to fix this, especially for destructive operations like DELETE, is to tell the storage layer what operation to execute rather than having it infer one. As a side note, Spark SQL offers such a hook, which the connector plugs into; it therefore knows when an OVERWRITE is happening and can trigger an index delete.

From the connector's perspective, having such an interface in Hive (along with proper pushdown operations) would be great, since it would ultimately create a better integration and a richer experience for using Elasticsearch from Hive.

