ES-Hadoop 2.0.2 jars and INSERT OVERWRITE


#1

Are there known issues with the 2.0.2 jar and INSERT OVERWRITE? I have defined my external table and can insert into my ES index and everything works like a charm.

However, I notice that even when I specify OVERWRITE that the index is always appended. I can of course delete the index before I start, but I would prefer to be able to do from within the Hive context.
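For reference, a minimal sketch of the setup being described; the table, index, and column names here are placeholders, not the actual schema:

```sql
-- External Hive table backed by an Elasticsearch index (ES-Hadoop storage handler)
CREATE EXTERNAL TABLE logs_es (id STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'logs/entry');

-- Despite the OVERWRITE keyword, the rows end up appended to the existing index
INSERT OVERWRITE TABLE logs_es
SELECT id, message FROM logs_staging;
```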

Thanks in advance, Andrew

P.S. I'm using ES 1.5.0, Hive 0.10 (and CDH 4.7).


(Costin Leau) #2

Hive doesn't expose the overwrite flag to external tables, so a storage provider has no way of knowing whether an insert is an overwrite or a normal one.
Furthermore, the SQL semantics differ somewhat: in some cases INSERT OVERWRITE removes the entire data set, but more often than not it only overwrites the entries specified.
Thus object identity needs to be defined, and that is handled by the connector directly, regardless of whether OVERWRITE is used or not. In other words, by choosing the write operation (update vs index vs create) and specifying the document id, the behaviour can be tweaked per entry/doc level, which is typically what one wants.
If not, one can simply drop the index before insertion.
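As an illustration of the doc-level tweaking described above, ES-Hadoop lets the table declare which column supplies the document id and which write operation to use, so re-inserting the same rows updates the matching documents instead of appending duplicates. Table and column names below are placeholders:

```sql
CREATE EXTERNAL TABLE logs_es (id STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'        = 'logs/entry',
  'es.mapping.id'      = 'id',      -- use the id column as the document _id
  'es.write.operation' = 'upsert'   -- update existing docs, create missing ones
);
```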

Hope this helps,


#3

Hallo Costin,

Thanks for your reply. The standard Hive behaviour with internal tables (certainly with partitions) is that Hive empties the target location/partition and writes the new data afresh. With EXTERNAL TABLES that is a grey area, as the data does not strictly "belong" to Hive, so I can understand why a delete does not happen. I would think the most elegant implementation would be to control this behaviour via a property, as is done with e.g. 'es.index.auto.create' (although DELETEs are there in Hive 0.14).

But at least I know that is expected behaviour! Thanks for your help.

Andrew


(Costin Leau) #4

The problem with properties is that they are defined per table, while OVERWRITE is defined per query: there's no way for a TABLE to know whether a given INSERT is actually an OVERWRITE or not, and deleting the index on each INSERT is not a solution.
The only way to fix this, especially for destructive operations like DELETE, is to tell the storage layer what operation to execute rather than having it infer one. As a side note, Spark SQL offers such a hook, which the connector plugs into; it therefore knows when an OVERWRITE is happening and can trigger an index delete.

From the connector's perspective, having such an interface in Hive (along with proper pushdown operations) would be great, since it would ultimately create a better integration and a richer experience for using Elasticsearch from Hive.

