To map a Hive table into ES

Hello there!

When you create a table in Hive that uses the EsStorageHandler, Hive creates what the documentation calls an "external non-native" table. Much of the interaction with that table is delegated to the underlying storage handler.
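As a minimal sketch of such a table (the table name, columns, index name, and node address below are hypothetical, not taken from your setup):

```sql
-- Hypothetical example of an "external non-native" table backed by Elasticsearch.
CREATE EXTERNAL TABLE artists_es (
  id   BIGINT,
  name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'artists/data',    -- target index/type in Elasticsearch
  'es.nodes'    = 'localhost:9200'   -- Elasticsearch node(s) to connect to
);
```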

When you create an external table using an HDFS directory, Hive creates an "external native" table. In this scenario, Hive knows how to interact with the underlying store on its own; it just knows that it's not the sole owner of the data being used.
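Compare that with a plain HDFS-backed external table (again, the names, columns, and path are made up for illustration):

```sql
-- Hypothetical "external native" table: Hive reads and writes the HDFS directory itself.
CREATE EXTERNAL TABLE artists_hdfs (
  id   BIGINT,
  name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/artists';   -- existing HDFS directory, e.g. a Sqoop target dir
```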

Using an external table is just placing Hive on top of existing data. This essentially means that whichever storage handler you've decided to use becomes the primary backing store that Hive executes against. In this case, you have HDFS and Elasticsearch as two storage locations that Hive has more or less "mounted".

When you insert data into an HDFS location that is configured as an external table for Hive, it's as if you're inserting the data directly into the datastore. It would be the same as indexing data directly into Elasticsearch: Hive would see the data if you queried it, but it would have had no role in the insertion. This is why pulling data with Sqoop into HDFS works for the HDFS-backed table but not for the Elasticsearch one. Elasticsearch is not backed by HDFS, which is why the LOCATION syntax is ignored and why data pulled into that directory is not picked up.

If you are looking to publish data into Elasticsearch using just Hive, you'll have to do so through INSERT statements. One thing you may want to do is tell the EsStorageHandler which field should be treated as the document id, via the es.mapping.id setting in the table's TBLPROPERTIES.
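Roughly, reusing the hypothetical tables from above (the choice of the `id` column is just an example):

```sql
-- Hypothetical: the 'id' column becomes the Elasticsearch document _id.
CREATE EXTERNAL TABLE artists_es_keyed (
  id   BIGINT,
  name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'   = 'artists/data',
  'es.mapping.id' = 'id');

-- The INSERT is what actually ships rows to Elasticsearch.
INSERT OVERWRITE TABLE artists_es_keyed
SELECT id, name FROM artists_hdfs;
```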

Alternatively, you could index the data from the source RDBMS directly into Elasticsearch using something other than Hive and still query the data with Hive for other use cases. Perhaps Sqoop the data to a different HDFS directory per execution, and run a Pig/MR/Spark job to insert just that single incremental extract?
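If you do end up with a per-execution directory, a Hive-flavored variant of that incremental pattern (Pig/MR/Spark would work just as well; the paths and table names here are made up) would be to point a staging table at the latest extract and append only its rows:

```sql
-- Hypothetical: one staging table per extract, pointed at that run's Sqoop output.
CREATE EXTERNAL TABLE artists_run_42 (
  id   BIGINT,
  name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/incoming/artists/run_42';

-- Append only this run's rows to the Elasticsearch-backed table.
INSERT INTO TABLE artists_es_keyed
SELECT id, name FROM artists_run_42;
```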

Hope this helps!
