Hi everyone,
I am using HDFS to store data (files1.txt) and load it into a Hive table (user_source). Then I create an external table as below:
create external table user_tr( id int,name string)
row format delimited fields terminated by ','
lines terminated by '\n'
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
location '/user/hive/warehouse/user_source'
tblproperties ( 'es.resource'='/blogs/user/',
'es.index.auto.create'='true',
'es.nodes'='172.x.y.z',
'es.port'='7200',
'es.mapping.names'='id:id,name:name');
The table is created successfully.
However, in the head plugin (/_plugin/head/) I cannot see a type created automatically under the 'blogs' index.
Of course, if I create an external table stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler' and run:
insert overwrite table user_try
select * from user_source;
then a type is created and I can see that all the data is loaded into ES.
Is that because the LOCATION clause is not applicable here? If so, how can I automatically load data into ES without creating an external table and then inserting data into it from a Hive table?
Looking forward to your reply.
PS: The whole problem is this: I used Sqoop to load data from MySQL to HDFS, and that worked. Then I wanted to create an external table mapped to that HDFS location,
so I could avoid creating a local Hive table and using 'insert' to load data from the Hive table into the external one.
But every time I Sqoop more data, the ES index size increases.
Hello there!
When you create a table in Hive that uses the EsStorageHandler, Hive creates what the documentation calls an "external non-native" table. Much of the interaction with such a table is delegated to the underlying storage handler.
When you create an external table over an HDFS directory, Hive creates an "external native" table. In this scenario, Hive knows how to interact with the underlying store on its own; it just knows that it is not the sole owner of the data being used.
Using an external table is just placing Hive on top of existing data. This essentially means that whichever storage handler you've decided to use becomes the primary backing store that Hive executes against. In this case, you have HDFS and Elasticsearch as two storage locations that Hive has more or less "mounted".
When you insert data into an HDFS location that is configured as an external table for Hive, it's as if you are inserting the data directly into the datastore, much like indexing data directly into Elasticsearch: Hive will see the data when you query it, but it played no role in the insertion. This is why pulling data with Sqoop into HDFS works for the HDFS-backed table but not for the Elasticsearch one. Elasticsearch is not backed by HDFS, which is why the LOCATION clause is ignored and why data pulled into that directory is not picked up.
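To make the contrast concrete, here is a minimal sketch of the "native" side, assuming your user_source table sits over the directory Sqoop writes to (the exact DDL is my guess):
CREATE EXTERNAL TABLE user_source (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/user_source';
-- Any file Sqoop drops into /user/hive/warehouse/user_source is picked up
-- automatically, because Hive itself reads that directory.
-- The EsStorageHandler table is "non-native": its data lives in Elasticsearch,
-- so a LOCATION clause on it has no effect.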
If you are looking to publish data into Elasticsearch using just Hive, you'll have to do so through the INSERT commands. One thing you may want to do is notify the EsStorageHandler which field should be considered the document id via the es.mapping.id setting in the TBLPROPERTIES of the table.
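For example, a rough sketch along those lines (the table name user_es is a placeholder, and the connection values are copied from your earlier DDL):
-- With 'es.mapping.id' = 'id', the id column is used as the Elasticsearch
-- document _id, so re-inserting a row with the same id replaces the existing
-- document instead of creating a duplicate.
CREATE EXTERNAL TABLE user_es (id INT, name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'   = 'blogs/user',
  'es.nodes'      = '172.x.y.z',
  'es.port'       = '7200',
  'es.mapping.id' = 'id'
);

INSERT OVERWRITE TABLE user_es
SELECT id, name FROM user_source;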
Alternatively, you could index the data from the source RDBMS directly into Elasticsearch using something other than Hive, and still query the data through Hive for other use cases. Perhaps Sqoop the data to a different HDFS directory per execution, and run a Pig/MR/Spark job to insert just that single incremental extract?
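If you would rather stay entirely within Hive instead of Pig/MR/Spark, one rough equivalent of that pattern is a partitioned staging table where each Sqoop run lands in its own directory and only the newest partition is pushed to Elasticsearch. This is just a sketch: the paths, the date, and the user_es table from above are assumptions.
-- Staging table: each Sqoop run writes to its own sub-directory.
CREATE EXTERNAL TABLE user_extracts (id INT, name STRING)
PARTITIONED BY (extract_dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/user_extracts';

-- Register the directory produced by the latest Sqoop run as a partition.
ALTER TABLE user_extracts ADD PARTITION (extract_dt = '2015-06-01')
LOCATION '/data/user_extracts/extract_dt=2015-06-01';

-- Push only that increment into the Elasticsearch-backed table.
INSERT INTO TABLE user_es
SELECT id, name
FROM user_extracts
WHERE extract_dt = '2015-06-01';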
Hope this helps!
Cheers, that is indeed helpful. I will look into whether I can use Pig, MR, or Spark to handle the incremental loading.