HIVE to Elasticsearch via PIG: Error-Requires ID but none was given/found

rcbot · November 24, 2017, 4:31pm

I have a Hive table (not external) called "default.reading" with two columns, "name" (String) and "favoritebook" (Array of String), containing the following data:

name | favoritebook

Tom | ["Book1" , "Book2"]
Abby | ["Book3" , "Book4"]

Note that the table has no ID column; these two columns are all we have. I am trying to insert the data above into Elasticsearch with this Pig script:

-------------------------Script begins------------------------------------------------
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
'es.nodes = elasticsearch.service.consul',
'es.port = 9200',
'es.write.operation = upsert',
'es.mapping.pig.tuple.use.field.names=true'
);

hivetable = LOAD 'default.reading' USING HCatLoader();
hivetable_flat = FOREACH hivetable GENERATE name AS name, favoritebook as favoriteBook;
STORE hivetable_flat INTO 'readings/reading' USING EsStore();
-------------------------Script Ends------------------------------------------------

When running the above, I got this error: "Error: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Operation [update] requires an id but none was given/found"
**Note that I chose upsert so that we can incrementally load more data into the index afterwards.

Can you see any issue with my Pig script? Am I doing anything wrong?
If adding 'es.mapping.id' is required for the above, what should its value be, given that there is no ID in the Hive table?
favoritebook is an Array of String. I think I have also seen the error below, so I wonder how I should express that column in the script:
ERROR 2999: Unexpected internal error. Found unrecoverable error [ip:port] returned Bad Request(400) - failed to parse [favoriteBook]; Bailing out..

Thank you!

james.baiera · November 28, 2017, 5:45pm

To use the upsert operation, you must provide the name of the field to be used as the document's ID. This is because an upsert looks for an existing document with that ID, updates it if it exists, and creates it if it does not. Without an ID, it cannot do either. IDs should be unique across all documents in your index/type. If you do not have a field that can serve as an ID for each document, or you would rather rely on Elasticsearch auto-generating IDs, then you should use the index operation. Note that in that case the IDs are created on Elasticsearch's side, which may make it difficult to update or overlay data in place later.
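
As a rough sketch of the upsert route, the script from the question could name an ID field through 'es.mapping.id'. This assumes that "name" is unique per row and is acceptable as a document ID, which the thread does not confirm; if it is not, some other unique field would be needed. The alias EsUpsert is made up for illustration, and the script has not been run against a cluster.

```pig
-- Sketch only: assumes 'name' is unique and can serve as the document ID.
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();

-- EsUpsert is a hypothetical alias; only 'es.mapping.id' is new
-- compared with the original script.
DEFINE EsUpsert org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = elasticsearch.service.consul',
    'es.port = 9200',
    'es.write.operation = upsert',
    'es.mapping.id = name',
    'es.mapping.pig.tuple.use.field.names = true'
);

hivetable = LOAD 'default.reading' USING HCatLoader();
hivetable_flat = FOREACH hivetable GENERATE name AS name, favoritebook AS favoriteBook;

-- Each row is written with _id taken from its 'name' value, so a later
-- load with the same name updates that document instead of adding a new one.
STORE hivetable_flat INTO 'readings/reading' USING EsUpsert();
```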

rcbot · November 29, 2017, 1:27pm

Thank you, James, for explaining that. By index operation, do you mean "'es.index.auto.create' = 'true'"?
Do you also have an example of how I can insert an Array of String into Elasticsearch using Pig?
Thanks!

james.baiera · November 30, 2017, 9:50pm

I mean 'es.write.operation = index'.

You should be able to insert an array of strings into Elasticsearch by using a bag datatype in Pig with string data.
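
A minimal sketch of that variant, reusing the names from the original script, might look like the following. The alias EsIndex is invented for illustration. The DESCRIBE statement is only there to check the assumption that HCatLoader exposes the Hive array column as a Pig bag; whether this alone resolves the "failed to parse [favoriteBook]" error has not been verified here and may also depend on the mapping already present in the index.

```pig
-- Sketch only: uses the index operation, so no 'es.mapping.id' is required
-- and Elasticsearch generates the document IDs.
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();

-- EsIndex is a hypothetical alias.
DEFINE EsIndex org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = elasticsearch.service.consul',
    'es.port = 9200',
    'es.write.operation = index',
    'es.mapping.pig.tuple.use.field.names = true'
);

hivetable = LOAD 'default.reading' USING HCatLoader();

-- Print the schema to confirm that favoritebook arrives as a bag
-- (assumption: the Hive array<string> column is loaded as a bag of strings).
DESCRIBE hivetable;

hivetable_flat = FOREACH hivetable GENERATE name AS name, favoritebook AS favoriteBook;
STORE hivetable_flat INTO 'readings/reading' USING EsIndex();
```

Because the IDs are auto-generated in this variant, rerunning the load adds new documents rather than updating existing ones.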

