HIVE to Elasticsearch via PIG: Error-Requires ID but none was given/found

I have a HIVE table (not external) called "default.reading". It contains two columns, "name" (String) and "favoritebook" (Array of String), with the following data:

name | favoritebook

Tom | ["Book1" , "Book2"]
Abby | ["Book3" , "Book4"]

Note that the table has no ID column; these two columns are all we have. I am trying to insert the data above into Elasticsearch with this PIG script:

-------------------------Script begins------------------------------------------------
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
'es.nodes = elasticsearch.service.consul',
'es.port = 9200',
'es.write.operation = upsert',
'es.mapping.pig.tuple.use.field.names=true'
);

hivetable = LOAD 'default.reading' USING HCatLoader();
hivetable_flat = FOREACH hivetable GENERATE name AS name, favoritebook as favoriteBook;
STORE hivetable_flat INTO 'readings/reading' USING EsStore();
-------------------------Script Ends------------------------------------------------

When running the above, I got this error: "Error: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Operation [update] requires an id but none was given/found"
Note that I chose upsert so that we can incrementally load more data into the index afterwards.

  1. Can you see any issue with my PIG script? Am I doing anything wrong?
  2. If 'es.mapping.id' is required for the above, what should its value be, given that there is no ID in the HIVE table?
  3. favoritebook is an Array of String. I think I have seen the error below, so I wonder how I can express the array in the script?
    ERROR 2999: Unexpected internal error. Found unrecoverable error [ip:port] returned Bad Request(400) - failed to parse [favoriteBook]; Bailing out..

Thank you!

To use the upsert operation, you must provide the name of the field to be used as the document's ID. An upsert looks for an existing document with that ID, updates it if it exists, and creates it if it does not. Without an ID it cannot do either. IDs should be unique across all documents in your index/type. If you do not have a field that can serve as an ID for each document, or you would rather rely on Elasticsearch to generate IDs automatically, use the index operation instead. Note that those IDs are generated on Elasticsearch's side, which can make it difficult to update or overwrite documents in place later.
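
For illustration, here is a minimal sketch of the upsert variant. It assumes the "name" column is unique per row, which the question does not confirm; if it is not, rows sharing a name would overwrite each other. The alias EsUpsert is a name made up for this example; hosts, paths, and the index come from the question's script.

```
-- Sketch only: assumes "name" is unique per row (not confirmed in the question).
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();

-- EsUpsert is a hypothetical alias; es.mapping.id names the field used as the document ID.
DEFINE EsUpsert org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=elasticsearch.service.consul',
    'es.port=9200',
    'es.write.operation=upsert',
    'es.mapping.id=name'
);

hivetable = LOAD 'default.reading' USING HCatLoader();
hivetable_flat = FOREACH hivetable GENERATE name AS name, favoritebook AS favoriteBook;
STORE hivetable_flat INTO 'readings/reading' USING EsUpsert();
```

With this setting, re-running the load for an existing name should update that document rather than create a second one.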

Thank you, James, for explaining that. By "index operation", do you mean "'es.index.auto.create' = 'true'"?
Do you also have an example of how I can insert an Array of String into Elasticsearch using PIG?
Thanks!

I mean 'es.write.operation = index'.
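
As a rough sketch, the storage definition from the question would change as follows. The alias EsIndex is made up for this example, and hivetable_flat is the relation built in the question's script. No 'es.mapping.id' is set, so Elasticsearch generates the document IDs.

```
-- EsIndex is a hypothetical alias; the connection settings are copied from the question.
DEFINE EsIndex org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=elasticsearch.service.consul',
    'es.port=9200',
    'es.write.operation=index'
);

-- hivetable_flat is the relation defined in the question's script.
STORE hivetable_flat INTO 'readings/reading' USING EsIndex();
```

Keep in mind that with generated IDs, loading the same rows a second time adds new documents instead of replacing the earlier ones.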

You should be able to insert an array of strings into Elasticsearch by using Pig's bag data type holding string values.
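
One way to check what Pig actually receives, as a sketch: HCatLoader is documented to present Hive array columns as Pig bags, so DESCRIBE should report favoritebook as a bag rather than a chararray. The settings are copied from the question; the exact schema output depends on your HCatalog version.

```
SET hive.metastore.uris 'thrift://node:9000';
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();

hivetable = LOAD 'default.reading' USING HCatLoader();
-- Print the relation's schema; a Hive array<string> column is expected to show up as a bag.
DESCRIBE hivetable;
```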
