Hive JSON integration and Deprecated _ID capability

(Chris) #1

How do you explicitly set the _ID field when indexing via HIVE using the "es.input.json" option? in 1.7 we were able to use the PATH command to extract the ID from the corresponding JSON record. However, now I'm unsure how to use data from the JSON for the ID.... or if you can. I tried cheating and giving the mapping command a json path (a la the RESOURCE option for document type), but no joy.


(Costin Leau) #2

See the mapping options. Just add this to your table configuration.

(Chris) #3

I tried the originally, but it expects a HIVE field to exist (hence the map). I was trying to maintain the JSON only setting while pulling an ID from the JSON string. In reality, it doesn't really matter. Having ES automatically create the _ID field is okay, I'll just need to go into the JSON for any queries on the real ID. Still, it would be nice to be able to push this up using the hadoop libraries.

(Costin Leau) #4

The field extractor works on both JSON and non-JSON/Hive input. What was your setup and error?

(Chris) #5

Looks like I made a mistake. I tried to set the like the es.resource with {} around the field name. Changing it to just the field_name (in the JSON) worked. I didn't find anything in the documentation that mentioned format.

e.g. for a JSON string that looked like:
{"id":"xxxx", "item": {"key1": 1, "key2",2}}
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test/doc',
'' = 'id',
'es.input.json' ='yes');

works fine.


(Costin Leau) #6

Well, what made you think you should be using {. The connector can extract from both JSON or library types hence why only the actual field name is relevant.

(system) #7