Use saveJsonToEs and always keep the same _id field

(Chris) #1


I would like to know whether there's a workaround to this issue with log parsing:

Before the 2.X version of Elasticsearch, I used this Scala feature:
rdd.saveJsonToEs(indexAndType) with JSON documents containing an _id field.

And my ES mapping contained the _id field as a string.

This field was the SHA-256 hash of a log line. Therefore each log line got a stable _id in ES, and if I wanted to reindex all the data, each re-sent line just created a "version 2" of the document with that _id.
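
For readers following along, here is a minimal sketch of how such a deterministic id could be derived. The object name LogIds and the helper sha256Hex are hypothetical, and hex-encoding the digest of the UTF-8 bytes is an assumption; the original post does not say how the hash was encoded.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object LogIds {
  // Hypothetical helper: hex-encoded SHA-256 of one raw log line.
  // The same line always yields the same 64-character id.
  def sha256Hex(line: String): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    digest
      .digest(line.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x") // two lowercase hex digits per byte
      .mkString
  }
}
```

Calling LogIds.sha256Hex on the same log line twice gives the same string, which is what lets a reindex overwrite the existing document rather than add a new one.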

Now, with the way 2.X handles _id, I cannot use saveJsonToEs like this, because I am not allowed to put _id inside a JSON document.

I noticed that with the bulk API, I can set an _id:

[root@server ~]# cat all.json
{"index": {"_index": "testindex", "_type": "typeblabla", "_id": "uniq1-o"}}
{ "text": "toto" }
{"index": {"_index": "testindex", "_type": "typeblabla", "_id": "uniq1-a"}}
{ "text": "tata" }
{"index": {"_index": "testindex", "_type": "typeblabla", "_id": "uniq1-i"}}
{ "text": "titi" }

However, in Scala, I just have a file with a huge number of JSON lines:

{ "text": "toto", "_id":"uniq1"}
{ "text": "tata", "_id":"uniq2" }
{ "text": "titi", "_id":"uniq3" }
{ "text": "tutu", "_id":"uniq4" } 

How can I insert this data into ES and be sure that, if I reindex all the data, each document always gets the same _id?


(Costin Leau) #2

You should be able to achieve the same result by using saveToEsWithMeta, which accepts key/value pairs (the key is the metadata, the value is the doc), and by setting es.input.json to true.
If the key can be determined from the doc itself, you can do that by setting an extractor.
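
A rough sketch of the first suggestion, assuming Spark with the elasticsearch-hadoop Spark integration on the classpath. The app name, the local master, and the es.nodes address are placeholders, and the ids and index/type are taken from the examples in the question. Each JSON string deliberately carries no _id; the id travels as the key of the pair instead. This has not been run against a cluster, so treat it as a starting point rather than a confirmed recipe.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEsWithMeta to pair RDDs

object SaveWithStableIds {
  def main(args: Array[String]): Unit = {
    // Placeholder settings: adjust master and es.nodes for a real setup.
    val conf = new SparkConf()
      .setAppName("stable-ids")
      .setMaster("local[*]")
      .set("es.nodes", "localhost:9200")
    val sc = new SparkContext(conf)

    // Key = document id, value = raw JSON document without an _id field.
    val docs = sc.makeRDD(Seq(
      ("uniq1", """{"text": "toto"}"""),
      ("uniq2", """{"text": "tata"}"""),
      ("uniq3", """{"text": "titi"}""")
    ))

    // es.input.json tells the connector the values are already JSON strings.
    docs.saveToEsWithMeta("testindex/typeblabla", Map("es.input.json" -> "true"))

    sc.stop()
  }
}
```

Because the key is derived from the data rather than generated by ES, running the same job again should write to the same ids instead of creating new documents.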

(Chris) #3

Thanks for your help, I'll try this :)
I'll post the results of these tests here.

