At least one of our data sources sets an explicit document id (this was set up years ago, so nobody remembers why). While cleaning up our pipeline, we were wondering whether we could use a metadata field to specify the document_id, so that an empty value behaves the same as not specifying document_id at all. We already do this with pipeline, and it has greatly simplified our output logic: apart from document_id being present, the output blocks are identical. Any thoughts?
If you use a metadata field to define _id and that field is blank, you won't get a separate document in your index for that record. Or rather, it will overwrite an older doc.
For example:
The first record has metadata_id = "ben", so you have one record.
The second record's field is empty, so that record is inserted in the index with
_id = metadata_id
The third record's field is also empty, so it overwrites the second record.
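The overwrite described above can be illustrated with a toy model. This is only a sketch, assuming an index behaves like a map keyed by _id; it is plain Python, not the Elasticsearch client. The `index_doc` helper is hypothetical, and the placeholder value `"metadata_id"` is taken from the example above rather than from what Logstash actually emits.

```python
# Toy model: an index as a dict keyed by _id (not the real Elasticsearch API).
index = {}

def index_doc(doc_id, body):
    """Store body under doc_id; an existing document with the same _id is replaced."""
    index[doc_id] = body

index_doc("ben", {"n": 1})          # first record, explicit id
index_doc("metadata_id", {"n": 2})  # second record: empty id field, so a fixed placeholder is used
index_doc("metadata_id", {"n": 3})  # third record: same placeholder, replaces the second

print(len(index))            # 2
print(index["metadata_id"])  # {'n': 3}
```

Three records go in, but only two documents remain, because the second and third share one _id.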
That's how I interpreted it too. I wanted to double-check whether I was missing something, or whether there is some way to make it behave more like pipeline does.
For example:
record1 has id 1231, gets inserted with 1231
record2 has id "", gets created without specifying an id
record3 has id "", gets created without specifying an id, and does not overwrite record2 either
record4 has id 1231; follows the doc_as_upsert + action settings (or any other parameters I'm not thinking of), which may overwrite the previous record
record1 is simple.
For record2 you will have to create an ID if it is null.
For record3 you again have to create a random id if it is null, and it won't overwrite anything.
A later record with the same id as record1 will automatically overwrite the previous record. You don't have to do anything.
Think of _id as a unique primary key in a relational database. It has to be unique.
If I don't include document_id in the output config for elastic, it generates an id on its own and I don't need to do anything. What I'm looking for is a "meta value", or some way to have "" (empty string) trigger the same generation that happens today when document_id is not included in the output config. I guess the alternative is to implement the same generation function early on and have events override that value when they need to. This would let some events provide an ID when they know they need one. My concern with implementing the id function myself is collisions.
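The "generate early, override when needed" alternative could look roughly like the sketch below. It is an illustration in plain Python, not Logstash configuration, and it does not reproduce Elasticsearch's own id generation. The `resolve_document_id` helper is hypothetical, and using a random UUID (version 4) is an assumption chosen because its 122 random bits make collisions very unlikely in practice.

```python
import uuid

def resolve_document_id(provided_id):
    """Return provided_id if it is non-empty, otherwise a freshly generated random id."""
    if provided_id:  # treats None and "" as "not specified"
        return str(provided_id)
    return uuid.uuid4().hex  # random UUID, rendered as 32 hex characters

# Same sequence as the records discussed above: explicit, empty, empty, explicit.
ids = [resolve_document_id(x) for x in ["1231", "", "", "1231"]]
print(ids[0] == ids[3])  # True: both explicit ids resolve to "1231"
print(ids[1] != ids[2])  # True: each empty id gets its own generated value
```

With this approach document_id would always be set in the output, so events that supply an ID keep the overwrite or upsert behavior, while events without one each get a distinct id.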