Elastic document_id

Is there any defined behavior for the document_id option (Elasticsearch output plugin, Logstash Reference [8.6]) similar to that of the pipeline option (same plugin, same reference)?

We have at least one data source for which we set a document ID (this started years ago, so we don't know or remember why). We have been cleaning up our pipeline and were wondering whether we could use a metadata field to specify the document_id, such that an empty value behaves the same as if document_id weren't specified at all. We already do this with pipeline, and it has greatly simplified the output logic. Other than document_id being present, the output blocks are the same. Any thoughts?

Current

if source {
  elasticsearch {
    ...
    document_id => "<field value>"
  }
}
else {
  elasticsearch {
    ...
  }
}
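For comparison, here is a minimal sketch of the pipeline pattern mentioned above. The field name [@metadata][pipeline] is an assumption for illustration, and the other output settings are elided as in the snippet above. The plugin documentation states that pipeline is not set when its value resolves to an empty string, which is the behavior being asked about for document_id.

```
output {
  elasticsearch {
    # Other settings elided, as in the snippet above.
    # [@metadata][pipeline] is a hypothetical field set by earlier filters.
    # Per the plugin docs, an empty string here means no ingest pipeline is used.
    pipeline => "%{[@metadata][pipeline]}"
  }
}
```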

If you use a metadata field to define _id and that field is blank, you won't get a separate document for that record in your index.

Or rather, it will overwrite an older doc.

For example:
The first record has metadata_id = "ben", so you have one record.
The second record has it empty, so that record is inserted into the index with
_id = metadata_id

If the third record is also empty, it will overwrite the second record.

That's how I interpreted things too. I wanted to double-check whether there was something I was missing, or some way to have it behave closer to pipeline.

For example:
record1 has id 1231, gets inserted with 1231
record2 has id "", gets created without specifying an id
record3 has id "", gets created without specifying an id, and does not overwrite record2 either
record4 has id 1231; follows the settings of doc_as_upsert + action (or any other parameters I'm not thinking of), which may overwrite the previous record

record1 is simple.
record2: you will have to create an ID if it is null.
record3: again, you have to create a random ID if it is null, and then it won't overwrite anything.
record4: if it has the same id as record1, it will automatically overwrite the previous record. You don't have to do anything.

Think of _id as the primary key in a relational database: it has to be unique.

Agreed on the primary key function.

If I don't include document_id in the output config for Elasticsearch, it generates an ID on its own and I don't need to do anything. What I'm looking for is a "meta value", or some way to have "" (empty string) trigger the same generation that happens today when document_id is not included in the output config. I guess the alternative is to implement the same kind of generation early in the pipeline and have events override that value when they need to. This would let some events provide an ID when they know they need one. My concern with implementing the ID function myself is collisions.
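One way to sketch the "generate early, override later" idea is with the fingerprint filter's UUID method, which produces a random UUID rather than a value derived from the event. This is not the algorithm Elasticsearch uses for its own auto-generated IDs, but collisions between random UUIDs are extremely unlikely. The [type] check and the [order][id] field are hypothetical placeholders for whatever identifies the source that needs a fixed ID.

```
filter {
  # Give every event a random default id first.
  fingerprint {
    method => "UUID"
    target => "[@metadata][id]"
  }

  # Hypothetical condition and field: a source that needs a stable id
  # overrides the default.
  if [type] == "orders" and [order][id] {
    mutate {
      replace => { "[@metadata][id]" => "%{[order][id]}" }
    }
  }
}
```

With this in place, a single output block can set document_id => "%{[@metadata][id]}" for all events.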

You can use a metadata field for this, for example [@metadata][id].

This is how you generate it in Logstash:

mutate { add_field => { "[@metadata][id]" => "%{[host][name]}_%{some_field}" } }

After this you can check whether it is blank and, if so, reset it to whatever you like.

And in the output section you can use it like this:

document_id => "%{[@metadata][id]}"
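Putting those steps together, a rough sketch might look like the following. Two things here are assumptions rather than part of the suggestion above: the guard around mutate (an unresolved %{...} reference is left in the value as literal text, so the ID is only built when both fields exist), and the choice of a SHA256 hash of [message] as the "whatever you like" fallback. A content hash means two events with an identical message share one _id and the later one overwrites the earlier one, so use a random value instead if that is not wanted.

```
filter {
  # Build the id only when both parts exist; otherwise the unresolved
  # %{...} text would end up in the id itself.
  if [host][name] and [some_field] {
    mutate {
      add_field => { "[@metadata][id]" => "%{[host][name]}_%{some_field}" }
    }
  }

  # Blank check: the field is missing or empty.
  if ![@metadata][id] or [@metadata][id] == "" {
    # Assumed fallback: a deterministic hash of the message field.
    fingerprint {
      source => "message"
      method => "SHA256"
      target => "[@metadata][id]"
    }
  }
}

output {
  elasticsearch {
    # Other settings elided.
    document_id => "%{[@metadata][id]}"
  }
}
```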
