Create new version of existing document / enforce the id

Florck · December 22, 2016, 7:42am

Dear all,

For our needs, we are developing a beat that should not add a new document to Elasticsearch every time, but potentially update an existing one.

We have not found any way to enforce the id of the document to be published, so today we delete the existing document with a separate ES request in Go, which is very counterproductive.

Do you have a solution, when shipping directly to Elasticsearch, to publish a new version of a document and, if possible, use custom ids?
Thank you

ruflin · December 23, 2016, 12:50pm

Currently Beats does not send an id to Elasticsearch, so the id is auto-generated. Beats was not really designed for the above case, as it assumes each new event is a new document. Perhaps you could work around this with Logstash by creating a specific id, but I have never tried that.

Florck · December 26, 2016, 8:08am

We have thought about using Logstash, but the way a beat works is that people can choose whether they publish directly, to a file, through Logstash, or through a lot of other possibilities.

Would it really be a big deal to allow this in libbeat? I don't see any bad side effects, just a new range of potential beats!

Best regards,

steffens · December 27, 2016, 2:05pm

What does update mean in your case? Will you have a complete new event with all fields serialized (updated and not-updated fields), or just the fields being changed? This would require a change in APIs. I feel like the latter case will not fit Beats well, as this model of updates might be weird to handle for other outputs like kafka/redis/logstash. It would require changes and support for an update model in Logstash, plus tie the other outputs to Logstash as the consumer.

How exactly will the ID be 'computed'? One can use Ingest Node in Elasticsearch to compute the document's ID (or just set _id). But this will overwrite the complete document; that is, you're still required to have both updated and not-updated fields in your documents.

For deduplication purposes we've been thinking of generating a random document id when publishing events. This idea could easily be extended to have documents provide their own _id. But nothing concrete yet.

Florck · December 28, 2016, 9:04am

We are precisely in the first case: we resubmit all values.

And the feature we need is exactly that: letting the beat provide the _id itself if needed!

steffens · December 29, 2016, 12:02am

Thanks for the input. I'm sure we will consider this use-case once we touch _id generation support in libbeat.

Have you tried to include _id right in your event?

It's not perfect, but if putting _id right in your event doesn't work (no timeline on _id support yet), I'd propose setting the id in a field of your event and using Elasticsearch ingest node to set the event's id. You can test this in the Kibana console. Using the rename processor should suffice.

Florck · December 29, 2016, 10:10am

If I remember correctly, we did try to set the _id in the document passed to the bt.client.PublishEvent function, but without success: it said that this is not allowed and was closing.

I don't have the output at hand; I will try to reproduce it if you want.

Thank you!

steffens · December 29, 2016, 3:49pm

That's interesting. Well, in this case the current workaround requires a custom field (e.g. document_id) and an Elasticsearch ingest node pipeline that renames document_id to _id.
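To make the workaround above concrete, here is a minimal Go sketch that only builds the JSON body of such an ingest pipeline, using the rename processor's `field` and `target_field` options. It is untested against a cluster; the field name document_id comes from the post above, while the helper name buildIDPipeline and the description text are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// renameProcessor mirrors the two options of the ingest "rename" processor
// that this sketch needs.
type renameProcessor struct {
	Field       string `json:"field"`
	TargetField string `json:"target_field"`
}

type processor struct {
	Rename *renameProcessor `json:"rename,omitempty"`
}

type pipeline struct {
	Description string      `json:"description"`
	Processors  []processor `json:"processors"`
}

// buildIDPipeline (hypothetical helper) returns the JSON body of a pipeline
// that renames customField to _id, as suggested in the thread.
func buildIDPipeline(customField string) ([]byte, error) {
	p := pipeline{
		Description: "use " + customField + " as the document id",
		Processors: []processor{
			{Rename: &renameProcessor{Field: customField, TargetField: "_id"}},
		},
	}
	return json.Marshal(p)
}

func main() {
	body, err := buildIDPipeline("document_id")
	if err != nil {
		panic(err)
	}
	// This body would be sent with PUT _ingest/pipeline/<name>,
	// where <name> is whatever pipeline name you choose.
	fmt.Println(string(body))
}
```

The printed body can be pasted into the Kibana console to try the pipeline before wiring it into a beat.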

andrewkroh · December 29, 2016, 4:09pm

I once started on a change to allow a Beat to provide an id in the event passed to PublishEvent(), which would then be included in the bulk metadata as _id. It's a small change to libbeat; it was never merged and it didn't have tests: https://github.com/andrewkroh/beats/commit/539c7f0e50e16cf3a5be8e551d3078dc0a9ef887#diff-4932dd4c52fd55b09cdd16fc456d29e6

Florck · January 3, 2017, 6:31pm

I can confirm my assumption: if I try to put the "_id" field in the event, I get

2017/01/03 18:27:41.264103 client.go:420: WARN Can not index event (status=400): {"type":"mapper_parsing_exception","reason":"Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters."}

Florck · January 21, 2017, 4:16pm

@steffens, this would be a 5-minute change to include that functionality (I tested it)
github.com/elastic/beats/libbeat/outputs/elasticsearch/client.go

@@ -305,6 +305,7 @@ func eventBulkMeta(
        type bulkMetaIndex struct {
                Index   string `json:"_index"`
                DocType string `json:"_type"`
+               ID      string `json:"_id,omitempty"`
        }
        type bulkMeta struct {
                Index bulkMetaIndex `json:"index"`
@@ -317,6 +318,12 @@ func eventBulkMeta(
                        DocType: event["type"].(string),
                },
        }
+       if id, ok := event["id"]; ok {
+               meta.Index.ID, ok = id.(string)
+               if !ok {
+                       logp.Err("id is not a string")
+               }
+       }
        return meta
 }

Do you think you can add it to the next release?
Thank you
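For readers who want to try the idea outside libbeat, the patch above can be approximated by the following standalone Go sketch. The function signature, the index name, and the sample events are assumptions (the real eventBulkMeta signature is truncated in the diff), and unlike the patch, which logs and continues, this version returns an error when the id is not a string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type bulkMetaIndex struct {
	Index   string `json:"_index"`
	DocType string `json:"_type"`
	ID      string `json:"_id,omitempty"` // left out of the JSON when empty
}

type bulkMeta struct {
	Index bulkMetaIndex `json:"index"`
}

// eventBulkMeta builds the bulk action line for one event. The signature is
// an assumption made for this sketch, not libbeat's actual one.
func eventBulkMeta(index string, event map[string]interface{}) (bulkMeta, error) {
	docType, ok := event["type"].(string)
	if !ok {
		return bulkMeta{}, fmt.Errorf("event has no string \"type\" field")
	}
	meta := bulkMeta{Index: bulkMetaIndex{Index: index, DocType: docType}}
	if id, found := event["id"]; found {
		s, ok := id.(string)
		if !ok {
			return bulkMeta{}, fmt.Errorf("id is not a string: %T", id)
		}
		meta.Index.ID = s
	}
	return meta, nil
}

func main() {
	events := []map[string]interface{}{
		{"type": "status", "id": "host-42", "value": 1},
		{"type": "status", "value": 2}, // no id: Elasticsearch generates one
	}
	for _, ev := range events {
		meta, err := eventBulkMeta("mybeat-2017.01.21", ev)
		if err != nil {
			panic(err)
		}
		line, err := json.Marshal(meta)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(line))
	}
}
```

Because of `omitempty`, events without an id produce the same action line as before the change, so existing beats would be unaffected. Note that the id still remains in the event body here, which is the duplication raised in the reply below.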

steffens · January 22, 2017, 7:24pm

Feel free to open a PR and continue the discussion there. For the next release your proposal is a little late, though.

The diff looks a little incomplete, as the pipeline case is not handled. I'm -1 on handling fields in the event specially in the outputs. Plus, these fields duplicate information, as they are also indexed as part of the event. That's why we've been thinking about adding support for passing additional event metadata that is not published with the event (related PR), in order to pass hints to the outputs. Note that the change is only an intermediate solution; we've got plans to refactor and improve the publisher pipeline (including how metadata can be passed).

Florck · January 23, 2017, 8:28am

Hello @steffens,

I have not fully understood the long-term and middle-term solutions.

Anyway, I will just patch the vendor folder myself for each of my projects until then, because I have no possibility to open a MR right now.

Best regards,

steffens · January 23, 2017, 12:35pm

The middle-term solution is to pass an additional metadata object to PublishEvent or PublishEvents. The long-term solution is not fully clear yet.
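As a rough illustration of that middle-term idea, the Go sketch below keeps the id in a separate metadata map, so it reaches the bulk action line without being indexed as part of the document. The Event type, the "id" metadata key, and the bulkLines helper are all hypothetical; this is not libbeat's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical pairing of published fields with output-only
// hints. Meta is never serialized into the document source.
type Event struct {
	Fields map[string]interface{}
	Meta   map[string]string
}

// bulkLines (hypothetical helper) returns the action line and the source
// line of a bulk "index" operation for one event.
func bulkLines(index, docType string, ev Event) (action, source []byte, err error) {
	idx := map[string]string{"_index": index, "_type": docType}
	// Reading from a nil map is safe in Go and yields the empty string.
	if id := ev.Meta["id"]; id != "" {
		idx["_id"] = id
	}
	if action, err = json.Marshal(map[string]interface{}{"index": idx}); err != nil {
		return nil, nil, err
	}
	if source, err = json.Marshal(ev.Fields); err != nil {
		return nil, nil, err
	}
	return action, source, nil
}

func main() {
	ev := Event{
		Fields: map[string]interface{}{"value": 1},
		Meta:   map[string]string{"id": "host-42"},
	}
	action, source, err := bulkLines("mybeat", "status", ev)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(action))
	fmt.Println(string(source))
}
```

Outputs that have no notion of a document id could simply ignore Meta, which is the kind of hint-passing described above.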

system · February 20, 2017, 12:35pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
