Create new version of existing document / enforce the id


#1

Dear all,

For our needs, we are developing a Beat that should not add new documents to Elasticsearch every time, but potentially update existing ones.

We have not found any way to enforce the id of the document to be published, so today we delete the existing document with a separate ES request in Go, which is very counterproductive.

Do you have a solution, when shipping directly to Elasticsearch, for publishing a new version of a document and, if possible, using custom ids?
Thank you


(ruflin) #2

Currently, Beats does not send an id to Elasticsearch, so the id is auto-generated. Beats was not really designed for the above case, as it assumes each new event is a new document. Perhaps you could work around this with Logstash by creating a specific id, but I have never tried that.


#3

We have thought about using Logstash, but the way a Beat works is that people can choose whether they publish directly, to a file, through Logstash, or via many other outputs.

Would it really be a big deal to allow this in libbeat? I don't see any bad side effects, just a new range of potential Beats!

Best regards,


(Steffen Siering) #4

What does update mean in your case? Will you have a complete new event with all fields serialized (updated and not-updated fields), or just the fields being changed? This would require a change in APIs. I feel like the latter case will not fit Beats well, as this model of updates might be weird to handle for other outputs like Kafka/Redis/Logstash. This would require changes and support for an update model in Logstash, and would tie the other outputs to Logstash as a consumer.

How exactly will the ID be 'computed'? One can use Ingest Node in Elasticsearch to compute the document's ID (or just set _id). But this will overwrite the complete document; that is, you're still required to have both updated and not-updated fields in your documents.

For deduplication purposes we've been thinking about generating a random document id when publishing events. This idea could easily be extended to let documents provide their own _id. But nothing concrete yet.


#5

We are precisely in the first case: we resubmit all values.

And the feature we need is exactly that: letting the Beat provide the _id itself when needed!


(Steffen Siering) #6

Thanks for the input. I'm sure we will consider this use-case once we touch _id generation support in libbeat.

Have you tried to include _id right in your event?

It's not perfect, but if including _id right in your event doesn't work (no timeline on _id support yet), I'd propose setting the id in a field of your event and using the Elasticsearch ingest node to set the event's id. You can test this in the Kibana console. Using the rename processor should suffice.


#7

If I remember correctly, we did try to set the _id in the document passed to the bt.client.PublishEvent function, but without success: it said that this is not allowed and was closing.

I do not have the output at hand; I will try to reproduce it if you want.

Thank you!


(Steffen Siering) #8

That's interesting. Well, in this case the current workaround requires a custom field (e.g. document_id) and elasticsearch ingest node to rename document_id to _id. :frowning:
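A minimal sketch of the pipeline definition this workaround refers to, written in Go so it can sit next to the Beat's code. It only builds the JSON body for an ingest pipeline with a single rename processor; the helper name idPipelineBody and the field name document_id are illustrative assumptions, and whether rename accepts _id as its target on a given Elasticsearch version is taken from the suggestion above rather than verified here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// renameProcessor holds the two basic options of the ingest "rename" processor.
type renameProcessor struct {
	Field       string `json:"field"`
	TargetField string `json:"target_field"`
}

// processor wraps one processor definition; only rename is modelled here.
type processor struct {
	Rename *renameProcessor `json:"rename,omitempty"`
}

// pipeline is the body sent when creating an ingest pipeline.
type pipeline struct {
	Description string      `json:"description"`
	Processors  []processor `json:"processors"`
}

// idPipelineBody (hypothetical helper) returns the JSON body of a pipeline
// that moves the custom field srcField into the _id metadata field.
func idPipelineBody(srcField string) ([]byte, error) {
	p := pipeline{
		Description: "use " + srcField + " as the document _id",
		Processors: []processor{
			{Rename: &renameProcessor{Field: srcField, TargetField: "_id"}},
		},
	}
	return json.Marshal(p)
}

func main() {
	body, err := idPipelineBody("document_id")
	if err != nil {
		panic(err)
	}
	// Body for a "PUT _ingest/pipeline/<name>" request; the name is up to you.
	fmt.Println(string(body))
}
```

The printed body can be pasted into the Kibana console to try the pipeline before pointing a Beat at it.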


(Andrew Kroh) #9

I had once started on a change to allow a Beat to provide an id in the event passed to PublishEvent() that would then be included with the bulk metadata as _id. It's a small change to libbeat; it was never merged and it didn't have tests: https://github.com/andrewkroh/beats/commit/539c7f0e50e16cf3a5be8e551d3078dc0a9ef887#diff-4932dd4c52fd55b09cdd16fc456d29e6


#10

I confirm my assumption: if I try to put the "_id" field in the event, I get:

2017/01/03 18:27:41.264103 client.go:420: WARN Can not index event (status=400): {"type":"mapper_parsing_exception","reason":"Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters."}

#11

@steffens, this would be a five-minute change to add that functionality (I tested it):
github.com/elastic/beats/libbeat/outputs/elasticsearch/client.go

@@ -305,6 +305,7 @@ func eventBulkMeta(
        type bulkMetaIndex struct {
                Index   string `json:"_index"`
                DocType string `json:"_type"`
+               ID      string `json:"_id,omitempty"`
        }
        type bulkMeta struct {
                Index bulkMetaIndex `json:"index"`
@@ -317,6 +318,12 @@ func eventBulkMeta(
                        DocType: event["type"].(string),
                },
        }
+       if id, ok := event["id"]; ok {
+               meta.Index.ID, ok = id.(string)
+               if !ok {
+                       logp.Err("id is not a string")
+               }
+       }
        return meta
 }

Do you think you can add it to the next release?
Thank you
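For readers who want to try the idea in the diff above outside of libbeat, here is a self-contained sketch. It is not the libbeat function: the signature of eventBulkMeta below is simplified and hypothetical, the index name is made up, and a non-string id is returned as an error instead of being logged.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bulkMetaIndex mirrors the struct from the diff, including the new ID field.
type bulkMetaIndex struct {
	Index   string `json:"_index"`
	DocType string `json:"_type"`
	// omitempty drops "_id" when no id was given, so Elasticsearch
	// falls back to auto-generating one.
	ID string `json:"_id,omitempty"`
}

type bulkMeta struct {
	Index bulkMetaIndex `json:"index"`
}

// eventBulkMeta is a simplified stand-in (hypothetical signature) that builds
// the bulk action metadata and copies event["id"] into _id when present.
func eventBulkMeta(index string, event map[string]interface{}) (bulkMeta, error) {
	docType, _ := event["type"].(string)
	meta := bulkMeta{Index: bulkMetaIndex{Index: index, DocType: docType}}
	if raw, found := event["id"]; found {
		id, ok := raw.(string)
		if !ok {
			return meta, fmt.Errorf("event id has type %T, want string", raw)
		}
		meta.Index.ID = id
	}
	return meta, nil
}

func main() {
	events := []map[string]interface{}{
		{"type": "mybeat", "id": "host-42"}, // explicit id
		{"type": "mybeat"},                  // no id: "_id" is omitted
	}
	for _, event := range events {
		meta, err := eventBulkMeta("mybeat-2017.01.03", event)
		if err != nil {
			panic(err)
		}
		line, err := json.Marshal(meta)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(line))
	}
}
```

Note that with this approach the id field is still part of the event itself, so it is also indexed as a regular field of the document.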


(Steffen Siering) #12

Feel free to open a PR and continue the discussion there. For the next release your proposal is a little late, though.

The diff looks a little incomplete, as the pipeline case is not handled. I'm -1 on handling fields in the event specially in the outputs. Plus, these fields duplicate information, as they are indexed as part of the event. That's why we've been thinking about adding support for passing additional event metadata that is not published with the event (related PR), in order to pass hints to the outputs. Note that the change is only an intermediate solution, and we've got plans to refactor and improve the publisher pipeline (including how metadata can be passed).


#13

Hello @steffens,

I have not fully understood the long-term and middle-term solutions.

Anyway, I will just patch the vendor folder myself in each of my projects until then, because I am not able to open a MR right now.

Best regards,


(Steffen Siering) #14

The middle-term solution is to pass an additional metadata object to PublishEvent or PublishEvents. The long-term solution is not fully clear yet.
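To make the middle-term idea more concrete, here is a rough sketch under stated assumptions: the Event and Meta types and the bulkBody helper are hypothetical and are not the libbeat API. The id travels in a separate metadata value, ends up only in the bulk action line, and is never serialized into the document. The _type field is left out of the action line for brevity.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Event is the document body that gets indexed.
type Event map[string]interface{}

// Meta (hypothetical) carries hints for the output and is never indexed.
type Meta struct {
	ID string
}

type actionIndex struct {
	Index string `json:"_index"`
	ID    string `json:"_id,omitempty"`
}

type action struct {
	Index actionIndex `json:"index"`
}

// bulkBody renders one action line and one source line per event, in the
// newline-delimited JSON format expected by the Elasticsearch bulk API.
func bulkBody(index string, events []Event, metas []Meta) ([]byte, error) {
	if len(events) != len(metas) {
		return nil, fmt.Errorf("got %d events but %d metadata entries", len(events), len(metas))
	}
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode writes a trailing newline per value
	for i, ev := range events {
		a := action{Index: actionIndex{Index: index, ID: metas[i].ID}}
		if err := enc.Encode(a); err != nil {
			return nil, err
		}
		if err := enc.Encode(ev); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	body, err := bulkBody("mybeat",
		[]Event{{"status": "up"}},
		[]Meta{{ID: "host-42"}})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```

Sending an index action with an _id that already exists replaces the stored document and increments its version, which matches the full-resubmit case described earlier in the thread.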

