Drop documents if certain field values are the same as the previous document

Hello,

I've got some software that generates status information every second, but I only want to record state changes in certain fields, not every single document. For example, given this stream:

{ timestamp: T+0, alarms: "OK", extra_data: ... }
{ timestamp: T+1, alarms: "OK", extra_data: ... }
{ timestamp: T+2, alarms: "ERR", extra_data: ... }
{ timestamp: T+3, alarms: "ERR", extra_data: ... }
{ timestamp: T+4, alarms: "OK", extra_data: ... }
{ timestamp: T+5, alarms: "OK", extra_data: ... }
{ timestamp: T+6, alarms: "OK", extra_data: ... }

In the above set I would only need to record the documents at T+2 (OK -> ERR) and T+4 (ERR -> OK).

I feel like the Elasticsearch filter plugin might be made to do what I want, but it strikes me as expensive (I would have to make an API query for every document, and there are going to be a lot of documents). A better option would be for Logstash to remember the previous values and filter its own batches before sending them to Elasticsearch, but I don't know if it can do that.

I've read about setting the document ID to a hash of the unique fields, excluding the timestamp, but that's not going to work for me: this data might flip between OK and ERR several times a day, every "OK" document would get the same ID, and I need to capture each state change.

I've also read about running post-processing jobs on Elasticsearch to de-duplicate documents, but this also strikes me as expensive (plus I want the state changes available quickly, rather than waiting for a de-dupe job to finish).

You could use the ruby filter in Logstash to remember the last value of the fields in global variables and then compare each incoming event against them.

But you would have to set the number of workers in Logstash to 1, otherwise events could be processed out of order. (Even with workers set to 1, I am not sure the order would always be correct.)
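
As a rough sketch of that comparison logic in plain Ruby (not a tested Logstash config): `ChangeDetector` is a hypothetical helper name, and treating the very first document as a baseline that gets dropped is an assumption based on the example in the question. Inside a ruby filter the field lookup would go through `event.get` and the drop through `event.cancel`, but that wiring is left out here.

```ruby
# Hypothetical helper: remembers the last seen values of the watched fields
# and reports whether an incoming document changes any of them.
class ChangeDetector
  def initialize(fields)
    @fields = fields
    @last = nil # no baseline until the first document arrives
  end

  # Returns true when the document should be kept (a watched field changed).
  def changed?(doc)
    current = @fields.map { |field| doc[field] }
    previous = @last
    @last = current
    # Assumption: the first document only sets the baseline and is dropped.
    !previous.nil? && previous != current
  end
end

detector = ChangeDetector.new(["alarms"])

# Same sequence as in the question, with T+n written as n.
stream = [
  { "timestamp" => 0, "alarms" => "OK" },
  { "timestamp" => 1, "alarms" => "OK" },
  { "timestamp" => 2, "alarms" => "ERR" },
  { "timestamp" => 3, "alarms" => "ERR" },
  { "timestamp" => 4, "alarms" => "OK" },
  { "timestamp" => 5, "alarms" => "OK" },
  { "timestamp" => 6, "alarms" => "OK" }
]

kept = stream.select { |doc| detector.changed?(doc) }
puts kept.map { |doc| doc["timestamp"] }.inspect # prints [2, 4]
```

This only works if documents reach the detector in their original order, which is why the single-worker caveat matters.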

OK, thanks, I'll check out the ruby filter. I've started to think that this is probably best achieved further upstream though: de-duplicate the events before they get sent to Logstash.
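
A minimal sketch of what such an upstream step could look like, assuming the generator can emit one JSON document per line. The function name `forward_changes` and the line-per-document format are illustrative assumptions, not something my software does today.

```ruby
require "json"
require "stringio"

# Hypothetical upstream filter: copies a JSON line from input to output only
# when one of the watched fields differs from the previous line.
def forward_changes(input, output, fields)
  last = nil
  input.each_line do |line|
    line = line.strip
    next if line.empty?

    current = JSON.parse(line).values_at(*fields)
    # Assumption: the first line only establishes the baseline.
    output.puts(line) if last && last != current
    last = current
  end
end

# Stand-in for the generator's output stream.
source = StringIO.new(<<~LINES)
  {"timestamp": 0, "alarms": "OK"}
  {"timestamp": 1, "alarms": "OK"}
  {"timestamp": 2, "alarms": "ERR"}
  {"timestamp": 3, "alarms": "ERR"}
  {"timestamp": 4, "alarms": "OK"}
LINES

sink = StringIO.new
forward_changes(source, sink, ["alarms"])
puts sink.string
```

In a real pipeline the same function could be called with `$stdin` and `$stdout` so that it sits between the generator and whatever input Logstash reads from. Because it runs as a single sequential process, event order is preserved without touching the Logstash worker settings.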
