Hello,
I've got some software that generates status information every second, but I only want to record state changes of certain fields, not every single document. An example of what the stream looks like is:
{ timestamp: T+0, alarms: "OK", extra_data: ... }
{ timestamp: T+1, alarms: "OK", extra_data: ... }
{ timestamp: T+2, alarms: "ERR", extra_data: ... }
{ timestamp: T+3, alarms: "ERR", extra_data: ... }
{ timestamp: T+4, alarms: "OK", extra_data: ... }
{ timestamp: T+5, alarms: "OK", extra_data: ... }
{ timestamp: T+6, alarms: "OK", extra_data: ... }
In the above set I would only need to record the documents at T+2 (OK -> ERR) and T+4 (ERR -> OK).
I feel like the Elasticsearch filter plugin might be made to do what I want, but it strikes me as expensive (I'd have to make an API query for every document, and there are going to be a lot of documents). A better option would be for Logstash to remember state and filter its own batches before sending them to Elasticsearch, but I don't know if it can do that.
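Something like the sketch below is what I'm imagining: a ruby filter that remembers the last alarms value and drops documents where it hasn't changed. This is only a rough idea, and it assumes pipeline.workers is set to 1 (and a single Logstash instance) so events pass through the filter in order:

```
filter {
  ruby {
    # Remember the last alarms value across events.
    init => "@last_alarms = nil"
    code => "
      current = event.get('alarms')
      if current == @last_alarms
        event.cancel             # no state change, drop the document
      else
        @last_alarms = current   # state changed, keep it and remember the new value
      end
    "
  }
}
```

I'm not sure how well that holds up across restarts or with multiple pipeline workers, though.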
I've read about setting the document ID to a hash of the unique fields (excluding the timestamp), but that won't work for me: this data might flip between OK and ERR several times a day, and I need to capture every state change.
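(For reference, the approach I mean is roughly a fingerprint filter feeding document_id, something like the sketch below with placeholder field names. Repeated OK/ERR flips would hash to the same ID and overwrite each other, which is exactly what I don't want.)

```
filter {
  fingerprint {
    source              => ["alarms", "some_unique_field"]   # placeholder field names
    concatenate_sources => true
    method              => "SHA256"
    target              => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```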
I've also read about running post-processing jobs on Elasticsearch to de-duplicate documents, but that also strikes me as expensive (and I want the state changes available quickly, rather than waiting for a de-dupe job to finish).