Hey christopilus,
Thanks for the good follow-up and the great suggestions. We actually went ahead with a solution similar to your first one, but combined it with a clone of the events we're interested in.
The full "solution", explained simply:
- Add a support tag "PERFORMANCE" (or something similar) to the events you want to use in latest_run dashboards but also need to see in dashboards where all jobs are shown.
- For all events with the tag "PERFORMANCE", clone them in logstash (see the clone plugin) and add another tag "LATEST_RUN" to the cloned events. Now you have duplicate events for all performance metrics.
- For events that have the "LATEST_RUN" tag set, use a specific elasticsearch output in logstash that applies the following rules/settings:
  - Overwrite the document_id with a fixed value built from a combination of other fields that is guaranteed to be unique for one job on a specific rundate. Make sure to apply the setting that removes documents with the old document_id.
  - The index you write these events to has to be the same for all possible runs and reruns of the same job. In other words: don't choose the index dynamically based on the rundate of the job or on when logstash parses the logs, because documents with the same id in different indexes will not be updated.
- You now have all runs of the job in elasticsearch, and events with the "PERFORMANCE" tag are duplicated. In every visualization you now need to decide whether to filter on the "LATEST_RUN" tag or exclude it (if you forget, the duplicated events show up in the visualization). Dashboards that filter on the tag now update themselves automatically to reflect the latest run.
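To make the overwrite behaviour in the steps above a bit more concrete, here is a rough Python sketch that models it with a plain dict standing in for elasticsearch. It is not logstash config and not what we actually run. The field names (job_name, run_date, duration_s), the index names ("all-runs", "latest-run") and the helper functions are all made up for illustration.

```python
import copy
import itertools

# Stand-in for auto-generated document ids (assumption for this sketch).
_auto_id = itertools.count(1)


def expand(event):
    """Pass the original event through and clone it if tagged PERFORMANCE."""
    events = [event]
    if "PERFORMANCE" in event["tags"]:
        clone = copy.deepcopy(event)
        clone["tags"].append("LATEST_RUN")
        events.append(clone)
    return events


def target(event):
    """Pick the (index, document_id) pair an event is written to."""
    if "LATEST_RUN" in event["tags"]:
        # Fixed index plus an id derived from job and rundate:
        # a rerun of the same job maps to exactly the same key.
        return "latest-run", f'{event["job_name"]}-{event["run_date"]}'
    # Ordinary events get a fresh id, so every run is kept.
    return "all-runs", f"auto-{next(_auto_id)}"


def ingest(store, event):
    """Write an event (and its clone, if any) into the fake store."""
    for e in expand(event):
        store[target(e)] = e  # same key means the older document is replaced


store = {}
first_run = {"job_name": "load_orders", "run_date": "2024-01-15",
             "tags": ["PERFORMANCE"], "duration_s": 310}
rerun = {"job_name": "load_orders", "run_date": "2024-01-15",
         "tags": ["PERFORMANCE"], "duration_s": 120}
ingest(store, first_run)
ingest(store, rerun)

# What a dashboard would see with and without the LATEST_RUN filter.
latest = [d for d in store.values() if "LATEST_RUN" in d["tags"]]
all_runs = [d for d in store.values() if "LATEST_RUN" not in d["tags"]]
```

After both calls, `latest` holds only the rerun while `all_runs` still holds both runs. If the clone were written to a different index per run, the two clones would end up under different keys and nothing would be overwritten, which is the pitfall described in the index rule.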
This feature is handy if you have daily batch chains with a fixed number of jobs (1000+), each with their own logging, that are prone to reruns due to data quality issues or whatever. It lets you compare the performance of the chains across multiple days and/or months without reruns skewing the view.
Thanks again for your help! I hope this explanation helps someone else.