How to structure observability data about the health of my (custom) app

Hi,

I have a (big, old) application which is monitored in production, to detect when something is misbehaving and to proactively notify operators.

The current monitoring consists of a big script which executes a batch of 200 tests every 15 minutes (details below); an HTML report is then generated and sent by mail.

I would like to export the results of those tests into my Elastic Stack.

The objectives are not fully defined yet; in the mid term they could include:

  • storing a history of the application's health, keeping a trace of what happened and when.
  • generating KPIs
  • using the alerting features provided by the Elastic Stack, rather than sending mails from the script.

My questions:

As I'm free to choose the output generated by the script (number of 'events', their content, ...), I'm essentially starting from a blank page and have some choices to make:

  • Should I model my data as a collection of events? As metrics? A mix?
    Which (ECS) field names should/must I output, and into which indices?

  • Which approach should I choose for the data structure, and which best practices should I follow (on top of ECS) in order to leverage the features already built into the Elastic Stack (Observability, alerting, ...)?

Thanks in advance.

Content of a 'test':

A 'test' is composed of:

  • a category, sub-category and a name
  • a severity level (major, blocking, system blocking, ...)
  • a final status of the test: "OK" / "FAIL", based on:
    • the value returned by the test (one of a string, a date, a number) ...
    • which is compared to a reference value for this test ...
    • using a comparison operator (<,>,=, in, ...) for this test.
  • other metadata around each test (unique ID of the script execution, start/end timestamps, test execution duration, instance, ...)
  • and finally, a few metadata fields related to the whole batch itself (unique ID, instance, duration, ...)
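For illustration, the OK/FAIL derivation described above could be sketched like this (Python; the operator set and the numeric coercion are assumptions on my part, the real script handles more cases):

```python
# Illustrative sketch only: derive a test's final status from its returned
# value, its reference value, and a comparison operator.

def evaluate_test(returned, expected, operator):
    """Return "OK" when the comparison holds, "FAIL" otherwise."""
    comparisons = {
        "=":  lambda a, b: a == b,
        "<":  lambda a, b: float(a) < float(b),   # numeric tests arrive as strings
        ">":  lambda a, b: float(a) > float(b),
        "in": lambda a, b: a in b,
    }
    return "OK" if comparisons[operator](returned, expected) else "FAIL"

print(evaluate_test("2", "100", "<"))                  # OK
print(evaluate_test("02/01/2024", "07/02/2024", "="))  # FAIL
```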

Below are two examples of test data:

   {
      "run_id":"MONITOR_PREPROD_1707316974989",
      "instance_name":"PREPROD",
      "test_name":"current business date",
      "category":"Availability checks",
      "sub_category":"Dates",
      "test_severity":"3",
      "test_result":"FAIL",
      "test_returned_value":"02/01/2024",
      "test_expected_value":"07/02/2024",
      "test_eval_type":"=",
      "test_start_ts":1707316974996,
      "test_end_ts":1707316974999,
      "test_duration_ms":3
   }
   {
      "run_id":"MONITOR_PREPROD_1707316974989",
      "instance_name":"PREPROD",
      "test_name":"Messages pending in queue",
      "category":"Processing queues",
      "sub_category":"pending messages",
      "test_severity":"1",
      "test_returned_value":"2",
      "test_expected_value":"100",
      "test_eval_type":"<",
      "test_result":"OK",
      "test_start_ts":1707316975009,
      "test_end_ts":1707316975011,
      "test_duration_ms":2
   }
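For context, here is a rough sketch of how the first example could be reshaped toward ECS. The field choices (event.outcome, event.duration in nanoseconds, custom details under labels.*) are only my current guess at a mapping, which is precisely what I'd like feedback on:

```python
import json

# Sketch: reshape the first raw test result above into an ECS-flavoured
# document. The mapping below is my assumption, not an Elastic recommendation.
raw = {
    "run_id": "MONITOR_PREPROD_1707316974989",
    "instance_name": "PREPROD",
    "test_name": "current business date",
    "category": "Availability checks",
    "sub_category": "Dates",
    "test_severity": "3",
    "test_result": "FAIL",
    "test_start_ts": 1707316974996,
    "test_duration_ms": 3,
}

ecs_doc = {
    "@timestamp": raw["test_start_ts"],  # epoch millis of the test start
    "event": {
        "kind": "event",
        "outcome": "failure" if raw["test_result"] == "FAIL" else "success",
        "duration": raw["test_duration_ms"] * 1_000_000,  # ECS duration is in ns
        "severity": int(raw["test_severity"]),
    },
    "service": {"name": "my-big-old-app", "environment": raw["instance_name"]},
    # custom, non-ECS details kept under a dedicated namespace:
    "labels": {
        "run_id": raw["run_id"],
        "category": raw["category"],
        "sub_category": raw["sub_category"],
        "test_name": raw["test_name"],
    },
}

print(json.dumps(ecs_doc, indent=2))
```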