Emerging standards for ECS customization

I am curious if there are any emerging patterns / practices / standards relating to the development of ECS custom extensions that can be shared with the community.

We are experiencing difficulties in making logically consistent decisions about how to parse various event sources into custom ECS fields. I am looking to the Beats modules for guidance, but the mapping practices of the module developers are very inconsistent. Some examples:

  • The Palo Alto PANOS and Cisco ASA modules use a [vendor].[product] base field for events. (e.g. panw.panos and cisco.asa)

  • IBM Message Queue and Microsoft SQL modules use [vendor_abbreviation][vendor_product] base fields (e.g. ibmmq and mssql).

  • The Elastic Stack, IIS, and Azure modules add a root field for a product with no reference to a vendor (e.g. elasticsearch, logstash, azure, iis)

  • Winlogbeat and AuditD modules skip all reference to vendor or product, and simply add a top-level field to hold events from an OS subsystem (e.g. winlog for Windows Event Log events, auditd for Linux AuditD events).

Which of these approaches is "right"? I think some leadership and guidance would be appreciated by the community.

I know we are under no obligation to conform our in-house customizations with customizations provided through the Beats modules. However, we already decided to remap our in-house Panos, Azure event hub, Kafka, and Cisco ISE customizations to conform with Beats module developer decisions. We felt this was worthwhile as it allows us to take advantage of Kibana visualizations, dashboards, and machine learning jobs provided by Elastic.co. However, the remapping work would have been reduced with better schema development guidance from Elastic.

Any thoughts or feedback?

-J. Greg Mackinnon
Yale University

Great question, thanks Greg!

What you're looking at, when looking at all sources published by Elastic, is a mix of actual strategy and also decisions that predate ECS. ECS turned 1.0 a little over a year ago :tada:, while the Elasticsearch community has been going strong for around 10 years. So you have to take some and leave some.

ECS offers this guidance on custom fields, but it doesn't quite touch on your question:

https://www.elastic.co/guide/en/ecs/current/ecs-custom-fields-in-ecs.html

So as you point out, there are different strategies that can be employed to structure custom fields:

  1. leaf field at the root of a document:
    { "my_field": "foo" }
  2. adding leaf fields within existing ECS fields:
    { "event": { "my_field": "foo" } }
  3. field inside a namespace:
    { "my_schema": { "my_field": "foo" } }
  4. vendor namespace, subject namespace(s), then fields:
    { "vendor": { "my_schema": { "my_field": "foo" } } }.
    Note that this nesting can go more than one level deep, especially if a product produces more than one type of event: { "vendor": { "product": { "dataset": { "my_field": "foo" } } } }

You can find all of the above in Beats.
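To make the difference concrete, here is a minimal Python sketch that flattens each of the four shapes above into the dotted field name you would end up querying on. The `flatten` helper is hypothetical and written just for this illustration; it is not part of Beats or ECS.

```python
def flatten(doc, prefix=""):
    """Return {dotted_field_name: value} for a nested document."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the namespace, extending the dotted prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# The four strategies from the list above, keyed by their number.
strategies = {
    1: {"my_field": "foo"},
    2: {"event": {"my_field": "foo"}},
    3: {"my_schema": {"my_field": "foo"}},
    4: {"vendor": {"my_schema": {"my_field": "foo"}}},
}

field_names = {n: sorted(flatten(doc)) for n, doc in strategies.items()}
for n, names in field_names.items():
    print(n, names)
```

The only thing that changes between strategies is how much namespace sits in front of `my_field`, which is exactly what decides how likely a clash is.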

One very important aspect that drives a lot of the design of ECS, and can apply to custom fields, is that using nesting literally creates new namespaces that allow us to avoid clashes between concepts that are similarly named, but related to different things.

Just based on this concept we can dismiss no 1. above. Fields at the root of documents are discouraged in ECS, because they take up a whole namespace.

Think of {"user": "alice"}. Now we'd like to track user id, email and so on... Much better to be able to grow inside of a namespace: {"user": { "name": "alice", "id": ... } }.

No 2. above should be avoided, because this can cause confusion for end users (why isn't event.my_field in the ECS docs?) and because there's a higher chance of having a conflict, if ECS ever decides to add event.my_field.

However, it's sometimes acceptable to do no 2, either for forward compatibility (e.g. ECS has merged a change that's not yet released officially) or because the concept is so simple that if it ever gets in, you're likely to guess the semantics right anyway. If ECS ever adds the concept under another name, then you just transition from one field name to the other; you can have both fields at the same time during the transition while all consumers of the data get adjusted.
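As a rough sketch of that transition period, the Python below copies a custom field to its new name while leaving the old one in place, so consumers can migrate at their own pace. The field name `event.official_field` and the helpers `get_path`, `set_path` and `dual_write` are all made up for this illustration; in practice you would more likely do this in an ingest pipeline or in Logstash.

```python
import copy


def get_path(doc, dotted):
    """Return the value at a dotted path, or None if any part is missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def set_path(doc, dotted, value):
    """Set a value at a dotted path, creating intermediate objects as needed."""
    *parents, leaf = dotted.split(".")
    current = doc
    for part in parents:
        current = current.setdefault(part, {})
    current[leaf] = value


def dual_write(doc, old_name, new_name):
    """Return a copy of doc carrying the value under both field names."""
    out = copy.deepcopy(doc)
    value = get_path(out, old_name)
    if value is not None:
        set_path(out, new_name, value)
    return out


# "event.official_field" is a hypothetical name ECS might have picked.
original = {"event": {"my_field": "foo"}}
migrated = dual_write(original, "event.my_field", "event.official_field")
print(migrated)
```

Once every dashboard and query reads the new name, the old field can be dropped from new events.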

The ideal ways are really no 3 and 4, IMO. Sometimes there's no concept of a vendor, and the data source in question really only produces one kind of event. So one level of nesting could be enough.

I think in most cases though, it's useful to have multiple levels of nesting. You make a good point that the custom fields for Elastic stack logs are directly under a product namespace, and it would be a good idea to eventually move them under a vendor namespace. This allows the definition of vendor-wide concepts, while giving each product its own namespace as well. You can think of our common stack release, for example:

Logstash log:

{ "elastic": { 
    "stack_version": "7.6.1",
    "logstash": { "pipeline_id"... }
} }

Elasticsearch log:

{ "elastic": {
    "stack_version": "7.6.1",
    "elasticsearch": { "node_name"... } } }

Now looping back on the doc link I shared above, ideally you'll want to create a top-level namespace that's unlikely to ever get into ECS. A brand name (vendor, product) or a project name can be good choices. By contrast, starting at the top with a general concept poses a risk that ECS later adds this concept. So this is not the ideal approach.
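Here is a small Python sketch that ties the vendor namespace idea to the advice on picking a safe top-level name. Everything in it is an assumption made for illustration: `ECS_TOP_LEVEL` is a partial, hand-copied subset of ECS field set names (check the ECS reference for the real list), `under_vendor` and `clashes_with_ecs` are hypothetical helpers, and the `pipeline_id` and `node_name` values are placeholders.

```python
# Partial, hand-copied subset of ECS top-level field sets. This is an
# assumption for illustration only, not a complete or authoritative list.
ECS_TOP_LEVEL = {
    "agent", "client", "cloud", "container", "destination", "dns", "ecs",
    "error", "event", "file", "host", "http", "log", "network", "observer",
    "organization", "process", "related", "server", "service", "source",
    "url", "user", "user_agent",
}


def clashes_with_ecs(root_name):
    """True if the proposed custom root is already an ECS field set."""
    return root_name in ECS_TOP_LEVEL


def under_vendor(vendor, product, product_fields, shared=None):
    """Nest product fields (plus vendor-wide fields) under a vendor root."""
    if clashes_with_ecs(vendor):
        raise ValueError(f"'{vendor}' is already an ECS field set")
    namespace = dict(shared or {})
    namespace[product] = product_fields
    return {vendor: namespace}


shared = {"stack_version": "7.6.1"}
# Placeholder field values; only the shape matters here.
logstash_doc = under_vendor("elastic", "logstash", {"pipeline_id": "main"}, shared)
es_doc = under_vendor("elastic", "elasticsearch", {"node_name": "node-1"}, shared)
print(logstash_doc)
print(es_doc)
```

A static list like this only catches today's clashes. It can't tell you what ECS will add later, which is why a brand or project name is the safer root.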

However, while you work within this custom namespace, it's fine to use general concepts again, since by definition you're inside your own namespace. Think of correlation between your org's many models of Cisco firewalls (presumably):

Model 1:

{ "cisco": {
    "firewall": { "foo": "some general firewall concept" },
    "model1": { model1-specific fields }
} }

Model 2:

{ "cisco": {
    "firewall": { "foo": "some general firewall concept" },
    "model2": { model2-specific fields }
} }

Now you can correlate between general firewall concepts, no matter the exact model, while still leaving room for model-specific details.
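A quick Python sketch of what that correlation could look like: events from different models are grouped on the shared `cisco.firewall.foo` field, while the model-specific fields stay untouched. The sample values ("deny", "allow", `detail`) and the `get_path` and `group_by` helpers are invented for this example.

```python
from collections import defaultdict


def get_path(doc, dotted):
    """Return the value at a dotted path, or None if any part is missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def group_by(events, dotted):
    """Group events by the value found at a dotted field path."""
    groups = defaultdict(list)
    for event in events:
        value = get_path(event, dotted)
        if value is not None:  # skip events that lack the field
            groups[value].append(event)
    return dict(groups)


# Made-up events from two firewall models sharing the cisco.firewall namespace.
events = [
    {"cisco": {"firewall": {"foo": "deny"}, "model1": {"detail": 1}}},
    {"cisco": {"firewall": {"foo": "deny"}, "model2": {"detail": 2}}},
    {"cisco": {"firewall": {"foo": "allow"}, "model2": {"detail": 3}}},
]

groups = group_by(events, "cisco.firewall.foo")
for value, matched in groups.items():
    print(value, len(matched))
```

The same query works for any model, because the general concept lives at one stable path.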

All of the above is guidance on how to structure your own custom events. Note that if your goal is specifically to correlate one of your custom data sources with a specific Beats module (e.g. one coming in via Kafka + Logstash and one coming directly from Beats), you're welcome to match what Beats does, even if it's not in ECS.