Great question, thanks Greg!
What you're looking at, when looking at all sources published by Elastic, is a mix of actual strategy and decisions that predate ECS. ECS turned 1.0 a little over a year ago, while the Elasticsearch community has been going strong for around 10 years. So you have to take some and leave some.
ECS offers this guidance on custom fields, but it doesn't quite touch on your question:
https://www.elastic.co/guide/en/ecs/current/ecs-custom-fields-in-ecs.html
So, as you point out, there are different strategies that can be employed to structure custom fields:
1. A leaf field at the root of the document:
{ "my_field": "foo" }
2. A leaf field added inside an existing ECS field set:
{ "event": { "my_field": "foo" } }
3. A field inside a custom namespace:
{ "my_schema": { "my_field": "foo" } }
4. A vendor namespace, then subject namespace(s), then the fields:
{ "vendor": { "my_schema": { "my_field": "foo" } } }
Note that this nesting can go more than one level deep, especially if a product produces more than one type of event:
{ "vendor": { "product": { "dataset": { "my_field": "foo" } } } }
You can find all of the above in Beats.
One very important aspect that drives much of the design of ECS, and that applies to custom fields as well, is that nesting creates new namespaces, which lets us avoid clashes between concepts that are named similarly but relate to different things.
Based on this concept alone, we can dismiss option 1 above. Fields at the root of documents are discouraged in ECS, because they take up a whole namespace.
Think of {"user": "alice"}
. Now we'd like to track user id, email and so on... Much better to be able to grow inside of a namespace: {"user": { "name": "alice", "id": ...}
.
Option 2 above should be avoided, because it can cause confusion for end users (why isn't event.my_field in the ECS docs?) and because there's a higher chance of a conflict if ECS ever decides to add event.my_field.
However, option 2 is sometimes acceptable: either for forward compatibility (e.g. ECS has merged a change that's not yet officially released), or because the concept is so simple that if it ever gets into ECS, you're likely to guess the semantics right anyway. If ECS ever adds the concept under another name, you just transition from one field name to the other; you can populate both fields at the same time during the transition, while all consumers of the data get adjusted.
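To illustrate that transition (the field names here are hypothetical), a document can carry both the old custom name and the newly released ECS name until every consumer has switched over:
{ "event": {
    "my_field": "foo",
    "new_ecs_name": "foo"
} }
Once dashboards, alerts and other consumers reference only the new name, you can stop populating the custom field.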
The ideal approaches are really options 3 and 4, IMO. Sometimes there's no concept of a vendor, and the data source in question really only produces one kind of event, so one level of nesting can be enough.
I think in most cases, though, it's useful to have multiple levels of nesting. You make a good point that the custom fields for Elastic stack logs are directly under a product namespace, and it would be a good idea to eventually move them under a vendor namespace. This allows the definition of vendor-wide concepts, while still giving each product its own namespace. Take our own stack logs, for example:
Logstash log:
{ "elastic": {
    "stack_version": "7.6.1",
    "logstash": { "pipeline_id"... }
} }
Elasticsearch log:
{ "elastic": {
    "stack_version": "7.6.1",
    "elasticsearch": { "node_name"... }
} }
Now, looping back to the doc link I shared above: ideally you'll want to create a top-level namespace that's unlikely to ever get into ECS. A brand name (vendor, product) or a project name can be a good choice. By contrast, starting at the top with a general concept poses the risk that ECS later adds this concept, so that's not the ideal approach.
However, while you work within this custom namespace, it's fine to use general concepts again, since by definition you're inside your own namespace. Think of correlating across your org's many models of (presumably) Cisco firewalls:
Model 1:
{ "cisco": {
    "firewall": { "foo": "some general firewall concept" },
    "model1": { model1-specific fields }
} }
Model 2:
{ "cisco": {
    "firewall": { "foo": "some general firewall concept" },
    "model2": { model2-specific fields }
} }
Now you can correlate between general firewall concepts, no matter the exact model, while still leaving room for model-specific details.
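As a rough sketch of that correlation in practice (the index pattern and value are hypothetical), a single query against the shared cisco.firewall namespace matches events from both models:
GET firewall-*/_search
{ "query": { "match": { "cisco.firewall.foo": "some general firewall concept" } } }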
All of the above is guidance on how to structure your own custom events. Note that if your goal is specifically to correlate one of your custom data sources with a specific Beats module (e.g. one coming in via Kafka + Logstash and the other coming directly from Beats), you're welcome to match what Beats does, even if it's not in ECS.