I am attempting to standardize many of our log data sets using the Elastic Common Schema. One of the common practices there is to nest field names using the JSON object format.
But I'm not sure why you would choose JSON objects over flat field names for these datasets. Beats also use JSON object mappings, but I don't see the reason.
I started testing some mappings for this, but ultimately I'm just wondering: why would you choose JSON objects over flat field names?
In this context, think of JSON objects as namespaces. Related information gets stored under a common root, which makes it easier for humans to identify which bits of the data belong together.
Functionally and performance-wise, these three are equivalent:
"source_ip": "10.20.30.40"
"source.ip": "10.20.30.40"
"source": { "ip": "10.20.30.40" }
The 3rd option groups the related fields in the _source of a document and makes it easier to read, IMO.
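In fact, for the 2nd and 3rd forms Elasticsearch ends up with the same mapping either way, since dotted keys in documents are expanded into objects at index time. A quick sketch (index name made up):

PUT my-test/_doc/1
{ "source.ip": "10.20.30.40" }

PUT my-test/_doc/2
{ "source": { "ip": "10.20.30.40" } }

GET my-test/_mapping returns "source" as an object with an "ip" property for both; only the _source differs.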
The second, related question is whether dots should represent nesting or literal dots in your key names. You didn't ask #2, but you may stumble upon it as well.
To answer #1 directly -- your question -- nested objects are slightly easier to handle programmatically. E.g. you can delete or copy source to affect all of its subkeys at once, instead of looping over them, looking for "source_" keys. It also makes a clearer delineation between what's a section and what's simply two words joined together to represent a concept. Consider source_top_level_domain vs source.top_level_domain. The latter makes it clear what's a section name -- "source" -- and that there's a key named "top_level_domain" in there.
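As a sketch of that first point, deleting the whole group in a Logstash filter is one operation (field name assumed):

filter {
  mutate {
    # Drops the "source" object and every key nested under it
    # ([source][ip], [source][port], ...) in one go.
    remove_field => ["[source]"]
  }
}

With flat names you'd have to enumerate every source_* field yourself.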
To answer #2 -- why nesting instead of literal dots: that's a decision that long predates ECS. Moving away from literal dots was an Elastic Stack 5.0 decision, iirc. As such, Logstash supported both and still does. Elasticsearch ingest pipelines only support "dedotting" dotted keys (or replacing them with nesting), but otherwise don't support dotted keys. I think the same is true of Beats processors as well. So for that part, the future is with "." meaning nesting, not a literal dot.
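For instance, the dot_expander ingest processor is what turns a literal dotted key into nesting (pipeline id made up):

PUT _ingest/pipeline/dedot-example
{
  "processors": [
    { "dot_expander": { "field": "source.ip" } }
  ]
}

That rewrites a literal "source.ip" key into { "source": { "ip": ... } } before indexing.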
Finally, in any case where you're adding custom fields, you're more than welcome to name them however you want; you can use underscores exclusively for your custom fields, if you like.
But the ECS fields are nested objects, and the dots are not literal: they represent nesting. If you want to follow the schema, that's how the keys are named.
Thank you for the information. I ran into a couple of different issues, in Logstash actually, that caused me to reconsider using the nested object datatype and favor flat field names.
There seem to be a few gotchas when dealing with nested objects in Logstash. You can't use event.time and [event][time] 100% interchangeably -- possibly due to when Logstash translates this into object form; I'm really not sure how that works -- but there's a level of complexity in dealing with these datatypes in Logstash that had me questioning whether it was worth it, especially when flat field names may work just fine for our logging use case.
Answer #1 is certainly something to consider, though.
Well, that's precisely how Logstash currently supports both meanings of ".". These two notations are not meant to be interchangeable. The following applies everywhere in Logstash (grok notation, Logstash config notation, even the Ruby API if you use the Ruby filter):
"event.time" means a literal key with a dot, no nesting
"[event][time]" means object "event", with nested key "time"
Where it may get confusing is simply that Beats and Elasticsearch ingest processors interpret "event.time" as object "event", with nested key "time" (#2 above). And they don't support keys with literal dots (#1 above).
So I would instead recommend using the nesting notation exclusively in Logstash ("[event][time]"); that way you'll be compatible with ECS. Otherwise, consumers of the data such as Elastic SIEM or Elastic Logs will not work well with your events.
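In grok you can write the nested target directly in the capture; the pattern and field choices here are just an example:

filter {
  grok {
    # [source][ip] produces { "source": { "ip": ... } }, matching ECS
    match => { "message" => "%{IP:[source][ip]} %{WORD:[event][action]}" }
  }
}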
I think I see what you mean. Hmmm. Something is not consistent and I'm unsure what.
Opening up the grok debugger in Kibana for a quick test, I get this:
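Something like this (recreating it with made-up values, testing a pattern along the lines of %{INT:process.id} %{INT:[process][id]}):

{
  "process": {
    "id": "123"
  },
  "[process][id]": "123"
}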
This doesn't look right at all. The flat field names with "." are being nested and the [process][id] looks flat with literal brackets.
I'm pretty sure this is not how they are being ingested though, because I am using [process][id] with success already.
The current index field mapping looks fine, like this:
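Something along these lines (trimmed to the relevant part; the field type is assumed):

{
  "mappings": {
    "properties": {
      "process": {
        "properties": {
          "id": { "type": "long" }
        }
      }
    }
  }
}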