When to use "Object" field datatype vs flat fieldnames

I am attempting to standardize many of our log data sets using the Elastic Common Schema. One of the common practices there is to nest field names using the JSON object format.


https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html

But I'm not sure why you would choose JSON objects over flat field names for these data sets. Beats also uses JSON object mappings, and I don't see the reason there either.

I started testing some mappings for this, but ultimately I'm just wondering: why would you choose JSON objects instead of flat field names?

Example object datatype mapping:
"source": {
"properties": {
"ip": {
"type": "ip"
},
"hostname": {
"type": "keyword"
}
}
}

Example flat mapping:
"source_ip": {
"type": "ip"
},
"source_hostname": {
"type": "keyword"
}
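
For what it's worth, in my testing the query side looks nearly the same either way; a quick sketch (index name and value are made up):

GET my-logs/_search
{
  "query": {
    "term": {
      "source.ip": "10.20.30.40"
    }
  }
}

With the flat mapping, the term field would just be "source_ip" instead.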

I feel like I must be missing something, and I don't want to make a large design decision now that I will regret later.


In this context, think of JSON objects as namespaces. Related information gets stored under a common root, which makes it easier for humans to identify which bits of the data belong together.

Functionally and performance-wise, these three are equivalent:

  • "source_ip": "10.20.30.40"
  • "source.ip": "10.20.30.40"
  • "source": { "ip": "10.20.30.40"}

The 3rd option groups the related fields in the _source of a document and makes it easier to read, IMO.
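
For example, with made-up values, the flat version stores:

{
  "source_ip": "10.20.30.40",
  "source_hostname": "web-01"
}

while the object version keeps everything about the source under one root:

{
  "source": {
    "ip": "10.20.30.40",
    "hostname": "web-01"
  }
}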


Hi @nverrill,

There are two things I see in your question:

  1. Whether to nest or not to nest
  2. Whether dots should represent nesting or literal dots in your key names

You didn't ask #2, but you may stumble upon it as well.

To answer #1 directly -- your question -- nested objects are slightly easier to handle programmatically. For example, you can delete or copy source to affect all of its subkeys at once, instead of looping over the keys looking for a "source_" prefix. Nesting also makes a clearer delineation between what's a section and what's simply two words joined together to represent a concept. Consider source_top_level_domain vs source.top_level_domain. The latter makes it clear what the section name is -- "source" -- and that there's a key named "top_level_domain" in there.
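
As a minimal sketch of the "delete all subkeys at once" point (the field names are assumed, not from your mappings), in Logstash you could drop the whole source object in one step:

filter {
  mutate {
    # removes [source][ip], [source][hostname], and every other subkey at once
    remove_field => ["[source]"]
  }
}

With flat names, the equivalent cleanup would need something like a ruby filter that iterates over the event looking for keys starting with "source_".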

To answer #2 -- why nesting instead of literal dots: that's actually a decision that predates ECS by a long time. Moving away from literal dots was an Elastic Stack 5.0 decision, iirc. As such, Logstash supported both and still does. Elasticsearch ingest pipelines only support "dedotting" dotted keys (replacing them with nesting), and otherwise don't support dotted keys. I think the same is true of Beats processors as well. So for that part, the future is "." meaning nesting, not a literal dot.
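
For reference, that "dedotting" in ingest pipelines is done with the dot_expander processor; a minimal sketch (the field name is just an example):

{
  "dot_expander": {
    "field": "source.ip"
  }
}

This turns a literal "source.ip" key in an incoming document into { "source": { "ip": ... } }.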

Finally, in any case where you're adding custom fields, you're more than welcome to name them however you want; you can use underscores exclusively for your custom fields, if you want :+1:

But the ECS fields are nested objects, and the dots are not literal, they represent nesting. If you want to follow the schema, that's how the keys are named :slight_smile:

You can read some more at https://www.elastic.co/guide/en/ecs/current/ecs-faq.html#dot-notation and https://www.elastic.co/guide/en/ecs/current/ecs-guidelines.html#_guidelines_for_field_names.


Thank you for the information. I ran into a couple of different issues, in Logstash actually, that caused me to reconsider the nested object datatype in favor of flat field names.

There seem to be a few gotchas when dealing with nested objects in Logstash. You can't quite use event.time and [event][time] interchangeably, possibly because of when Logstash translates these into object formats; I'm really not sure how that works. But there's a level of complexity to dealing with these datatypes in Logstash that had me wondering whether it was worth it, especially when flat field names may work just fine for our logging use case.

Answer #1 is certainly something to consider, though.

Thanks everyone!

Well, that's precisely how Logstash currently supports both meanings of ".". The two notations are not meant to be interchangeable. The following applies everywhere in Logstash (grok notation, Logstash config notation, even the Ruby API if you use the ruby filter):

  1. "event.time" means a literal key with a dot, no nesting
  2. "[event][time]" means object "event", with nested key "time"

Where it may get confusing is simply that Beats and Elasticsearch ingest processors interpret "event.time" as object "event", with nested key "time" (#2 above). And they don't support keys with literal dots (#1 above).

So I would instead recommend using the nesting notation exclusively in Logstash ("[event][time]"); that keeps you compatible with ECS. Otherwise, consumers of the data such as Elastic SIEM or Elastic Logs will not work well with your events.
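
For example, a filter that sets an ECS field using only the nesting notation (the value here is made up for illustration):

filter {
  mutate {
    # produces { "event": { "module": "exim" } } in the resulting document
    add_field => { "[event][module]" => "exim" }
  }
}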

I think I see what you mean. Hmmm. Something is not consistent, and I'm unsure what.
Opening up the Grok Debugger in Kibana for a quick test, I get this:

Sample Data:
exim[27833]: blah

Grok Pattern:
%{WORD:process.name}\[%{NUMBER:[process][id]}\]:%{SPACE}%{WORD:process.message}

Structured Data:

{
  "process": {
    "name": "exim",
    "message": "blah"
  },
  "[process][id]": "27833"
}

This doesn't look right at all. The flat field names with "." are being nested and the [process][id] looks flat with literal brackets.
I'm pretty sure this is not how they are being ingested though, because I am using [process][id] with success already.
The _source in my current index looks fine, like this:

  "_source": {
    "ecs": {
      "version": "1.1.0"
    },
    "process": {
      "name": "exim",
      "id": "27833"
    }

Maybe this is just an issue with the grok debugger?


Haha I didn't see this one coming! Looks like a bug in Kibana's grok debugger :slight_smile:

If you try it out with Logstash instead, you'll see the reverse:

bin/logstash -e "
  input { stdin { codec => line } }
  filter {
    grok {
      match => { 'message' => '%{WORD:process.name}\[%{NUMBER:[process][id]}\]:%{SPACE}%{WORD:process.message}' }
    }
  }
  output {
    stdout { codec => rubydebug }
  }"

When you paste the sample event to Logstash's stdin, you'll get:

{
            "process" => {
                 "id" => "27833"
            },
           "@version" => "1",
         "@timestamp" => 2020-02-05T21:41:57.993Z,
    "process.message" => "blah",
       "process.name" => "exim",
            "message" => "exim[27833]: blah",
               "host" => "matbook-pro.lan"
}

This better demonstrates Logstash's support for nesting vs literal dots :slight_smile:
