After digging into ECS 1.4 and the different sub-field groupings, I'm still struggling to understand the difference between service.id and service.type.
For service.id, the documentation states:
Unique identifier of the running service. If the service is comprised of many nodes, the service.id should be the same for all nodes.
This id should uniquely identify the service. This makes it possible to correlate logs and metrics for one specific service, no matter which particular node emitted the event.
Note that if you need to see the events from one specific host of the service, you should filter on that host.name or host.id instead.
example: d37e5ebfe0ae6c4972dbe9f0174a1637bb8247f6
For service.type, the documentation states:
The type of the service data is collected from.
The type can be used to group and correlate logs and metrics from one service type.
Example: If logs or metrics are collected from Elasticsearch, service.type would be elasticsearch.
type: keyword
example: elasticsearch
Based on the example value, it seems to me that service.id is more akin to an ephemeral id, like a Docker container id. But the documentation states that if the service is comprised of many nodes, the id should be the same on all of them, which throws me off a bit.
The only situation where I could see these two fields differing is a multi-node instance of a service, i.e. a sharded/distributed service. Even then, if I have a webapi with a separate storage layer that is clustered and scalable, I would want every instance of my webapi to have a unique service.id per node, because IMO the nodes have no relation to each other beyond sharing a storage layer: a specific node could fail without blowing up the service cluster, and I would just group on service.type + service.name to see my given webapi cluster's health.
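To make that grouping idea concrete, here is a minimal sketch in Python of what I have in mind. The service name, service id, host names, and index layout are all hypothetical; the point is only that service.id is shared across nodes while host.name distinguishes them, and that the health question is answered by aggregating on service.type + service.name.

```python
# Hypothetical ECS-style events from two nodes of the same webapi service.
# Per the docs, service.id is identical across nodes; host.name is what
# identifies the individual node.
events = [
    {
        "service": {"id": "d37e5ebfe0ae6c4972dbe9f0174a1637bb8247f6",
                    "name": "orders-webapi",
                    "type": "webapi"},
        "host": {"name": "node-01"},
        "event": {"outcome": "success"},
    },
    {
        "service": {"id": "d37e5ebfe0ae6c4972dbe9f0174a1637bb8247f6",
                    "name": "orders-webapi",
                    "type": "webapi"},
        "host": {"name": "node-02"},
        "event": {"outcome": "failure"},
    },
]

# The aggregation I would run against Elasticsearch to see cluster-wide
# health: filter on service.type + service.name, then break down by node
# and outcome. Expressed as a plain dict (Elasticsearch query DSL).
health_query = {
    "size": 0,
    "query": {"bool": {"filter": [
        {"term": {"service.type": "webapi"}},
        {"term": {"service.name": "orders-webapi"}},
    ]}},
    "aggs": {
        "per_node": {
            "terms": {"field": "host.name"},
            "aggs": {"outcomes": {"terms": {"field": "event.outcome"}}},
        }
    },
}
```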
Some clarity on this would be appreciated. It seems to me that in most use cases these two fields will be 1:1, and only with more advanced distributed-computing models would they begin to differ. If that is the case, it was difficult to tell from the documentation that this is in fact the scenario service.id is meant for.
Another thing I wanted to discuss was the idea of a service.state field.
This seems like a computed field, i.e. the composition of a given service's collection of events. It seems bizarre to me to attempt to log the state of the service in a multi-threaded, multi-process environment, where a cross-thread/cross-process state store would be required to compare successes and failures just to record a state in a single log event. The performance hit alone of maintaining a multi-process-aware data store is a pain, not to mention the logic needed to determine the different "states" of a service in real time from recent requests and responses, when you can do that very thing very easily in ELK without adding any complexity to my application's event logging.
It just seems extremely backwards to include this field in the service field group when I would most likely determine the health of my service from metrics and the collection of events it logs in ELK.
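As a sketch of what I mean by "do it in ELK": the query and thresholds below are assumptions purely for illustration (the 5-minute window, the 5% failure ratio, and the service name are made up), but they show that the "state" can be derived from the events after the fact rather than computed inside the application.

```python
# Derive a coarse service "state" from recent events in Elasticsearch,
# instead of having the application maintain and log it.
recent_events_query = {
    "size": 0,
    "query": {"bool": {"filter": [
        {"term": {"service.name": "orders-webapi"}},
        {"range": {"@timestamp": {"gte": "now-5m"}}},   # assumed window
    ]}},
    "aggs": {"outcomes": {"terms": {"field": "event.outcome"}}},
}

def derive_state(outcome_buckets: list[dict]) -> str:
    """Turn the 'outcomes' terms-aggregation buckets into a health state.
    Thresholds are arbitrary; the point is that the computation lives
    outside the service."""
    counts = {b["key"]: b["doc_count"] for b in outcome_buckets}
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    failure_ratio = counts.get("failure", 0) / total
    return "degraded" if failure_ratio > 0.05 else "healthy"
```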
So as-is, I would likely have three states consisting of "starting", "running", and "stopping", where 99% of events would be of state "running". In my opinion that isn't of significant value, since I already record lifecycle events, but I can see the argument for including them in more simplistic applications.
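For completeness, this is roughly all service.state would ever carry in my case; a minimal sketch, with the logger setup and service name being stand-ins for whatever structured logging is actually in use:

```python
import json
import logging

logger = logging.getLogger("orders-webapi")  # hypothetical service name

def log_lifecycle(state: str) -> None:
    """Emit a lifecycle event; state is one of "starting", "running", "stopping"."""
    logger.info(json.dumps({
        "service": {"name": "orders-webapi", "state": state},
        "message": f"service is {state}",
    }))
```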