About this question
This question:
- Spans two forum categories—Elasticsearch and Logstash—but I could select only one.
- Is a copy, with some rewording, of two comments I recently added to the closed elastic/logstash GitHub issue #5676. It belatedly occurred to me that I might get more feedback on this forum. Apologies for the crossposting.
- Is similar to questions already asked in this forum, but not, I think, identical (again, with apologies if I’m wrong about this; I don’t want to waste anyone’s time).
What I’m doing now
I am a member of the development team for a product that extracts data from proprietary binary-format logs, and then forwards that data to Logstash; for example, as JSON Lines over TCP.
Each log event that this product extracts (each line of JSON Lines that it forwards to Logstash) contains a field named `time` that is the event time stamp. If the original binary-format log contains multiple candidate fields for an event time stamp, the product chooses one to use as the value of the `time` field.
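For illustration, here is a hypothetical example of one such JSON Lines event (all field names and values other than `time` are made up):

```
{"time": "2016-07-12T09:15:30.123456+02:00", "level": "INFO", "message": "example event"}
```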
Currently, I use a Logstash config with a `date` filter that parses the `time` field and sets the Logstash-generated `@timestamp` field from it. But I end up with events (documents) in Elasticsearch that have both `time` and `@timestamp` fields, with effectively* identical values. I don’t like that duplication.
Discussion of possible options
To avoid the duplication, I could use `remove_field` to remove the `time` field, but this starts to grate. My input events already contain a time stamp field named `time`. I’m happy with that field name. I don’t want to have to specify a `date` filter to “map” that field to the Logstash-specific `@timestamp` field. I don’t want to have to remove “my” `time` field to avoid duplication.
I could omit the `date` filter and let Logstash set `@timestamp` to the default value: the time that Logstash first sees the event. I can imagine that this might be useful to assist with debugging, in the case of problems with forwarding. Given the choice, though, I think I’d prefer to save the bytes and simply omit `@timestamp`, and have a “lean” Logstash config with only input and output sections; no filter section.
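A sketch of what that lean config might look like. The port, codec, and index name are assumptions for illustration:

```
input {
  tcp {
    port  => 5000          # assumed port
    codec => json_lines    # one JSON document per line
  }
}
output {
  elasticsearch {
    index => "myproduct"   # assumed index name
  }
}
```

Note, though, that as far as I can tell Logstash always adds a default `@timestamp` to each event, so a lean config like this avoids the `date` filter but does not actually omit `@timestamp` from the output.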
* The value of the `@timestamp` field generated by the `date` filter does not exactly match the original `time` field value. The `time` field value:
- Typically ends with a zone designator in the format `+hh:mm` or `-hh:mm`
- Contains fractions of a second to 6 decimal places (microsecond precision)

whereas `@timestamp` is in UTC (it always has a `Z` zone designator) and contains fractions of a second to only 3 decimal places. (I understand that Elasticsearch currently represents date fields as epoch time values with millisecond precision.)
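For example (hypothetical values), an original `time` value and the `@timestamp` that the `date` filter would produce from it:

```
time:       2016-07-12T09:15:30.123456+02:00   (microseconds, +02:00 offset)
@timestamp: 2016-07-12T07:15:30.123Z           (milliseconds, UTC)
```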
For various reasons (that I’m happy to discuss), we (the product development team) would prefer to preserve, in the ingested Elasticsearch document, the original UTC-offset zone designator and the microsecond precision of the `time` field, even if these are only preserved in the original string field value of the ingested source.
A GitHub user commented:

> the precedent has been set for Logstash ... to use `@timestamp` as the canonical field

That’s true, and it’s one reason why I’m grappling with this question: in the context of Logstash, it leads me to set `@timestamp` to the value of “my” `time` field, and then remove `time`:
```
filter {
  date {
    # Parse the event's own time field into @timestamp, then drop time.
    match => [ 'time', '... ' ]
    remove_field => 'time'
  }
}
```
Whereas, ideally, I’d prefer the `time` field from my product to pass through with its original name and value to the analytics platform (Elasticsearch is just one such platform) without being “forced” into a different field name. In practice, though, that might not be possible, because there is no “cross-platform canon” in this regard.
Other platforms aside, even within the Elastic Stack, if I bypass Logstash and use the Elasticsearch bulk API, I don’t need to introduce `@timestamp` at all; that is, unless I want documents ingested via the bulk API to match the structure of documents ingested via Logstash.
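For example, a hypothetical bulk request that indexes a document with only the original `time` field; the index and type names are placeholders:

```
POST /_bulk
{ "index": { "_index": "myproduct", "_type": "log" } }
{ "time": "2016-07-12T09:15:30.123456+02:00", "message": "example event" }
```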
Summary of options
- Omit the `date` filter and let the value of `@timestamp` default to the time that Logstash first sees an event.
  - Pros:
    - More concise Logstash config.
    - Perhaps (I’ve not done any benchmark testing to check this) less Logstash processing: inserting a default value rather than parsing a supplied input date value.
    - Perhaps a potentially useful `@timestamp` value for debugging.
  - Cons:
    - I’m not convinced of the usefulness of this `@timestamp` value. Is it really worth storing in Elasticsearch?
    - This `@timestamp` value has nothing to do with the event. Users will have to understand the event data: they will have to know that `time` is the “true” event time stamp.
- Specify the `date` filter without `remove_field`.
  - Pros:
    - `@timestamp` matches the event time stamp, thereby matching the expectations of users who are familiar with the Logstash “canon”.
  - Cons:
    - More verbose Logstash config.
    - Perhaps more Logstash processing.
    - Data duplication: `@timestamp` matches `time`.
    - Forced to use a Logstash-specific field name, when this field name is not required by other analytics platforms, and is not even required by other ingestion methods within the Elastic Stack (such as the Elasticsearch bulk API).
- Specify the `date` filter with `remove_field`.
  Pros and cons are the same as the previous item, minus the data-duplication con.
Thoughts and suggestions welcome.