Dec 12th, 2018: [EN][Elasticsearch] Automatically adding a timestamp to documents


(Abdon Pijpelink)

Sometimes it can be useful to know at what time a document was indexed by Elasticsearch, for example when you want to see how long it takes for log lines to end up in Elasticsearch after they were generated.

Back in the old days, prior to version 5 of Elasticsearch, documents had a metadata field called _timestamp. When enabled, this _timestamp was automatically added to every document. It would tell you the exact time a document had been indexed.

In version 5 this functionality went away. The idea was that, if you needed it, it would be better to add a timestamp field to your documents yourself. You could, for example, use the Elasticsearch ingest node functionality for that. Ingest nodes let you define a pipeline made up of processors, and such a pipeline pre-processes a document before it is indexed. To add a timestamp field to a document, for example, you can use a pipeline that contains the set processor.

The downside of this approach was that you had to explicitly tell Elasticsearch to apply a pipeline every time you indexed data. You would do that in the document's indexing request, in the Beats Elasticsearch output configuration, or in the Logstash Elasticsearch output configuration. This was a bit cumbersome, and it was easy to accidentally skip the pipeline by forgetting to specify it.
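For reference, specifying the pipeline on a single indexing request looks roughly like the sketch below, which uses the pipeline query parameter of the index API. The names my_index and my_timestamp_pipeline are placeholders here, and the request assumes a pipeline with that name has already been stored.

```console
# The pipeline has to be named on every request that should use it
PUT my_index/_doc/1?pipeline=my_timestamp_pipeline
{
  "foo": "bar"
}
```

Leave off the query parameter and the document is indexed as-is, which is exactly the mistake that was easy to make.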

With the release of 6.5 those challenges went away. It is now possible to configure a default pipeline for an index, using the index.default_pipeline setting (written as default_pipeline inside the settings block). For example:

PUT my_index
{
  "settings": {
    "default_pipeline": "my_timestamp_pipeline"
  }
}

Any document indexed into my_index will now be pre-processed by the my_timestamp_pipeline pipeline, unless the indexing request names a different pipeline. That pipeline could have been defined like this:

PUT _ingest/pipeline/my_timestamp_pipeline
{
  "description": "Adds a field to a document with the time of ingestion",
  "processors": [
    {
      "set": {
        "field": "ingest_timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

The pipeline above adds a field ingest_timestamp to every document, set to the time at which the document was ingested. We can test that this works by indexing a document into this index without explicitly specifying a pipeline:

PUT my_index/_doc/1
{
  "foo": "bar"
}
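Fetching the document back takes nothing more than a plain GET on the same index and ID, for example:

```console
# Retrieve the document that was just indexed
GET my_index/_doc/1
```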

If we now retrieve this document, we can see that a field ingest_timestamp was added to it:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "foo" : "bar",
    "ingest_timestamp" : "2018-12-12T12:04:09.921Z"
  }
}
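If you would rather check a pipeline without indexing anything, the simulate API should do the trick. A minimal sketch, assuming the pipeline has been stored under the name my_timestamp_pipeline; the sample document is made up for illustration:

```console
# Run the stored pipeline against a sample document without indexing it
POST _ingest/pipeline/my_timestamp_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "foo": "bar"
      }
    }
  ]
}
```

The response should list each sample document as it would look after the pipeline ran, including the added ingest_timestamp field.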

Ingest pipelines are a powerful feature of Elasticsearch, and now that it's possible to specify a default pipeline on an index, you can ensure that a pipeline is applied to every document indexed into it, without relying on each client to ask for it.
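Two related requests may be handy in practice. This is a sketch that assumes the index my_index and the pipeline my_timestamp_pipeline already exist. As far as I know index.default_pipeline is a dynamic setting, so it can be added to an existing index, and the special pipeline name _none skips the default pipeline for a single request. The document ID and body in the second request are made up.

```console
# Set the default pipeline on an index that already exists
PUT my_index/_settings
{
  "index.default_pipeline": "my_timestamp_pipeline"
}

# Bypass the default pipeline for one request
PUT my_index/_doc/2?pipeline=_none
{
  "foo": "baz"
}
```

The second document should end up without an ingest_timestamp field, which is a quick way to confirm the override works on your version.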

Want to know more about ingest pipelines? Come back tomorrow for more pipeline goodness, when @lwintergerst will discuss chaining pipelines.


Dec 13th, 2018: [EN][Elasticsearch] Chaining Ingest Pipelines