Data Streams vs "Traditionally" Elastic Indexes

I have recently upgraded my ELK stack, deployed it in a Kubernetes cluster, and wanted to optimise the way I'm using it.

Use Case:

Many developer teams push their logs to a Logstash pipeline which then outputs into Elasticsearch. In the past I have used "traditional" Elasticsearch indices, creating a new index each day and using Curator to purge old indices and manage disk usage.

Existing indexes:
logstash-nprd-%{+YYYY.MM.dd}
logstash-prd-month-%{+YYYY.MM.dd}
logstash-prd-year-%{+YYYY.MM.dd}

This seems to work OK, but I notice that a lot of the indices are quite small and therefore not very optimised for searching, as I've read they should ideally be around 50GB. Also, I wanted to start using the built-in lifecycle (ILM) policies rather than the Curator tool. This is where I started to see a lot about Data Streams and how they work well for logs.
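For reference, a minimal delete-only ILM policy of the kind that could replace Curator might look like the following (the policy name and retention values here are just illustrative, not from my actual setup):

```json
PUT _ilm/policy/logs-prd-month-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

The rollover action in the hot phase is what addresses the "many small daily indices" problem: instead of one index per day, a new backing index is only created when the current one reaches the size or age threshold.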

So my questions / queries:

  • Does my use case of developers firing logs into elastic via logstash pipelines suit the use of Data Streams?
  • Being time series, what happens if a log has some latency getting into Elasticsearch and therefore has an older timestamp? Is it discarded?
  • Is the timestamp field added by Logstash, or does it come from the application or Elasticsearch?
  • Are there advantages to using Data Streams over traditional Elasticsearch indices?

This is how I have configured Data Streams for testing:

Because I wanted to use different ILM policies for non-prod, prod, etc., I created three data stream index templates, each with a different ILM policy attached for deleting the data:

logs-nprd-week
logs-prd-month
logs-prd-year
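For anyone following along, an index template for one of these might look roughly like this (the policy name is illustrative; the priority should just be higher than the built-in logs template so the custom template wins):

```json
PUT _index_template/logs-prd-month
{
  "index_patterns": ["logs-prd-month*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-prd-month-policy"
    }
  }
}
```

One thing to keep in mind: with the Logstash `data_stream_*` settings below, the actual stream name is composed as `{type}-{dataset}-{namespace}`, so with dataset `prd-month` and the default namespace the stream will be `logs-prd-month-default` — the template's `index_patterns` needs to match that full name.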

My logstash config:

Filter example for adding tags based on the lifecycle (lc) field:

filter {
  if [lc] == "PRD" or [fields][lc] == "PRD" {
    if [retention] == "month" {
      mutate {
        add_tag => [ "prod_store_month" ]
        # Added as I was getting an error about a concrete value in the logs
        remove_field => [ "host" ]
      }
    } else {
      mutate {
        add_tag => [ "prod_store_year" ]
        # Added as I was getting an error about a concrete value in the logs
        remove_field => [ "host" ]
      }
    }
  }
}

Output example:

output {
  if "prod_store_year" in [tags] {
    elasticsearch {
      data_stream => "true"
      data_stream_type => "logs"
      data_stream_dataset => "prd-year"
      hosts => ["http://elasticsearch-master:9200"]
      user => "logstash_internal"
      password => "xxxxx"
    }
  } else if "prod_store_month" in [tags] {
    elasticsearch {
      data_stream => "true"
      data_stream_type => "logs"
      data_stream_dataset => "prd-month"
      hosts => ["http://elasticsearch-master:9200"]
      user => "logstash_internal"
      password => "xxxxx"
    }
  }
}
Provided Data Streams are the recommended approach for my use case, have I set them up correctly and in line with best practice?

Many many thanks for taking the time.

Hugo

  1. Yes it does
  2. It's indexed, but it may not be in the same backing index as the other data from that time. How big an issue that is will depend on your ILM policy
  3. Usually by Logstash
  4. For this use case, not really

Thanks, Mark, for your response.

So just to confirm: if I'm not getting any benefit from using Data Streams for logs from application developers, should I continue to use the "traditional" indices?

I'm trying to work out best practice here for my use case and any advice is greatly appreciated.

Hugo

The benefit of using them is that Elasticsearch abstracts away a tonne of the setup and maintenance, so you would be better off using them.

So I guess point 4 wasn't entirely correct :wink:

Thanks Mark, data streams it is then :+1:

Was my setup of the Data Streams as described above correct and best practice?

It looks ok. I wouldn't be sending data to be indexed to the master though; use a data node so it doesn't cause issues with overloading the master.
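In practice that just means swapping the `hosts` in the elasticsearch output so it points at data nodes rather than the master (the service name below is hypothetical; use whatever your data node service is called in the cluster):

```
elasticsearch {
  data_stream => "true"
  data_stream_type => "logs"
  data_stream_dataset => "prd-year"
  hosts => ["http://elasticsearch-data:9200"]
  user => "logstash_internal"
  password => "xxxxx"
}
```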

Ok, many thanks Mark. So creating separate index templates is the right way to go to apply different ILM policies to the streams:

logs-nprd-week
logs-prd-month
logs-prd-year

Hugo