Data Streams vs "Traditionally" Elastic Indexes

I have recently upgraded my ELK stack, deployed it in a Kubernetes cluster, and wanted to optimise the way I'm using it.

Use Case:

Many developer teams push their logs to a Logstash pipeline which then outputs into Elasticsearch. In the past I have used "traditional" Elasticsearch indices, creating a new index each day and using Curator to purge old indices and manage disk usage.

Existing indexes:
logstash-nprd-%{+YYYY.MM.dd}
logstash-prd-month-%{+YYYY.MM.dd}
logstash-prd-year-%{+YYYY.MM.dd}

This seems to work OK, but I notice that a lot of the indices are quite small and therefore not very optimised for searching, as I've read they should ideally be around 50GB. Also, I wanted to start using the built-in lifecycle (ILM) policies rather than the Curator tool. This is where I started to see a lot about Data Streams and how they work well for logs.
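For reference, a minimal delete-only ILM policy of the kind that could replace Curator might look like the following (the policy name and retention values here are just illustrative, not from my actual setup):

```json
PUT _ilm/policy/logs-prd-month-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

The rollover action in the hot phase is what addresses the "many small daily indices" problem: instead of one index per day, a new backing index is only created when the current one reaches the size or age threshold.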

So my questions / queries:

  • Does my use case of developers firing logs into elastic via logstash pipelines suit the use of Data Streams?
  • Being time series, what happens if a log has some latency getting into Elasticsearch and therefore has an older timestamp? Is it discarded?
  • Is the timestamp field added by Logstash, or does it come from the application or Elasticsearch?
  • Are there advantages to using Data Streams over traditional Elasticsearch indices?

This is how I have configured Data Streams for testing:

Because I wanted to use different ILM policies for non-prod, prod, etc., I created three data stream index templates, each with a different ILM policy attached for deleting the data:

logs-nprd-week
logs-prd-month
logs-prd-year
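For anyone following along, an index template for one of these might look roughly like this (the policy name is illustrative; the priority should just be higher than the built-in logs template so the custom template wins):

```json
PUT _index_template/logs-prd-month
{
  "index_patterns": ["logs-prd-month*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-prd-month-policy"
    }
  }
}
```

One thing to keep in mind: with the Logstash `data_stream_*` settings below, the actual stream name is composed as `{type}-{dataset}-{namespace}`, so with dataset `prd-month` and the default namespace the stream will be `logs-prd-month-default` — the template's `index_patterns` needs to match that full name.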

My logstash config:

Filter example for adding tags based on the lifecycle (lc) field:

filter {
  if [lc] == "PRD" or [fields][lc] == "PRD" {
    if [retention] == "month" {
      mutate {
        add_tag => [ "prod_store_month" ]
        # Added as I was getting an error about a concrete value in the logs
        remove_field => [ "host" ]
      }
    } else {
      mutate {
        add_tag => [ "prod_store_year" ]
        # Added as I was getting an error about a concrete value in the logs
        remove_field => [ "host" ]
      }
    }
  }
}

Output example:

output {
  if "prod_store_year" in [tags] {
    elasticsearch {
      data_stream => "true"
      data_stream_type => "logs"
      data_stream_dataset => "prd-year"
      hosts => ["http://elasticsearch-master:9200"]
      user => "logstash_internal"
      password => "xxxxx"
    }
  } else if "prod_store_month" in [tags] {
    elasticsearch {
      data_stream => "true"
      data_stream_type => "logs"
      data_stream_dataset => "prd-month"
      hosts => ["http://elasticsearch-master:9200"]
      user => "logstash_internal"
      password => "xxxxx"
    }
  }
}
Provided Data Streams are the recommended approach for my use case, have I set them up correctly and in line with best practice?

Many many thanks for taking the time.

Hugo

  1. Yes it does
  2. It's indexed, but it may not be in the same backing index as the other data from that time. How big an issue that is will depend on your ILM policy
  3. Usually by Logstash
  4. For this use case, not really

Thanks, Mark, for your response.

So just to confirm: if I'm not getting any benefit from using Data Streams for logs from application developers, should I continue to use the "traditional" indices?

I'm trying to work out best practice here for my use case and any advice is greatly appreciated.

Hugo

The benefit of using them is that Elasticsearch abstracts away a tonne of the setup and maintenance, so you would be better off using them.

So I guess point 4 wasn't entirely correct :wink:

Thanks Mark, data streams it is then :+1:

Was my setup of the Data Streams as described above correct and best practice?

It looks ok. I wouldn't be sending data to be indexed to the master though; use a data node so it doesn't cause issues with overloading the master.
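In practice that just means swapping the `hosts` in the elasticsearch output so it points at data nodes rather than the master (the service name below is hypothetical; use whatever your data node service is called in the cluster):

```
elasticsearch {
  data_stream => "true"
  data_stream_type => "logs"
  data_stream_dataset => "prd-year"
  hosts => ["http://elasticsearch-data:9200"]
  user => "logstash_internal"
  password => "xxxxx"
}
```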

Ok, many thanks Mark. So creating separate index templates is the right way to go to apply different ILM policies to the streams:

logs-nprd-week
logs-prd-month
logs-prd-year

Hugo