I have recently upgraded my ELK stack, deployed it in a Kubernetes cluster, and want to optimise the way I'm using it.
Use Case:
Many developer teams push their logs to a Logstash pipeline, which then outputs to Elasticsearch. In the past I have used "traditional" Elasticsearch indices, creating a new index each day and using Curator to purge old indices and manage disk usage.
Existing indexes:
logstash-nprd-%{+YYYY.MM.dd}
logstash-prd-month-%{+YYYY.MM.dd}
logstash-prd-year-%{+YYYY.MM.dd}
This seems to work OK, but I notice that a lot of the indices are quite small and therefore not very optimised for searching, as I read they should ideally be around 50GB. I also wanted to start using the built-in lifecycle (ILM) policies rather than the Curator tool. This is where I started to see a lot about data streams and how well they work for logs.
So my questions / queries:
- Does my use case of developers firing logs into Elasticsearch via Logstash pipelines suit the use of data streams?
- Since data streams are time series, what happens if a log is delayed getting into Elasticsearch and its timestamp is therefore older? Is it discarded?
- Is the timestamp field added by Logstash, by the application, or by Elasticsearch?
- Are there advantages to using data streams over traditional Elasticsearch indices?
This is how I have configured Data Streams for testing:
Because I wanted to use different ILM policies for non-prod, prod, etc., I created three data stream index templates, each with a different ILM policy attached for deleting the data (a rough sketch of one policy/template pair is below the list):
logs-nprd-week
logs-prd-month
logs-prd-year
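For reference, this is roughly what I have in mind for one of the ILM policy / index template pairs, written as Dev Tools requests. The policy name logs-prd-year-policy, the 50gb rollover size, the 365d retention and the priority value of 500 are just placeholders I picked for illustration:

PUT _ilm/policy/logs-prd-year-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

PUT _index_template/logs-prd-year
{
  "index_patterns": ["logs-prd-year-*"],
  "data_stream": {},
  "priority": 500,
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-prd-year-policy"
    }
  }
}

The idea is that rollover keeps the backing indices near the ~50GB mark mentioned above, and the delete phase takes over what Curator used to do.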
My logstash config:
Example of adding tags based on the lifecycle (lc) field:
filter {
  if [lc] == "PRD" or [fields][lc] == "PRD" {
    if [retention] == "month" {
      mutate {
        add_tag => [ "prod_store_month" ]
        # Added because I was getting a "concrete value" error in the logs
        remove_field => [ "host" ]
      }
    } else {
      mutate {
        add_tag => [ "prod_store_year" ]
        # Added because I was getting a "concrete value" error in the logs
        remove_field => [ "host" ]
      }
    }
  }
}
output {
  # example ...
  if "prod_store_year" in [tags] {
    elasticsearch {
      data_stream => "true"
      data_stream_type => "logs"
      data_stream_dataset => "prd-year"
      hosts => ["http://elasticsearch-master:9200"]
      user => "logstash_internal"
      password => "xxxxx"
    }
  } else if "prod_store_month" in [tags] {
    elasticsearch {
      data_stream => "true"
      data_stream_type => "logs"
      data_stream_dataset => "prd-month"
      hosts => ["http://elasticsearch-master:9200"]
      user => "logstash_internal"
      password => "xxxxx"
    }
  }
}
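As far as I understand it, the data stream name is built as <type>-<dataset>-<namespace>, and since I don't set data_stream_namespace it falls back to the default, so the events above should end up in logs-prd-year-default and logs-prd-month-default. If I wanted to make that explicit I assume I could add this inside each elasticsearch output:

  # making the default namespace explicit (my assumption, not required)
  data_stream_namespace => "default"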
Provided data streams are the recommended approach for my use case, have I set them up correctly and according to best practice?
Many many thanks for taking the time.
Hugo