We recently upgraded our ELK stack to 7.2.0 and started using ILM.
Since adopting ILM we have run into multiple issues, which often originate from a single trigger.
The architecture is quite simple: Beats send to Kafka, and Logstash reads from Kafka, processes the events and indexes them into a Hot-Warm-Cold cluster.
The flow is this:
- The number of logs increases drastically due to an error in the system
- The volume is too high, and Elasticsearch goes read-only because the disks fill up
- Lag builds up
And now the chain of issues begins:
- The operator tries to move indexes, but they are read-only (the same happens with Curator)
- If the operator deletes an index, ILM gets stuck; this is due to index auto-creation
- The lag is big, and the event's timestamp is not taken into account when writing to the index; that is, events are written to whichever index is active at indexing time, not to the index matching the event's timestamp (unlike without ILM)
- Since the lag is being consumed and the phase transition is time-based (the move to warm, not the initial rollover), the hot servers become congested yet again, as they take all of the lag
- We cannot explain why yet, but in those cases we also see indexes grow far beyond the ILM settings, e.g. the initial rollover is set to 100G and the index grows to 1T (500G primary)
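For context on the read-only state in the list above: as far as I understand, when a node crosses the flood-stage disk watermark, Elasticsearch sets `index.blocks.read_only_allow_delete` on the affected indices, and on 7.2 that block has to be removed explicitly once space has been freed. Below is a minimal sketch of that reset in Python, using only the standard library. The cluster address and the function names are placeholders for illustration, not our real setup.

```python
import json
import urllib.request

# Assumption: placeholder address, replace with a real node of the cluster.
ES_URL = "http://localhost:9200"


def build_unblock_request(index_pattern="_all", es_url=ES_URL):
    """Build (but do not send) the settings update that clears the block.

    Setting the value to null (None in Python) resets the setting.
    """
    body = json.dumps({"index.blocks.read_only_allow_delete": None}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def clear_read_only_block(index_pattern="_all", es_url=ES_URL):
    """Send the request and return the parsed JSON response."""
    request = build_unblock_request(index_pattern, es_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling `clear_read_only_block("logs-*")` (a hypothetical index pattern) would only help after disk space has actually been freed; otherwise I would expect the block to be applied again.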
So my questions are:
- Is there a recommendation I've missed to disable index auto-creation when using ILM? If so, since ILM is configured on the Logstash side, I assume that doing it properly means I will have to rely on a naming convention (?)
- Is there a way (or a plan) to add more criteria to the rollover, based on disk space, when moving from Hot to Warm or Cold? We run on physical servers, so I need the retention to be flexible, not the disk space
- Is there a way to tell Logstash, with ILM, to take the event's timestamp into account, or is it intentionally ignored?
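To illustrate the last question: without ILM, the index name is derived from the event's own timestamp (the `%{+YYYY.MM.dd}` date pattern in the Logstash index setting), so late events still land in the index for the day they occurred. With ILM, writes go to whatever index is currently active, regardless of the event time. A minimal Python sketch of the timestamp-based behaviour I am after; the `logs` prefix and the function name are hypothetical, and this is not how Logstash itself implements it:

```python
from datetime import datetime, timezone


def index_for_event(event_timestamp: str, prefix: str = "logs") -> str:
    """Return a daily index name based on the event's timestamp, in UTC."""
    # fromisoformat() in older Python versions does not accept a trailing "Z",
    # so normalise it to an explicit UTC offset first.
    ts = datetime.fromisoformat(event_timestamp.replace("Z", "+00:00"))
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d}"


# An event produced just before midnight UTC on 1 July, but indexed hours
# later because of the lag, would still map to the index for 1 July.
late_event_index = index_for_event("2019-07-01T23:59:59Z")
```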
Thanks in advance