We're running a single-node ELK stack (Elasticsearch and Logstash) on the same host, each in its own Docker container, both at version 7.3.0.
The system has been stable for the past 5 months, ingesting around 20k logs per minute.
An incident occurred where logs stopped flowing to Elasticsearch. Inspecting the stack metrics, Elasticsearch was green, using only 20% of its JVM heap, sitting idle and not indexing any data, yet it responded to queries with the expected performance.
Checking CPU and memory usage, the entire host CPU was stable at 2%, and memory usage matched the configured heap sizes.
No logs (errors or warnings) appeared from either Elasticsearch or Logstash. I suspected GC pauses, but there weren't any GC logs either.
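If this happens again, one thing worth capturing before restarting is the Logstash monitoring API (it listens on port 9600 by default): `/_node/stats/pipelines` shows event counters and `/_node/hot_threads` shows what the worker threads are doing. Below is a minimal sketch of how two snapshots of the pipeline event counters could be compared to confirm a stall; the snapshot values are made-up illustrations, and `pipeline_stalled` is a hypothetical helper, not part of any Logstash client library.

```python
# Sketch: detect a stalled Logstash pipeline by comparing two snapshots
# of the monitoring API (default: http://localhost:9600/_node/stats/pipelines).
# The "events"/"out" field names follow that API's response shape; the
# snapshot dicts below are fabricated examples, not real captures.

def pipeline_stalled(before: dict, after: dict) -> bool:
    """True if the pipeline emitted no new events between two snapshots."""
    return after["events"]["out"] <= before["events"]["out"]

snap_t0 = {"events": {"in": 1_000_000, "out": 999_500}}
snap_t1 = {"events": {"in": 1_000_000, "out": 999_500}}  # no progress
print(pipeline_stalled(snap_t0, snap_t1))  # True -> pipeline looks stuck
```

Polling those counters every minute and alerting when `out` stops advancing would at least catch the condition faster than waiting for the dashboards to go flat.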
Restarting Logstash got everything flowing again, and Elasticsearch started indexing the millions of accumulated logs.
The other day the same incident happened, so I inspected further: Logstash was still holding a connection to every Filebeat agent shipping logs, yet there was almost no network traffic (just a few KB every now and then). I tried restarting the Filebeat agents that connect to Logstash, but that didn't solve it.
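This matches the limits of a plain TCP liveness check: a connection to the Beats port can be accepted and held open even while the pipeline behind it is stuck, so "the port is open" tells you nothing about whether events are being consumed. A small self-contained sketch of that kind of check (the port numbers here are illustrative; 5044 is only the conventional Beats input port):

```python
import socket
import contextlib

def port_accepting(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connect to host:port succeeds.

    Note: success only means the listener accepted the handshake; a
    stalled Logstash pipeline would still pass this check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Self-contained demo: listen on an ephemeral local port and probe it.
with contextlib.closing(socket.socket()) as srv:
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    host, port = srv.getsockname()
    print(port_accepting(host, port))   # True: listener accepts
print(port_accepting("127.0.0.1", 1))   # False: nothing listening there
```

This is why a health check against the Beats port kept passing during the incident; checking event throughput (as above, via the monitoring API) is the more reliable signal.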
Again, restarting Logstash solved everything.
So, Logstash stops accepting events from the Filebeat agents and sits idle, emitting no logs and no events, while:
- It still holds connections to the Filebeat agents.
- CPU and memory are barely utilized, so it is not a resource problem.
- The host is not throttled on IOPS (it's an EC2 instance).
- Metrics show a sudden drop in events.
- No logs indicating any problem.
- The pipeline queue type is in-memory (the default).
- Restarting solves everything.
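One change I'm considering as a mitigation (an assumption on my part, not a confirmed fix for this hang) is switching the queue from in-memory to persisted in `logstash.yml`, so that accumulated events at least survive a restart and back-pressure behavior becomes observable on disk. The path and size below are illustrative:

```yaml
# logstash.yml -- queue settings (values here are illustrative)
queue.type: persisted                        # default is "memory"
path.queue: /usr/share/logstash/data/queue   # must be writable inside the container
queue.max_bytes: 1gb                         # cap on on-disk queue size
```

Has anyone seen Logstash 7.x silently stall like this, and is there a known fix short of scheduled restarts?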