We have a busy production set-up generating many thousands of log lines, which Filebeat forwards to Kafka before they reach Logstash and Elasticsearch.
We were paged last week at the end of the business day (when load normally subsides) due to high CPU. On investigation, we found that two of our servers followed the same pattern.
At about the start of the business day, @timestamp and timestamp (ingestion time and generation time) started to diverge. This is unusual, and it is the only time we have seen it. Then, at about 8pm, CPU spiked, the @timestamp values arriving in ELK jumped, and the file space used on the two servers dropped. After that the two timestamps converged again, so all good, and we have not seen it since.
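For context, the divergence shows up as a growing gap between the two date fields. Something like the sketch below is roughly how that lag can be charted per hour; the index pattern, field names, and the Painless expression are assumptions about our mapping, not anything prescribed by Elastic:

```python
# Sketch: average ingest lag (@timestamp - timestamp) per hour from Elasticsearch.
# Assumptions: index pattern "app-logs-*", both fields mapped as `date`,
# and an Elasticsearch 7.x+ endpoint reachable at ES_URL.
import requests

ES_URL = "http://localhost:9200"   # assumed endpoint
INDEX = "app-logs-*"               # assumed index pattern

query = {
    "size": 0,
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
            "aggs": {
                "avg_lag_ms": {
                    "avg": {
                        "script": {
                            "source": (
                                "doc['@timestamp'].value.toInstant().toEpochMilli()"
                                " - doc['timestamp'].value.toInstant().toEpochMilli()"
                            )
                        }
                    }
                }
            }
        }
    }
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["per_hour"]["buckets"]:
    lag = bucket["avg_lag_ms"]["value"] or 0
    print(bucket["key_as_string"], f"{lag / 1000:.1f}s behind")
```

On the day in question, a chart like that stayed flat until the morning, climbed through the day, and collapsed back to zero after the 8pm spike.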
My question relates to Filebeat. Given that Kafka is in the mix, is there a reason Filebeat appears to have slowed down and then played catch-up?
This sounds like backpressure, but the solution architects we spoke to think Kafka should have buffered the data and Filebeat should have kept working steadily.
Is the backpressure mechanism really just TCP congestion control, so that a busy Kafka broker could have the same effect?
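To make sure I'm asking the right question: my mental model is a bounded queue between the harvester side and the output side, so that when the output slows, the producer stalls and later catches up. The toy sketch below is not Filebeat code, just an illustration of that bounded-queue behaviour under my assumptions:

```python
# Toy model of the behaviour I'm asking about: a bounded in-memory queue between
# a fast producer (the harvester side) and a periodically slow consumer (the
# Kafka output side). NOT Filebeat code, purely an illustration of how a full
# queue makes the producer fall behind and then play catch-up.
import queue
import threading
import time

events = queue.Queue(maxsize=100)   # bounded, like an in-memory event queue
produced = consumed = 0
TOTAL = 2000

def producer():
    global produced
    for i in range(TOTAL):
        events.put(i)                # blocks when the queue is full -> backpressure
        produced += 1

def consumer():
    global consumed
    while consumed < TOTAL:
        events.get()
        consumed += 1
        # simulate a slow broker/output for the first half of the run
        time.sleep(0.002 if consumed < TOTAL // 2 else 0.0)

threading.Thread(target=consumer, daemon=True).start()
t = threading.Thread(target=producer)
t.start()
while t.is_alive():
    print(f"produced={produced:5d} consumed={consumed:5d} backlog={produced - consumed}")
    time.sleep(0.5)
t.join()
```

In that model the producer only stalls because the bounded queue is full, not because of anything at the TCP layer. Is it that kind of queue filling up, rather than TCP congestion control itself, that would make Filebeat slow down against a busy Kafka?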
Thank you