Just to check I’m understanding: the Kibana screenshot shows 2 Elasticsearch documents with the same @timestamp value. They have slightly later (by about 45 seconds) ls_timestamp values, which were generated as they passed through Logstash. And a message field that looks to be constructed from some date/time value.
So where precisely is the @timestamp coming from? Is Filebeat adding it? And where precisely is the date/time value in message coming from?
I once had an issue where I was confused by the sequencing of timestamps. It turned out the timestamp in the application log file was the time the random mobile device thought it was, and consumer devices are just time-and-date screwed in a myriad of ways!! That meant that “timestamp” had to be (broadly) ignored.
You are spot on. The @timestamp is coming from Filebeat; it records the exact moment Filebeat reads and harvests that specific log line from the file. The date/time in the message field is coming directly from the application itself (our Tomcat and SMPP apps) at the exact millisecond the event occurred and was written to the local log file. The delay you see between the message time, the @timestamp (Filebeat read time), and the ls_timestamp (Logstash processing time) is simply the visual evidence of the processing queue backlog and transit time we were experiencing.
The difference between the @timestamp coming from Filebeat and the ls_timestamp generated by Logstash is small, just a few seconds.
In the screenshot you shared there are 2 events, and in both the @timestamp and ls_timestamp are pretty close. But in the message field, one event has a timestamp of around 2026-02-24 10:45:45, which is also pretty close to the other two, while the other has 2026-02-24 06:04:08, which is roughly 4 hours earlier than the other times.
I think this suggests that the backlog is on the Filebeat side, not on Logstash. As you mentioned your logs rotate pretty fast, this can cause such a backlog, if I'm not wrong.
Can you share a screenshot with the same 2 events again, but include the Filebeat offset and path fields?
I would make a change in your Filebeat output: since you have only one Logstash, remove the other ports, use just one port, and increase the number of workers to something like 8 or 12.
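For reference, a minimal sketch of what that could look like in filebeat.yml (the host and port below are placeholders for your single Logstash endpoint):

```yaml
output.logstash:
  # single Logstash endpoint -- replace host/port with your own
  hosts: ["logstash.example.internal:5044"]
  # number of parallel connections to that one Logstash instance
  worker: 8
```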
Well, as well as all that (and I also thought a delay of 45 seconds wasn’t particularly remarkable), where I was going is that you don’t have evidence of precisely when a specific log entry is written to the log file. And the 2 application timestamps for those 2 entries were just a bit weird: one a few seconds before ls_timestamp and one ages before.
I disabled pipelining and increased the pool to 8 workers. I've noticed the bottleneck consistently occurs with larger files. While most files are only a few KB or MB, others reach up to 100MB, which seems to be causing the slowdown.
They’re in the same data center, and Logstash is saving to a volume group. As for the network latency: I've noticed a pattern where logs aren't missing, but they are arriving with significant latency because Filebeat is struggling to connect to Logstash, even with a timeout of 10000.
Something is not OK; maybe LS cannot process and respond fast enough. Have you checked LS CPU utilization?
Have you tried outputting from FB to a local file, or directly to ES, just as a test?
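If it helps, a quick way to run that test is Filebeat's file output (the path and filename below are just examples); note that only one output can be enabled at a time, so comment out output.logstash while testing:

```yaml
output.file:
  # events get written as JSON lines under this directory
  path: "/tmp/filebeat-test"
  filename: "filebeat-events"
```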
Actually your batch size is quite big: pipeline.batch.size: 2048. Set it back to the default value.
pipeline.workers: 12 - no need to set this; it defaults to the number of the host's CPU cores.
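In logstash.yml terms, the two suggestions above would be roughly (the default batch size of 125 is from the Logstash docs):

```yaml
# back to the default batch size
pipeline.batch.size: 125
# pipeline.workers: <leave unset> -- defaults to the number of CPU cores
```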
Just a quick note: Logstash and Elasticsearch share the same server. I limit the workers manually so Logstash doesn't grab all the cores and starve ES. Also, the server CPU is only at 20-30%, so Logstash isn't using its full capacity.
I noticed the backlog issue is specifically happening with the massive files (>200MB).
And yes, I already commented out stdout and the rubydebug
You have too many GREEDYDATA patterns in the middle of the message pattern; replace them with DATA, it should be faster.
Add id => "grok" to the grok filter and id => "elasticsearch" to the output. You can also add one to the beats input. Then check the LS statistics at http://localhost:9600/_node/stats/pipelines?pretty to see which plugin consumes the most time.
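Putting those two suggestions together, a sketch of what that could look like (the grok pattern and hosts are illustrative, not your actual config):

```conf
filter {
  grok {
    id => "grok"
    # DATA for fields in the middle; keep GREEDYDATA only for the last field
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{DATA:thread} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}
output {
  elasticsearch {
    id => "elasticsearch"
    hosts => ["http://localhost:9200"]   # placeholder
  }
}
```

With the ids set, the per-plugin timings in the node stats output become much easier to attribute.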
Use dissect instead of grok where possible, probably with a few more ifs.
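For example, if the log lines have a fixed delimiter layout, something like this (field names are illustrative) avoids regex matching entirely:

```conf
filter {
  dissect {
    id => "dissect"
    # splits on the literal spaces -- only works when the layout is fixed;
    # the last field captures the remainder of the line
    mapping => { "message" => "%{ts} %{thread} %{level} %{msg}" }
  }
}
```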
Check the LS and ES logs; maybe you have a delay or an issue somewhere.
Try to test FB->LS connectivity with nothing in the filter section and with large files, just to see whether there is any network or settings issue. You can use the LS file output.
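A minimal pass-through pipeline for that test could look like this (port and path are placeholders):

```conf
input {
  beats { port => 5044 }
}
# no filter section on purpose -- we only want to measure transport
output {
  file { path => "/tmp/ls-passthrough.log" }
}
```

If the backlog disappears with this pipeline, the bottleneck is in the filters or the ES output, not the network.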
Maybe obvious to others, but not to me: what is this image showing? What is the y-axis? And, assuming the x-axis is a timestamp of some sort, since the thread involves a few, which one is it?
does not seem to sit with:
what's the specific evidence that "Filebeat is struggling to connect to Logstash"?
What about rotating the log files so that they don't get so large? Though did you not say earlier in the thread the log files were rotating rapidly? Is that rotating rapidly (specifically, how rapidly?) AND quickly reaching sizes of 100MB?
```
2026-03-02T09:49:45.335-0600 ERROR [logstash] logstash/async.go:280 Failed to publish events caused by: write tcp IP:PORT->LOGSTASHIP:PORT: write: connection reset by peer
2026-03-02T09:49:45.335-0600 INFO [publisher] pipeline/retry.go:219 retryer: send unwait signal to consumer
2026-03-02T09:49:45.335-0600 INFO [publisher] pipeline/retry.go:223 done
2026-03-02T09:49:45.392-0600 ERROR [logstash] logstash/async.go:280 Failed to publish events caused by: client is not connected
2026-03-02T09:49:45.392-0600 INFO [publisher] pipeline/retry.go:219 retryer: send unwait signal to consumer
2026-03-02T09:49:45.392-0600 INFO [publisher] pipeline/retry.go:223 done
2026-03-02T09:49:45.449-0600 ERROR [logstash] logstash/async.go:280 Failed to publish events caused by: client is not connected
2026-03-02T09:49:45.449-0600 INFO [publisher] pipeline/retry.go:219 retryer: send unwait signal to consumer
2026-03-02T09:49:45.449-0600 INFO [publisher] pipeline/retry.go:223 done
2026-03-02T09:49:46.505-0600 ERROR [publisher_pipeline_output] pipeline/output.go:180 failed to publish events: client is not connected
2026-03-02T09:49:46.646-0600 ERROR [publisher_pipeline_output] pipeline/output.go:180 failed to publish events: write tcp IP:PORT->LOGSTASHIP:PORT: write: connection reset by peer
2026-03-02T09:49:46.896-0600 ERROR [publisher_pipeline_output] pipeline/output.go:180 failed to publish events: client is not connected
```
But what does it tell you/us? if the log volume (from applications) varies over time, so will a view like that in kibana. That's normal, that bare graph shows nothing really. Remember we have close to no idea on your use case, we know only what you tell us.
To see an actual issue, a discrepancy, you should try to share graphs that highlight/illustrate the issue. e.g. 2 plots of log volume over time, one using say the application timestamp, and the other using say ls_timestamp. Generally, when things are working fine, these will be pretty much identical shapes. In your case they might be quite different.
telnet is a bit of a blast from the past
Did you try capturing/analysing the traffic with wireshark/tcpdump/similar? Last time I looked at something like this, Logstash was sending a RST, but I would be a bit surprised if Logstash would close a busy, active, non-idle TCP connection without logging somewhere why it did so.
Are logstash and elasticsearch sharing the same network interface(s)?
Yep, thanks for the help. I don't have a graph to show you, but every day from 9 AM to 12 PM, logs start to disappear. Talking to some coworkers, they say this time window has a lot of traffic and is the period of peak demand during the day. I believe this happens because Elastic is stalling or lacks sufficient resources to handle the load; I am concluding this based on the following log:
Yes, concerning. I note your error shows a timestamp of
2026-03-04T10:37:18
but has
"event"=>{"original"=>"20260303082500
which looks a bit like 2026-03-03T08:25:00, i.e. a bit over a full day before?
Assuming you have enabled monitoring within elasticsearch, you should be able to see clear indications of higher load, and how load varies over time generally, under Stack Monitoring in kibana.
Can you also detail the storage you have attached to the server, i.e. specifically what type it is? Answers like "we have a RAID5 volume composed of 8 x 2TB locally attached SSD disks", or "non-RAID NVMe storage", or "older-style spinning/HDD disk(s)". And an idea of the per-second (or per-anything) data volume.
We have a locally attached hardware RAID volume (exact RAID level and physical disk count abstracted from the OS by a Dell PERC H730P Mini controller) presenting a 1.8TB virtual disk composed of traditional HDDs:
```
$ lsblk -d -o name,rota
NAME ROTA
sda     1

$ cat /sys/block/sda/device/model
PERC H730P Mini

$ cat /sys/block/sda/device/vendor
DELL
```