Logstash/Filebeat lag issue - Logs delayed by hours

Just to check I’m understanding: the Kibana screenshot shows 2 Elasticsearch documents with the same @timestamp value. They have slightly later (by about 45 seconds) ls_timestamp values, which were generated as they passed through Logstash. And a message field that looks to be constructed from some date/time value.

So where precisely is the @timestamp coming from? Is Filebeat adding it? And where precisely is the date/time value in message coming from?

I once had an issue where I was confused by the sequencing of timestamps. It turned out the timestamp in the application log file was the time the random mobile device thought it was, and consumer devices are just time-and-date screwed in a myriad of ways!! That meant that “timestamp” had to be (broadly) ignored.

You are spot on. The @timestamp is coming from Filebeat: it records the exact moment Filebeat reads and harvests that specific log line from the file. The date/time in the message field is coming directly from the application itself (our Tomcat and SMPP apps) at the exact millisecond the event occurred and was written to the local log file. The delay you see between the message time, the @timestamp (Filebeat read time), and the ls_timestamp (Logstash processing time) is simply the visual evidence of the processing queue backlog and transit time we were experiencing.
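For reference, ls_timestamp can be stamped at the top of the Logstash filter section with something like this (a sketch; the exact mechanism our pipeline uses is an assumption, this is just one common way to do it):

```
filter {
  # Record the moment Logstash starts processing the event,
  # so it can later be compared against Filebeat's @timestamp
  ruby {
    code => "event.set('ls_timestamp', Time.now.utc.strftime('%Y-%m-%dT%H:%M:%S.%LZ'))"
  }
}
```

The gap between the three timestamps then maps cleanly onto the three stages: write-to-file, Filebeat harvest, Logstash processing.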

The difference between the @timestamp coming from Filebeat and the ls_timestamp generated by Logstash is small, just a few seconds.

In the screenshot you shared you have 2 events, and in both the @timestamp and ls_timestamp are pretty close. But while the time in the message field of one of them is something like 2026-02-24 10:45:45, which is also pretty close, the other has 2026-02-24 06:04:08, a 4-hour difference from the other times.

I think this suggests that the backlog is on the Filebeat side, not on Logstash. As you mentioned your logs rotate pretty fast, this can cause the backlog, if I'm not wrong.

Can you share a screenshot with the same 2 events again, but including the Filebeat offset and path fields?

I would also make a change in your Filebeat output: since you have only one Logstash, remove the other ports, use just one port, and increase the number of workers to something like 8 or 12.
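A sketch of what that output section could look like (host and port are placeholders for your actual Logstash endpoint):

```yaml
output.logstash:
  # Single Logstash endpoint instead of multiple ports on the same host
  hosts: ["logstash-host:5044"]
  # Number of concurrent workers publishing to that endpoint
  worker: 8
```

With a single host, load balancing across ports buys nothing; concurrency comes from the worker count instead.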

Well, as well as all that (and I also thought a delay of 45 seconds wasn’t particularly remarkable), where I was going is that you don’t have evidence of precisely when a specific log entry is written to the log file. And the 2 application timestamps for those 2 entries were just a bit weird. One a few seconds before ls_timestamp and one ages before.

Is there any solution identified for this issue?

We are facing the same issue and not sure where to check.

Hi @Cesar_Mejia -

Could you please confirm whether your Filebeat and Logstash are in the same data center, or in different, far-apart data centers?

Where is Logstash saving the logs?

Do you think this may be a back-pressure issue caused by network latency?

I disabled pipelining and increased the pool to 8 workers. I've noticed the bottleneck consistently occurs with larger files. While most files are only a few KB or MB, others reach up to 100MB, which seems to be causing the slowdown.

They’re in the same data center, and Logstash is saving to a volume group. As for the network latency: I've noticed a pattern where logs aren't missing, but they are arriving with significant latency because Filebeat is struggling to connect to Logstash, even with a timeout of 10000.

image

Sorry it took me a bit. Here’s a capture of another similar case

22M /var/log/smpp89_20260227.log
326M /var/log/smpp72_20260227.log

Something is not OK; maybe LS cannot process and respond fast enough. Have you checked LS CPU utilization?
Have you tried outputting FB to a local file or directly to ES, just as a test?
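For that test, the Filebeat output could temporarily be swapped for something like this (path and filename here are placeholders):

```yaml
output.file:
  # Write events to a local file instead of Logstash,
  # to rule out LS/network as the bottleneck
  path: "/tmp/filebeat-test"
  filename: filebeat-events
```

If events land in the local file at full speed, the problem is downstream of Filebeat.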

Actually your batch size is quite big: pipeline.batch.size: 2048. Set it back to the default value (125).

pipeline.workers: 12 - no need to set this. It defaults to the number of the host's CPU cores.
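Put together, the relevant part of logstash.yml would just be (a sketch; 125 is the documented default batch size):

```yaml
# logstash.yml
pipeline.batch.size: 125
# pipeline.workers: <leave unset; defaults to the number of CPU cores>
```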

Comment out the rubydebug stdout output to improve performance:

#  stdout {
#    codec => rubydebug {
#      metadata => true
#    }
#  }

Hi Rios!!

Just a quick note: Logstash and Elasticsearch share the same server. I limit the workers manually so Logstash doesn't grab all the cores and starve ES. Also, the server CPU is only at 20-30%, so Logstash isn't using its full capacity.

I noticed the backlog issue is specifically happening with the massive files (>200MB).

And yes, I had already commented out the stdout and rubydebug output.

A few more things:

  • You have too many GREEDYDATA patterns in the middle of the message; replace them with DATA, which should be faster
  • Add id => "grok" to the grok filter and id => "elasticsearch" to the output. You can also add one to beats. Check LS statistics with the URL http://localhost:9600/_node/stats/pipelines?pretty to see which plugin consumes the most time
  • Use dissect instead of grok where possible, probably with a few more IFs
  • Check the LS and ES logs; maybe you have a delay or issue somewhere
  • Try to test FB->LS connectivity without anything in the filter section and with large files, just to see if there are any network or settings issues. You can use the LS file output.
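To illustrate the id and dissect points together, a sketch of what the pipeline could look like (the dissect mapping and field names are made up; adapt them to the real message format):

```
filter {
  # dissect is cheaper than grok for fixed-layout lines;
  # this mapping is a hypothetical example, not the real format
  dissect {
    id => "dissect_msg"
    mapping => { "message" => "%{app_time} %{level} %{rest}" }
  }
}
output {
  elasticsearch {
    id => "elasticsearch"
    hosts => ["http://localhost:9200"]
  }
}
```

With ids in place, the per-plugin timings at /_node/stats/pipelines?pretty become much easier to attribute.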

This is not ideal... And though you may think they are not competing for resources, they can be.

Maybe obvious to others, but not to me: what is this image showing? What is the y-axis? Even assuming the x-axis is a timestamp of some sort, since the thread involves a few, which one is it?

does not seem to sit with:

what's the specific evidence that "Filebeat is struggling to connect to Logstash"?

What about rotating the log files so that they don't get so large? Though did you not say earlier in the thread that the log files were rotating rapidly? Are they rotating rapidly (specifically, how rapidly?) AND quickly reaching sizes of 100MB?

It is the amount of logs (the log volume).

The Filebeat log shows this; ping and telnet work:

2026-03-02T09:49:45.335-0600    ERROR   [logstash]      logstash/async.go:280   Failed to publish events caused by: write tcp IP:PORT->LOGSTASHIP:PORT: write: connection reset by peer
2026-03-02T09:49:45.335-0600    INFO    [publisher]     pipeline/retry.go:219   retryer: send unwait signal to consumer
2026-03-02T09:49:45.335-0600    INFO    [publisher]     pipeline/retry.go:223     done
2026-03-02T09:49:45.392-0600    ERROR   [logstash]      logstash/async.go:280   Failed to publish events caused by: client is not connected
2026-03-02T09:49:45.392-0600    INFO    [publisher]     pipeline/retry.go:219   retryer: send unwait signal to consumer
2026-03-02T09:49:45.392-0600    INFO    [publisher]     pipeline/retry.go:223     done
2026-03-02T09:49:45.449-0600    ERROR   [logstash]      logstash/async.go:280   Failed to publish events caused by: client is not connected
2026-03-02T09:49:45.449-0600    INFO    [publisher]     pipeline/retry.go:219   retryer: send unwait signal to consumer
2026-03-02T09:49:45.449-0600    INFO    [publisher]     pipeline/retry.go:223     done
2026-03-02T09:49:46.505-0600    ERROR   [publisher_pipeline_output]     pipeline/output.go:180  failed to publish events: client is not connected
2026-03-02T09:49:46.646-0600    ERROR   [publisher_pipeline_output]     pipeline/output.go:180  failed to publish events: write tcp IP:PORT->LOGSTASHIP:PORT: write: connection reset by peer
2026-03-02T09:49:46.896-0600    ERROR   [publisher_pipeline_output]     pipeline/output.go:180  failed to publish events: client is not connected

It rotates every 300MB.

But what does it tell you/us? If the log volume (from applications) varies over time, so will a view like that in Kibana. That's normal; that bare graph shows nothing really. Remember we have close to no idea of your use case, we know only what you tell us.

To see an actual issue, a discrepancy, you should try to share graphs that highlight/illustrate the issue. e.g. 2 plots of log volume over time, one using say the application timestamp, and the other using say ls_timestamp. Generally, when things are working fine, these will be pretty much identical shapes. In your case they might be quite different.
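A sketch of two such aggregations side by side in Kibana Dev Tools (the index name is a placeholder; swap in whichever field holds each timestamp in your mapping):

```
GET index_reports/_search
{
  "size": 0,
  "aggs": {
    "by_ingest_time": {
      "date_histogram": { "field": "ls_timestamp", "fixed_interval": "5m" }
    },
    "by_read_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "5m" }
    }
  }
}
```

When ingestion keeps up, the two histograms have nearly identical shapes; a backlog shows up as the ingest-time curve lagging and flattening behind the other.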

telnet is a bit of a blast from the past :slight_smile:

Did you try capturing/analysing the traffic with wireshark/tcpdump/similar? Last time I looked at something like this Logstash was sending a RST, but I would be a bit surprised if Logstash would close a busy, active, non-idle TCP connection without logging somewhere why it did so.

Are logstash and elasticsearch sharing the same network interface(s)?

Did you try logstash running elsewhere?


@Cesar_Mejia this was quite an interesting puzzle/thread - any progress/update ?

Yep, thanks for the help. I don't have a graphic to show you, but every day from 9 AM to 12 PM, logs start to disappear. Talking to some coworkers, they say this time window has a lot of traffic and is the period of peak demand during the day. I believe this happens because Elastic is stalling or lacks sufficient resources to handle the load; I am concluding this based on the following log:

[2026-03-04T10:37:18,981][INFO ][logstash.outputs.elasticsearch][main]Retrying failed action {:status=>429, :action=>["index", {:_id=>nil, :_index=>"index_reports", :routing=>nil}, {"host"=>{"name"=>"machine01"}, "event"=>{"original"=>"20260303082500: SOME INFORMATION"},  error=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of primary operation [coordinating_and_primary_bytes=50280130, replica_bytes=862626885, all_bytes=912907015, primary_operation_bytes=180245, max_primary_bytes=825019596]"}}

Yes, concerning. I note your error shows a timestamp of

2026-03-04T10:37:18

but has

"event"=>{"original"=>"20260303082500

which looks a bit like 2026-03-03T08:25:00, i.e. more than a full day before?

Assuming you have enabled monitoring within elasticsearch, you should be able to see clear indications of higher load, and how load varies over time generally, under Stack Monitoring in kibana.
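In Dev Tools, the write thread pool rejections that produce those 429s can also be checked directly with the standard cat API:

```
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected
```

A steadily climbing rejected count during the 9 AM-12 PM window would confirm the indexing-overload theory.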

Can you also detail the storage you have attached to the server, i.e. specifically what type it is? Answers like "we have a RAID5 volume composed of 8 x 2TB locally attached SSD disks", or non-RAID NVMe storage, or older-style spinning/HDD disk(s). And an idea of per-second (or per-anything) data volume.

I have three Elastic instances that Logstash sends data to, with load balancing activated.

We have a locally attached hardware RAID volume (exact RAID level and physical disk count abstracted from the OS by a Dell PERC H730P Mini controller) presenting a 1.8TB virtual disk composed of traditional HDDs:

lsblk -d -o name,rota
NAME ROTA
sda     1

cat /sys/block/sda/device/model
PERC H730P Mini

cat /sys/block/sda/device/vendor
DELL