Hi Elastic team,
We have about 300 pipelines set up, each with a JDBC input that fetches data from different databases landing on a central SQL Server.
All of the JDBC inputs have a schedule set to */5 * * * * (every 5 minutes).
The problem is as follows:
While monitoring the Logstash logs and event rates (sourced from the Logstash API, with a difference function applied to the counters), I noticed the following: when a pipeline has no events in flight (naturally, because the input has no new data), the configured schedule runs regularly. But as soon as the input starts reading data, the scheduled intervals are delayed. Please see the dashboards below for clarification:
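For context, the "difference function" over the API's cumulative counters looks like this (a minimal sketch; the counter values below are invented, not from our cluster):

```python
# Turn cumulative event counters (as returned by the Logstash node stats API)
# into per-interval rates by taking first differences.

def rates(samples, interval_s):
    """First-difference the cumulative counts, divided by the poll interval (seconds)."""
    return [(b - a) / interval_s for a, b in zip(samples, samples[1:])]

# Invented cumulative "events.out" values, polled every 30 s
cumulative_out = [1_000_000, 1_000_000, 1_018_000, 1_036_000]
print(rates(cumulative_out, 30))  # -> [0.0, 600.0, 600.0] events per second
```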
Pipeline events rate history: the blue bar is the sample pipeline being watched, with events in flight
Based on the pictures above, the pipeline's input query is running every 27 minutes (although the schedule is every 5 minutes!). I even checked the query running time in the Logstash logs, and it is just a few seconds. Please see the picture below:
Inspecting the log's original message: the runtime field query_time is extracted from the original log message
What is the actual reason the scheduled JDBC input is delayed (27 minutes), even though the logs show that the query time at the source is normal (a few seconds)?
Could the filter and output plugins in the pipeline be causing the delays in the input? In other words, can queuing between the filter and output stages of the pipeline delay the input stage?
My understanding is that the JDBC input scheduler uses only a single worker thread -- that is, it will not start a new run until the existing run has finished, even if the scheduled time has elapsed.
The JDBC input also cannot push more events into the pipeline while the filter and output are working on a batch -- i.e., the Logstash pipeline (including the input) will block if the output or filter stages are constrained.
If it is taking 27 minutes to process the batch read over JDBC, that would explain the behavior you're seeing.
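A minimal Python sketch of that backpressure effect (this is not Logstash internals, just an illustration assuming a bounded in-flight queue and a slow downstream stage):

```python
# An "input" scheduled every 50 ms pushes events into a bounded queue.
# A slow "filter/output" worker drains it. Because puts block when the queue
# is full, each run overshoots its schedule and the observed interval grows.
import queue
import threading
import time

q = queue.Queue(maxsize=2)        # bounded in-flight queue -> backpressure

def worker():                     # stands in for a slow filter/output stage
    while True:
        q.get()
        time.sleep(0.02)          # 20 ms of processing per event
        q.task_done()

threading.Thread(target=worker, daemon=True).start()

SCHEDULE = 0.05                   # the input is "scheduled" every 50 ms
run_times = []
next_run = time.monotonic()
for _ in range(4):
    # single scheduler thread: a run cannot start before the previous one finished
    time.sleep(max(0.0, next_run - time.monotonic()))
    run_times.append(time.monotonic())
    for _ in range(10):           # the query returns more events than the queue holds
        q.put(None)               # blocks here while downstream is busy
    next_run += SCHEDULE

gaps = [b - a for a, b in zip(run_times, run_times[1:])]
print("observed intervals (s):", [round(g, 2) for g in gaps])
# Each run takes ~0.16 s to hand off its events, so the observed
# interval between runs exceeds the 0.05 s schedule.
```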
Logstash pipeline diagnostics may help you identify where the bottleneck is (likely on the output to Elasticsearch). Tuning the number of pipeline workers or the pipeline batch size might be advisable.
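For reference, those knobs live in logstash.yml (or per pipeline in pipelines.yml). The values below are placeholders to show the syntax, not recommendations:

```yaml
# logstash.yml -- illustrative values only; tune based on what diagnostics show
pipeline.workers: 8        # defaults to the number of CPU cores
pipeline.batch.size: 500   # events per worker batch (default 125)
pipeline.batch.delay: 50   # ms to wait for a batch to fill (default 50)
```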
Thanks for your response. Could setting the jdbc_fetch_size parameter also have an impact on pipeline performance for large result sets? I mean increasing it beyond the default value.
Unfortunately, without knowing more it's hard to say what would help. The pipeline diagnostics would tell you where the pipeline is spending most of its time. Once you know whether the majority of the time is in the input, filter, or output, you can tune the appropriate area for additional performance.
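For reference, jdbc_fetch_size is set on the input itself; it hints to the JDBC driver how many rows to fetch per round trip, not how many rows the query returns. A sketch (the connection details and statement are placeholders):

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://example-host:1433;databaseName=mydb"  # placeholder
    jdbc_user              => "logstash"                                              # placeholder
    jdbc_driver_class      => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    schedule               => "*/5 * * * *"
    statement              => "SELECT * FROM events WHERE id > :sql_last_value"       # placeholder
    use_column_value       => true
    tracking_column        => "id"
    jdbc_fetch_size        => 10000   # rows per driver round trip; default is driver-dependent
  }
}
```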
Are you saying that you have a single Logstash instance with 300+ pipelines defined, each with a separate JDBC input running every 5 minutes?
You show that one of those pipelines is executing every 27 minutes, with a query execution time of a second or two. But your first chart also seems to show it continuously delivering about a million events every 20 minutes. Could it be that the query, although fast, delivers a huge result set, and it takes all that time for those events to be processed?
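A quick sanity check on that theory (the one-million figure is read off the chart, so this is rough arithmetic):

```python
# If each run delivers ~1,000,000 events and runs end up ~27 minutes apart,
# the pipeline is sustaining roughly this many events per second end to end:
events_per_run = 1_000_000
minutes_between_runs = 27
throughput = events_per_run / (minutes_between_runs * 60)
print(f"~{throughput:.0f} events/s")  # ~617 events/s
```

A sustained rate in that range is entirely plausible for a busy filter/output stage, which would make the 27-minute interval a throughput limit rather than a scheduler bug.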
I would be tempted to start by taking JVM thread dumps to see what that input thread is doing.
It's two Logstash instances, each handling a subset of the pipelines. Yes, as I said, the query time was quick, but the delay likely came from the filter and output plugins.